
Sentiment and Emotion Classification of Transcribed Call Center Phone Calls

Elvira Slaghekke (S3209695)
Master thesis Information Science
February 26, 2021


A B S T R A C T

For my master's thesis, I performed sentiment analysis on call center phone calls. I investigated which sentiments and emotions are related to different topics or words in these conversations. I used a data set of transcribed Dutch call center conversations, comprising 4379 utterances. I built two classifier models that predict the sentiment and emotion labels of the utterances. To reach the best performing models, I experimented with different n-gram, label and lexicon-based features. The sentiment classifier reached an accuracy of 0.877. The emotion classifier scored an accuracy of 0.694. Both classifiers outperform the majority baseline. In addition to predicting the sentiment and emotion labels, I analysed the most informative features per class to get an idea of which words are important indicators for the different classes. Unfortunately, I could not draw clear conclusions from these features. The biggest limitation of this research is the relatively small data set with a skewed distribution of the classes.


C O N T E N T S

Abstract
Preface
1 introduction
2 background
3 data
  3.1 Collection
  3.2 Summary
  3.3 Splitting
4 method
  4.1 Algorithm
  4.2 Vectorizer
  4.3 Data processing
  4.4 Features
  4.5 Hyperparameters
  4.6 Fasttext
  4.7 Evaluation
5 results & discussion
  5.1 Sentiment classification
  5.2 Emotion classification
6 conclusion
  6.1 Answering the research questions
  6.2 Limitations and recommendations for future research
a detailed results
  a.1 N-gram feature results
  a.2 Lexicon & label feature results
  a.3 Hyperparameter results
  a.4 Fasttext optimal parameter values


P R E F A C E

Before you lies my Master's Thesis Sentiment and Emotion Classification of Transcribed Call Center Phone Calls. I've written this thesis to fulfill the graduation requirements of the Master Communication & Information Studies (Information Science) at the University of Groningen.

I conducted this research in collaboration with Datanext, the company I work for. I would like to thank all my colleagues at Datanext for their help, mental support and good company during the writing process of my thesis. I would like to express special thanks to my supervisor Gosse Bouma for his good feedback and guidance throughout the process. The completion of this thesis marks the end of my time as a student here in Groningen, a time which I will look back on with nostalgia, where I learned a lot and had a lot of fun.

Elvira Slaghekke

Groningen, February 26th, 2021


1

I N T R O D U C T I O N

Call centers are often the main channel through which corporations communicate with their customers. Agents interact with customers on behalf of an organization, playing a key role in maintaining brand reputation and customer experience. Understanding customer needs, opinions, and expectations is essential for customer-centric enterprises. Moreover, as customer satisfaction is strongly correlated with profitability (Homburg and Giering, 2001), improving customer satisfaction is an important step towards increasing profits. Therefore, organizations strive to develop techniques and tools to help them identify issues that bother their customers.

Unlike measuring profit or number of sales, it is very hard to objectively measure customer satisfaction. Most call centers conduct a manual survey with a small group of customers to measure customer satisfaction. These surveys are often very limited. The survey size is typically small, so the conclusions drawn from the survey are not very reliable. Moreover, surveys are typically conducted after a campaign is finished. Therefore, it is often too late to change the approach if the results are disappointing.

Sentiment analysis is a method that could contribute to measuring customer satisfaction. Sentiment analysis is a sub-field of Natural Language Processing (NLP) which refers to the inference of people's views, positions and attitudes from their written or spoken texts. Nowadays, the field is rapidly evolving due to the rise of platforms with user generated content such as blogs, social media and online reviews. A lot of work exists on the analysis of sentiment on social media platforms such as Twitter, but sentiment analysis is also well suited to commercial purposes. By analyzing its call center conversations, a corporation could identify and approach dissatisfied customers, assess their attitudes towards services and products and identify problems at an early stage to resolve them and improve customer experience (Katz et al., 2015).

For my master thesis, I am going to perform sentiment analysis on call center conversations. In previous research in this field, the only goal is often to measure customer satisfaction. In this research, I want to focus on analysing the results instead of only predicting sentiment and emotion. I want to know where it goes wrong, instead of only whether it goes wrong. Therefore, I want to investigate which sentiments and emotions are related to different topics or words in these conversations. With this information, the call center could, for example, change the scripts that the agents use as a guideline for the call. Words or topics that are associated with negative emotions could be avoided and topics that are associated with positive emotions could be emphasized. This could serve the ultimate goal of increasing customer satisfaction and eventually the number of sales.

Because I want to know which topics are associated with different sentiments and emotions, and not intonation for instance, I am only going to use the textual features of the phone calls. A condition for this is that the conversations must be transcribed. The data set that I use consists of transcribed Dutch call center phone calls. These are outbound calls, where the goal is to convince a customer to buy a newspaper subscription or to donate money to a good cause. Expressed sentiment or emotion can change during a call. So, instead of classifying a call as a whole, the sentiment and emotion of each utterance in the conversation will be predicted.


With this research, I will attempt to answer the following two research questions:

1. How can we predict sentiment and emotion of utterances from transcribed call center conversations using machine learning?

2. Which words or topics are most indicative of positive emotions, negative emotions and more fine-grained emotion categories?

In order to answer the first question, I will build two classifier models: one for predicting the sentiment label, and one for predicting the emotion label of the utterances in the phone calls. To answer question two, I will investigate which features are most indicative of the different sentiment and emotion classes.

Before setting up the experiments, it is useful to know a bit more about the theoretical background. This is presented in chapter 2. Then a description of the data is provided in chapter 3. Chapter 4 gives a detailed overview of the method that I used to run my experiments. Next, the results are presented and discussed in chapter 5. Finally, chapter 6 provides a conclusion where I answer the research questions, discuss limitations and provide recommendations for future research.


2

B A C K G R O U N D

Sentiment analysis

Sentiment analysis is the automatic process of mining opinions, emotions, views and attitudes from speech and text using Natural Language Processing. A sub-field of this problem, called polarity classification, aims at separating texts into positive, negative and neutral documents by exploiting certain syntactic and linguistic features. Over the years, the research on this topic has evolved a lot. These classification tasks can be performed on different levels of granularity.

Kharde et al. (2016) describe four levels of granularity regarding sentiment analysis: document level, sentence level, feature level, and word level.

Unsupervised techniques

Different techniques for sentiment classification can be divided into two categories: supervised approaches and unsupervised approaches. A traditional way to perform unsupervised sentiment analysis is the lexicon-based method. These methods employ a sentiment lexicon to determine the overall sentiment polarity of a document (Hu et al., 2013).

Wilson et al. (2005) present an unsupervised approach to phrase-level sentiment analysis that first determines whether an expression is neutral or polar and then disambiguates the polarity of the polar expressions. For their experiments, they created a polarity subjectivity lexicon containing over 8,000 words.

A slightly different approach is taken by Turney (2002), who used a technique to classify reviews as recommended or not recommended based on semantic information of phrases containing an adjective or adverb. He computes the semantic orientation of a phrase as the mutual information of the phrase with the word 'excellent' minus the mutual information of the same phrase with the word 'poor'. From the individual semantic orientations of the phrases, an average semantic orientation of a review is computed. A review is classified as recommended if the average semantic orientation is positive, and as not recommended otherwise.
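For reference, a sketch of Turney's measure in formula form, with PMI the pointwise mutual information estimated from co-occurrence counts:

$$\mathrm{SO}(\textit{phrase}) = \mathrm{PMI}(\textit{phrase},\ \text{excellent}) - \mathrm{PMI}(\textit{phrase},\ \text{poor}), \qquad \mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}$$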

Lexicon-based sentiment analysis comes with some limitations. First of all, for shorter documents or phrase-level classification, it can be hard to measure overall sentiment as the system has less information to base the decision on. Second, it is difficult to define a universally optimal sentiment lexicon that covers words from different domains. Users may use different words to express their opinions in distinct domains, and people often create new expressions, especially on social media (Hu et al., 2013).

Supervised techniques

The supervised approaches use machine learning and involve training a sentiment classifier. The classifier needs labelled training examples and a series of feature vectors. The selection of features is crucial to the success rate of the classification. Most commonly, a variety of unigrams (single words from a document) or n-grams (two or more consecutive words from a document) are chosen as feature vectors. Support Vector Machines (SVMs) and the Naive Bayes algorithm are the most commonly employed classification techniques (Annett and Kondrak, 2008).

Pang et al. (2002) have shown that supervised approaches can outperform unsupervised approaches in performance. They employed machine learning techniques to determine whether a written movie review is positive or negative. They experimented with three algorithms: Naïve Bayes, Maximum Entropy and SVM. A variety of features and parameters were used, such as unigrams vs. bigrams and feature frequency vs. feature presence. They achieved the highest performance with the SVM classifier and only unigrams as features.

Wang (2015) performed sentiment analysis on Yelp user reviews. They used binary classification, treating a star rating of 4 or 5 as positive and a rating of 1, 2 or 3 stars as negative. Three different algorithms were implemented to predict sentiment: perceptron learning, Naïve Bayes and SVM. For features, they used word unigrams, bigrams and character 5-grams. Their scores improved by using stemming and removing stop words and common symbols as text processing methods.

Emotion classification

Some years ago, a more refined version of sentiment analysis was proposed, which aims at a more detailed categorization of documents based on the emotions they express. This version is more challenging as it comes with a bigger set of classes that the documents can be assigned to. Both versions of the problem (sentiment and emotion classification) follow similar approaches.

Mac Kim et al. (2010) conducted experiments with different lexicon-based approaches to detect four primary emotions, i.e., anger, fear, joy, and sadness, from various text sources ranging from fairy tales to news headlines. They concluded that the best performing approach differed based on the type of data used.

Strapparava and Mihalcea (2008) used four different unsupervised techniques and a supervised technique for classifying news headlines with the emotion labels anger, disgust, fear, joy, sadness and surprise. The unsupervised approaches are based on the WordNet Affect lexicon (Strapparava et al., 2004). The supervised approach uses a Naïve Bayes classifier, trained on a corpus of blog entries from LiveJournal.com. The blog posts were annotated with moods that were mapped to the six emotions used in the classification. The Naïve Bayes classifier performed best for the emotions joy and anger, which had the largest number of blog posts in the training data set. For all the other emotions, the best performance was obtained with the unsupervised models.

Neural networks

Recently, the use of neural networks or deep learning for sentiment analysis has grown a lot in popularity. Neural networks are systems with interconnected nodes that work much like neurons in the human brain. The processing ability of the network is stored in the inter-node connection strengths, or weights, obtained by a process of learning from a set of training patterns (Yaeger, 2017).

Munikar et al. (2019) propose a deep learning approach to sentiment analysis using BERT, a pretrained language model based on the Transformer architecture. In their paper, they use the pretrained BERT model and fine-tune it for fine-grained sentiment classification on the Stanford Sentiment Treebank (SST) data set. Their model was able to outperform traditional classification models.

Recent deep learning approaches often make use of pretrained word embeddings. These methods need only a small amount of additional training data to fine-tune the model. This is an advantage compared to the traditional classifier systems, as it is often expensive and time consuming to acquire large amounts of labelled training data.

However, neural networks also come with some limitations. A big disadvantage is that the results that neural networks produce are derived from complex mathematics. Therefore, it is almost impossible to understand what the system based its decisions on.

Sentiment analysis on call center conversations

A lot of research on sentiment analysis uses online reviews, because many of them are available and they always contain an opinion. Tweets are also frequently used for sentiment classification. These texts are typically shorter and emotions are expressed more subtly in them than in reviews, which makes the task more challenging. Fewer studies have been carried out using call center conversations. These make the task even harder, as spoken language is less structured than written language. For this reason, many of the studies in this area focus on non-textual aspects of the call such as rhythm, intonation and loudness (Fernandez, 2004; Ververidis et al., 2004).

Only a few studies have tried to identify sentiment in call center conversations by using features extracted from the text itself. One of these studies, by Park and Gates (2009), uses a combination of textual and non-textual features. The textual features included the existence of lexicon terms in the text as well as the identification of competitor names. The non-textual features included the analysis of pause lengths in the conversation, the talking speed of the customer and the relative number of words spoken by each speaker throughout the conversation.

Kanchinadam et al. (2021) developed a graph neural network (GNN) system for predicting customer satisfaction following incoming phone calls. The system takes as input speech-to-text transcriptions of calls and predicts the satisfaction reported by customers on post-call surveys (scale from 1 to 10). The researchers concluded that this approach yields more accurate satisfaction predictions than standard regression and ranking models. The system has since been implemented in a production pipeline of a large US company and is currently predicting caller satisfaction for approximately 30,000 incoming calls each business day. Generated reports are read by managers and decision makers in the company to potentially improve processes that impact daily customer interactions.

In this study, I will perform classification at the sentence level, using a combination of supervised and unsupervised techniques. I will train the classifiers on different textual features, including features based on sentiment and emotion lexicons. In contrast to most earlier work on sentiment analysis of call center conversations, which uses the audio of the calls, I am only going to use textual features. This way, the focus will be on which words or topics are related to certain sentiments or emotions, rather than on the way of talking, for instance. My research also differs from previous work in that I will be using Dutch conversations. The majority of research in this field has been done on English data, while for Dutch conversations there is still a lot to discover. Because an analysis of the results is an important part of my research, I am primarily going to focus on the traditional classification techniques. However, I think it will also be interesting to see how a neural network performs on this task. Therefore, I am also going to carry out some experiments using Fasttext, an open-source library that allows easy implementation of a neural text classifier.


3

D A T A

3.1 collection

For this research, a data set was created from Dutch call center conversations. These are outbound phone calls, which means that clients are contacted by the call center agents. The purpose of the conversations is to convince a customer to buy a newspaper subscription or to donate money to a good cause. The creation of the data set was done in three steps. First, the phone calls were divided into short utterances. This was done by splitting up the recordings wherever a silence of at least half a second occurred. In the second step, these recording chunks were transcribed using a speech-to-text system. In the last step, the sentences were corrected and annotated with six different labels. Chunks that included utterances from two different persons were split up. The annotators corrected mistakes in the transcripts and labeled the sentences with the following labels:

1. Speaker: Indicates whether the utterance is said by the agent or client. values: [agent (a) / client (k)]

2. Gender: Indicates whether the speaker is male or female. values: [male (m) / female (f)]

3. Sentiment: Represents the sentiment that is expressed by the utterance. values: [positive (1) / neutral (0) / negative (-1)]

4. Emotion: Represents the emotion that is expressed by the utterance.

values: [ratio / expectation / joy / trust / love / surprise / shame / sadness / fear / anger / disgust]

5. Journey Step: Represents the state that the client is in during this utterance. values: [see / do / think / feel / want]

6. Result: Represents the result of the call.

values: [sale (s) / no sale (ns) / cancellation (a) / unknown (u)]

The emotion labels were chosen based on Plutchik's Wheel of Emotions. American psychologist Robert Plutchik proposed that there are eight primary emotions that serve as the foundation for all others, namely joy, anger, fear, trust, disgust, surprise, sadness and expectation (Plutchik, 1980). We added the label 'ratio' for labeling utterances that express no emotion. The emotions 'shame' and 'love' were also added because we think these are specific emotional states that are not represented well by Plutchik's eight basic emotions, but which do occur regularly in these call center conversations.

Tables 1 and 2 show an example of an utterance from the data set with the corresponding labels.


Table 1: Example sentence from the data set

Id  Transcript
1   dag mevrouw de wit u spreekt met els janssen namens de leprastichting

Table 2: Labels for example sentence

Id  Speaker  Gender  Sentiment  Emotion  Journey Step  Result
1   a        f       0          ratio    see           s

3.2 summary

The final data set contains 4379 utterances from 65 different calls. The sentences have an average length of 13 words. Tables 3 and 4 show the distributions of the sentiment and emotion labels. As you can see, the distributions of both labels are quite skewed. This will be a challenge for the classification systems, but it does reflect a natural situation.

Table 3: Distribution of sentiment labels

Sentiment label  Occurrences  Percentage
neutral (0)      3064         70%
positive (1)     1057         24%
negative (-1)    257          6%

As some emotions are perceived as more positive and some as more negative, I expected there would be a correlation between the emotion and sentiment labels in the data set. To visualize this, I made a matrix of the co-occurrences of the two labels (see Figure 1). The matrix contains percentages which indicate the share of sentences with that emotion/sentiment combination over the total number of sentences with that emotion.

The matrix clearly shows that some emotions occur mostly with positive sentiment like love, joy and trust. The emotions fear, sadness, anger and disgust on the other hand occur mostly with negative sentiment. The remaining emotions occur mostly with neutral sentiment. This correlation implies that the sentiment label could be a good indicator for detecting emotion and vice versa.

3.3 splitting

The data set was divided into three separate sets: one for training, one for development and one for testing. The sets contain 80%, 10% and 10% of the sentences respectively. The splitting of the data set was done in a stratified way. This means that all three sets have the same distribution of the sentiment and emotion labels as the original data set.


Table 4: Distribution of emotion labels

Emotion label  Occurrences  Percentage
ratio          2633         60%
expectation    526          12%
joy            439          10%
trust          340          8%
love           100          2%
surprise       75           2%
shame          74           2%
sadness        60           1%
fear           50           1%
anger          48           1%
disgust        33           1%


4

M E T H O D

In this chapter, I will report on the method I used to run my experiments. To answer my research questions, I built two classifier models using scikit-learn (https://scikit-learn.org/stable/), a Python library for machine learning. I started out with two very basic classifiers, one for predicting sentiment and one for predicting emotion. The basic models only use word unigrams vectorized by the CountVectorizer as features. For both models, I experimented with different algorithms, vectorizers, features and hyperparameters to reach the best performance. All scores reported in this section were obtained on the development data.
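As an illustration, a minimal sketch of such a basic setup with scikit-learn; the toy utterances, labels and the choice of LinearSVC here are only placeholders for the data and the algorithms compared below, not the thesis code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder utterances and gold sentiment labels; in the thesis these come
# from the annotated call center data set described in chapter 3.
train_texts = ["dat is heel fijn", "nee dat wil ik niet", "dag mevrouw"]
train_labels = ["1", "-1", "0"]
dev_texts = ["dat vind ik fijn"]
dev_labels = ["1"]

# Basic model: word unigram counts fed into a linear classifier.
basic_model = Pipeline([
    ("vec", CountVectorizer()),   # word unigrams, raw counts
    ("clf", LinearSVC()),         # one of the linear algorithms compared in 4.1
])
basic_model.fit(train_texts, train_labels)
print(accuracy_score(dev_labels, basic_model.predict(dev_texts)))
```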

4.1 algorithm

For both models, I compared three different algorithms. Because I want to focus on a good analysis of the results, I chose algorithms that make it possible to print out the most informative features based on the weights that are assigned to the features. This is only possible with linear models. I compared a Naive Bayes algorithm (MultinomialNB) and two Support Vector Machines (SVC with a linear kernel and LinearSVC). The Naive Bayes algorithm is based on Bayes' theorem and is called naive because of the assumption that the effect of a feature is independent of other features. SVC and LinearSVC are two similar implementations of Support Vector Machines (SVMs) with a few differences. One difference is the kind of estimators that they use. The SVC implementation is based on the libsvm library, whereas LinearSVC is based on the liblinear library. With liblinear, you have more flexibility in the choice of penalties and loss functions. Another difference is how they implement multi-class classification. SVMs are originally designed for binary classification. To solve classification tasks with more than two classes, the algorithm can split the multi-class dataset into multiple binary datasets and fit a binary classification model on each. Two different examples of this approach are the one-vs-rest and one-vs-one strategies. The SVC algorithm uses the one-versus-one approach, which considers each binary pair of classes and trains a classifier on the subset of data containing those classes. Here, the class which has been predicted most often is the answer. The LinearSVC algorithm uses the one-versus-rest approach. This approach takes one class as positive and the rest as negative and produces a decision score for that class; the class with the highest score is selected.

Table 5 shows the achieved accuracy score and the top three features per class for the sentiment classifier. Table 6 shows the results for the emotion classifier.

For the sentiment classifier, the Naive Bayes algorithm achieved the highest accuracy score, followed by the SVC with linear kernel and lastly the LinearSVC. The most informative features, on the other hand, show a different picture. With the Naive Bayes algorithm, the top features are almost identical across classes, which indicates that the model cannot differentiate well between the three classes. The high accuracy score could have been achieved by focusing on characteristics in the text that are very specific to this data set.


Table 5: Accuracy and Most Informative Features with different algorithms for the sentiment classifier.

Algorithm          MultinomialNB  SVC(kernel=linear)  LinearSVC
Accuracy           0.753          0.735               0.719
Negative features  ik             erg                 och
                   dat            afschuwelijk        afschuwelijk
                   niet           fijn                verdwijnen
Neutral features   u              moet                gelezen
                   ik             niet                tekenen
                   dat            niks                fijner
Positive features  u              gelezen             dankbaar
                   ik             uitgelegd           vaker
                   dat            verwacht            super

The top features for the SVC with linear kernel are better, but still a bit strange. The top features for the negative class contain the word fijn (nice), which is clearly a positive word. Even though the LinearSVC obtained the lowest accuracy score, its top features make the most sense of the three classifiers. The top features for the negative class are clearly negative words and the top features for the positive class are clearly positive words.

The emotion classifier shows similar results. The top features for the Naive Bayes algorithm and SVC with linear kernel are very similar over the classes. With the LinearSVC, even though the accuracy is not the highest, the top features make much more sense. This indicates that this algorithm is, in general, better at distinguishing between the different classes. So in the following experiments, I am going to use the LinearSVC algorithm for both classifiers.

4.2 vectorizer

For the models to be able to process the text, the documents must be converted to vectors. This is done with a vectorizer. I compared two different vectorizers for both models. The CountVectorizer produces a sparse representation of the text by simply counting the words. The TfidfVectorizer creates a normalized version of this representation by multiplying the term frequency (TF) by the inverse document frequency (IDF). Using this method, tokens that occur very frequently in the data set, and that are hence less informative, have less impact than features that occur in a small fraction of the data set. As table 7 shows, both classifiers achieved a higher accuracy score with the TfidfVectorizer than with the CountVectorizer. Therefore, in the following experiments, the TfidfVectorizer will be used to transform the text for both classifiers.
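For reference, a sketch of the weighting that scikit-learn's TfidfVectorizer applies with its default settings (smoothed IDF, followed by l2 normalization of each document vector), where n is the number of documents and df(t) the number of documents containing term t:

$$\mathrm{idf}(t) = \ln\!\left(\frac{1 + n}{1 + \mathrm{df}(t)}\right) + 1, \qquad \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$$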

4.3 data processing

As a next experiment, I tried out different data processing methods to see whether this would improve the scores. As the data set is created from spoken instead of written text, the utterances are already free of punctuation and casing. To further simplify the text, I tried removing stop words, stemming and lemmatizing the utterances.


Table 6: Accuracy and Most Informative Features with different algorithms for the emotion classifier.

Algorithm    MultinomialNB  SVC(kernel=linear)  LinearSVC
Accuracy     0.621          0.548               0.559
Anger        ik             gehad               jezus
             dat            zeg                 download
             en             keer                gehad
Disgust      ik             download            afschuwelijk
             dat            jezus               se
             eh             begrijp             niks
Expectation  zeg            gelezen             leesvoer
             dat            jezus               aanklikken
             ik             klaar               steunpilaar
Fear         het            niet                haalbaar
             ik             jezus               oeh
             dat            ehm                 riskeert
Joy          u              jezus               vaker
             ik             ehm                 fijnst
             dat            niet                allebei
Love         u              foto                dankjewel
             voor           download            supermooi
             ik             jezus               bedankt
Ratio        u              ehm                 fijner
             ik             download            tijdelijke
             de             heb                 direct
Sadness      en             jezus               verdwijnen
             die            maar                och
             dat            gehad               vanwege
Shame        ik             ehm                 excuus
             u              gehad               sorry
             het            op                  unicef
Surprise     ik             niet                poeh
             dat            jezus               zesentwintig
             niet           ehm                 jeetje
Trust        dat            afschuwelijk        mogelijkheid
             u              se                  bezorgplicht
             ik             bezuinigen          mooier

Table 7: Comparison of vectorizers for both classifiers.

                 Sentiment classifier  Emotion classifier
Vectorizer       Accuracy              Accuracy
CountVectorizer  0.719                 0.559


Stop words are commonly used words in a language like a, the and is. These words occur in almost all documents and are therefore not good indicators of sentiment or emotion. So, removing them could improve the system. I use the list of Dutch stop words from Python's Natural Language Toolkit (NLTK, https://www.nltk.org/).

Stemming and lemmatizing are both ways of reducing words to a root form. The purpose of this is to reduce the vocabulary. With lemmatization, words are changed to their dictionary form or lemma. In English, for example, the words runs, ran and running all have the same lemma run. With stemming, on the other hand, words are reduced by simply removing prefixes and suffixes. For stemming, I used NLTK's SnowballStemmer. For lemmatization, I used the Dutch language model from the spaCy library.
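A minimal sketch of these three processing steps; the NLTK stop word list and SnowballStemmer are named in the text, while the specific spaCy model name (nl_core_news_sm) is an assumption, since the thesis does not state which Dutch model was used:

```python
import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

nltk.download("stopwords", quiet=True)          # Dutch stop word list from NLTK
dutch_stopwords = set(stopwords.words("dutch"))
stemmer = SnowballStemmer("dutch")
nlp = spacy.load("nl_core_news_sm")             # assumed Dutch spaCy model

def remove_stopwords(utterance):
    return " ".join(w for w in utterance.split() if w not in dutch_stopwords)

def stem(utterance):
    return " ".join(stemmer.stem(w) for w in utterance.split())

def lemmatize(utterance):
    return " ".join(token.lemma_ for token in nlp(utterance))

print(remove_stopwords("dat is heel fijn"))
print(stem("dat is heel fijn"))
print(lemmatize("dat is heel fijn"))
```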

Table 8 shows the obtained accuracy scores using the different processing methods for both classifiers. Removing stop words and lemmatization improve the score for the sentiment classifier. For the emotion classifier, all three methods improve the score, but stemming the text works better here than lemmatizing.

Table 8: Results with different text processing methods for both classifiers.

                     Sentiment classifier  Emotion classifier
Processing method    Accuracy              Accuracy
no processing        0.747                 0.598
removing stop words  0.753                 0.610
stemming             0.744                 0.619
lemmatization        0.749                 0.610

4.4 features

After testing algorithms, vectorizers and processing methods, I experimented with different feature combinations. The features that I implemented can be divided into three groups: textual, lexicon and label features.

Textual features

For textual features, I implemented different combinations of word and character n-grams. An n-gram is a contiguous sequence of n items from a given sample of text. Word bigrams (two consecutive words) or trigrams (three consecutive words) can help capture the context of a word. For example, when only using unigrams, the word ’happy’ would be counted as positive, even in a sentence like "I’m not happy." Adding bigrams could help the system understand that ’not happy’ has a negative meaning.

The n-gram features are noted with two values which indicate the lower and upper boundary of the range of n-grams to be extracted. For example, an n-gram range of (1,1) means only unigrams, (1,2) means unigrams and bigrams, and (1,3) means unigrams, bigrams and trigrams. Accuracy results for the different n-gram features can be found in table 20 in appendix A.
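A minimal sketch of how such word and character n-gram ranges can be combined in scikit-learn, here for the combination word (1,2) plus char (2,4) that appears in the tables below; the pipeline layout is illustrative, not the thesis code:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Word uni- and bigrams next to character 2- to 4-grams in one feature space.
textual_features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
])
model = Pipeline([("features", textual_features), ("clf", LinearSVC())])
```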


Lexicon features

For the lexicon feature group, I implemented two different lexicons. For the sentiment classifier, I used the Duoman sentiment lexicon (De Smedt and Daelemans, 2012). This lexicon contains 8782 Dutch words, each with a sentiment polarity score ranging from -2 to 2. Table 9 shows some example words from the Duoman sentiment lexicon. For every utterance, the words are looked up in the lexicon. The sum of all scores is divided by the number of words in the utterance, and this average score is used as a feature.

Table 9: Example words from the Duoman sentiment lexicon.

Word          Score
demotiverend  -2.0
fakkeldrager   0.0
pretmaker      1.5

For the emotion classifier, I used the Lilah emotion lexicon (Daelemans et al., 2020). This is a version of the NRC emotion lexicon (Mohammad and Turney, 2013) that has been translated to Dutch. The Lilah lexicon contains 6458 Dutch words paired with ten binary values that indicate whether the word is associated with two sentiments (negative and positive) and eight basic emotions (anger, anticipation, disgust, fear, joy, sadness, surprise and trust). Table 10 shows some example words from the Lilah emotion lexicon.

Table 10: Example words from the Lilah emotion lexicon.

Word         Pos  Neg  Ang  Ant  Dis  Fea  Joy  Sad  Sur  Tru
verlaten     0    1    1    0    0    1    0    1    0    0
bewondering  1    0    0    0    0    0    1    0    0    1
ambulance    0    0    0    0    0    1    0    0    0    1

The Lilah lexicon contains similar words with different linguistic forms, like vervreemd (alienated) and vervreemding (alienation). To make the lookup easier, I decided to convert all the words in the lexicon to lemmas. I used spaCy again for the lemmatization of the lexicon. After the lemmatization, I removed all lemmas that occurred more than once. By doing this, I assumed that words that are mapped to the same lemma are also associated with the same emotion. To turn this into a feature, each word in an utterance is looked up in the lexicon. The binary scores of each word occurring in the lexicon are added up and divided by the number of words in the sentence. This list of average scores is used as a feature. Table 21 in appendix A shows the results for the lexicon features. Unfortunately, adding the Duoman lexicon feature decreases the accuracy score obtained by the sentiment classifier. The emotion classifier does improve with the Lilah lexicon feature added, but only slightly.
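A minimal sketch of both lexicon features as described above; the toy lexicon entries are illustrative stand-ins, since the real Duoman and Lilah lexicons are loaded from their distribution files:

```python
import numpy as np

# Toy stand-ins for the Duoman (word -> polarity score) and
# Lilah (word -> ten binary sentiment/emotion values) lexicons.
duoman = {"fijn": 1.0, "afschuwelijk": -2.0}
lilah = {"fijn": np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 1])}

def duoman_feature(utterance):
    """Average Duoman polarity score over all words in the utterance."""
    words = utterance.split()
    return sum(duoman.get(w, 0.0) for w in words) / len(words)

def lilah_feature(utterance):
    """Average of the ten Lilah association values over all words in the utterance."""
    words = utterance.split()
    totals = sum((lilah.get(w, np.zeros(10)) for w in words), np.zeros(10))
    return totals / len(words)

print(duoman_feature("dat is heel fijn"))
print(lilah_feature("dat is heel fijn"))
```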

Label features

As the last feature group, I implemented some features based on labels in the data set. These labels were assigned to the utterances during the annotation process. If you wanted to use these models in the future to classify new, unlabelled utterances, you would of course not have these features. But for this research, I think it is interesting to see what influence these features have on the results. Moreover, if the label features improve the models, the most informative features per class will be more reliable. This will facilitate the analysis of the results.

For both classifiers I used the speaker and journey step labels as features. The speaker label indicates whether the utterance is said by the client or the agent. The journey step label represents the state that the client is in during this utterance. The possible values for this label are see, do, think, feel and want.

As I showed in chapter 3, there is a correlation between sentiment and emotion in the data set. This means that the sentiment label could be a good indicator for emotion and vice versa. To show this, I added the gold emotion label as a feature for the sentiment classifier and the gold sentiment label as a feature for the emotion classifier. The results of the label features can be found in table 21 in appendix A. As expected, adding the emotion and sentiment gold label features increases the scores considerably. The journey step label feature also increases the performance of both classifiers. Adding the speaker label, on the other hand, lowers the accuracy for both classifiers.
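A minimal sketch of how such categorical label features can be combined with the textual features in scikit-learn, assuming the annotated utterances are held in a pandas DataFrame; the column and step names are illustrative, not taken from the thesis code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

# Toy frame; in the thesis the columns come from the annotated data set.
train = pd.DataFrame({
    "text":      ["dat is heel fijn", "nee dank u", "dag mevrouw"],
    "speaker":   ["k", "k", "a"],
    "emotion":   ["joy", "ratio", "ratio"],
    "sentiment": ["1", "0", "0"],
})

preprocess = ColumnTransformer([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text"),    # word uni- and bigrams
    ("labels", OneHotEncoder(handle_unknown="ignore"),
     ["speaker", "emotion"]),                                   # one-hot label features
])
sentiment_clf = Pipeline([("prep", preprocess), ("clf", LinearSVC())])
sentiment_clf.fit(train, train["sentiment"])
```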

Data processing & feature combinations

After seeing how the different data processing methods and features perform on their own, I tried out different combinations of the methods and features that improved the score. This way, I attempted to reach the highest accuracy score for both classifiers. The results of this experiment can be found in tables 11 and 12. The tables are split into two parts. The first part shows processing method and feature combinations without the label features, to show how the models would perform in future cases on new, unlabelled data. The second part shows combinations including the label features.

Without label features, the sentiment classifier performed best with stop words removed, word uni- and bigrams and character 2 to 4-grams as features. This resulted in an accuracy score of 0.781. With label features, the sentiment classifier performed best with stop words removed, word uni- and bigrams and the speaker and emotion labels as features. This resulted in an accuracy score of 0.874. The emotion classifier reached an accuracy of 0.619 without label features. This was obtained using word unigrams, character 2 to 5-grams and the lexicon feature. With label features, the emotion classifier reached an accuracy score of 0.694. This was achieved with only word unigrams and the sentiment label as features.

4.5 hyperparameters

As a last experiment on the classifiers, I attempted to further increase the score of the best performing models by changing the hyperparameters of the LinearSVC algorithm. I tried different values for the parameters C, loss and class_weight. C is a regularization parameter, which trades off misclassification of training examples against simplicity of the decision surface. The higher the value of C, the larger the penalty for errors. Loss specifies the loss function: 'hinge' is the standard SVM loss function while 'squared_hinge' is the square of the hinge loss. Class_weight assigns a weight to parameter C for each class. If not given, all classes have the same weight. The 'balanced' mode automatically adjusts the weights inversely proportional to the class frequencies. Table 22 in appendix A shows the results of this experiment. For both classifiers, changing the parameters did not lead to better performance.
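As an illustration, a sketch of one way to search these parameter values; note that it uses scikit-learn's cross-validated grid search, whereas the thesis compared values directly on the development set, and the vectorized training data (X_train, y_train) is assumed to exist:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

param_grid = {
    "C": [0.1, 1, 10],                      # regularization strength
    "loss": ["hinge", "squared_hinge"],     # SVM loss function
    "class_weight": [None, "balanced"],     # per-class weighting of C
}
search = GridSearchCV(LinearSVC(), param_grid, scoring="accuracy", cv=5)
# search.fit(X_train, y_train)              # X_train: vectorized utterances
# print(search.best_params_, search.best_score_)
```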


Table 11: Accuracy results with different combinations of processing methods and features for the sentiment classifier.

Processing methods          Features (excluding label features)                     Accuracy
rm stop words, lemmatizing  word (1,2), char (2,4)                                  0.756
rm stop words               word (1,2), char (2,4)                                  0.781
rm stop words               word (1,2), char (2,4), lexicon                         0.776
                            word (1,2), char (2,4)                                  0.774

Processing methods          Features (including label features)                     Accuracy
rm stop words, lemmatizing  word (1,2), char (2,4), journey step, emotion           0.863
rm stop words               word (1,2), char (2,4), journey step, emotion           0.865
rm stop words               word (1,2), char (2,4), journey step, emotion, speaker  0.868
rm stop words               word (1,2), journey step, emotion, speaker, lexicon     0.872
rm stop words               word (1,2), journey step, emotion, speaker              0.872
rm stop words               word (1,2), journey step, emotion                       0.872
rm stop words               word (1,2), emotion                                     0.872
rm stop words, lemmatizing  word (1,2), emotion                                     0.872
rm stop words               word (1,2), emotion, speaker                            0.874
rm stop words               word (1,2), char (2,4), emotion, speaker                0.868
                            word (1,2), emotion, speaker                            0.865
                            word (1,3), emotion, speaker                            0.863

Table 12: Accuracy results with different combinations of processing methods and features for the emotion classifier.

Processing methods       Features (excluding label features)                        Accuracy
rm stop words, stemming  word (1,1), char (2,5), lexicon                            0.598
rm stop words            word (1,1), char (2,5), lexicon                            0.596
rm stop words            word (1,1), char (2,5)                                     0.594
stemming                 word (1,1), char (2,5), lexicon                            0.598
stemming                 word (1,1), char (2,5)                                     0.598
                         word (1,1), char (2,5), lexicon                            0.619
                         word (1,1), char (2,5)                                     0.616

Processing methods       Features (including label features)                        Accuracy
rm stop words, stemming  word (1,1), char (2,5), lexicon, journey step, sentiment   0.676
rm stop words            word (1,1), char (2,5), lexicon, journey step, sentiment   0.687
rm stop words            word (1,1), lexicon, journey step, sentiment               0.680
rm stop words            word (1,2), lexicon, journey step, sentiment               0.685
rm stop words            word (1,3), lexicon, journey step, sentiment               0.678
rm stop words            word (1,2), journey step, sentiment                        0.685
rm stop words            word (1,2), sentiment                                      0.674
                         word (1,2), sentiment                                      0.664
                         word (1,1), char (2,5), sentiment                          0.685
                         word (1,1), journey step, sentiment                        0.680
                         word (1,1), lexicon, sentiment                             0.692
rm stop words, stemming  word (1,1), sentiment                                      0.683
rm stop words            word (1,1), sentiment                                      0.687
stemming                 word (1,1), sentiment                                      0.685


4.6 fasttext

As I explained in chapter 1, the primary focus of this study is on the traditional classifiers because they make it possible to analyse the results. However, a lot of recent related work achieves good scores using neural networks. Therefore, I want to see how a neural network performs on this task, and whether it can outperform the traditional classifiers. I used Fasttext for this, an open-source library that allows easy implementation of a neural text classifier. I trained two Fasttext models on the sentences in my training data set, one with the sentiment labels and one with the emotion labels. Then, I tested the models on my development data set. Table 13 shows the results of this experiment. The first row shows accuracy results for the models which are only trained on my training data set (Fasttext reports these scores as precision; since they indicate the proportion of correctly predicted labels, I refer to them as accuracy). The second row shows scores obtained with Dutch pre-trained word vectors added to the training process. Both models perform worse with the added vectors. A possible explanation for this could be that these vectors are not trained on call center data.

Lastly, I used the automatic hyperparameter optimization feature of Fasttext to select the best parameter combinations for both models without the pre-trained vectors. With these settings, the sentiment model achieved an accuracy of 0.758 and the emotion model an accuracy of 0.653. Table 23 in appendix A shows the optimal hyperparameter values for both models.
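A minimal sketch of this setup with the Fasttext Python bindings, assuming the utterances have been written to files in Fasttext's expected format (one line per utterance, prefixed with its label, e.g. "__label__1 dat is heel fijn"); the file names are illustrative:

```python
import fasttext

# Plain supervised model with 300-dimensional vectors (first row of Table 13).
model = fasttext.train_supervised(input="sentiment.train", dim=300)

# Variant with pre-trained Dutch vectors (second row of Table 13):
# model = fasttext.train_supervised(input="sentiment.train", dim=300,
#                                   pretrainedVectors="cc.nl.300.vec")

# Variant with automatic hyperparameter optimization against the development set:
# model = fasttext.train_supervised(input="sentiment.train",
#                                   autotuneValidationFile="sentiment.dev")

n, precision, recall = model.test("sentiment.dev")  # precision@1, reported as accuracy
print(precision)
```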

Table 13: Accuracy scores for both Fasttext models.

                                            Sentiment model  Emotion model
Parameters                                  Accuracy         Accuracy
dim=300                                     0.742            0.600
dim=300, pretrainedVectors='cc.nl.300.vec'  0.708            0.534
optimal parameters                          0.758            0.653

4.7 evaluation

The experiments showed that the sentiment classifier performed best on the development data with stop words removed, using the LinearSVC algorithm with default parameters and a combination of word uni- and bigrams and the emotion and speaker labels as features. The emotion classifier achieved the best results using the LinearSVC algorithm with default parameters and word unigrams and the sentiment label as features. In the next chapter, I will run both classifiers with these optimal settings on the test data set and report the results. To evaluate the performance of the two classifiers, I will look at several evaluation scores: accuracy, precision, recall, and F1-score. I will also run the best performing models without label features on the test data, to show how the models would perform on unlabelled data. Lastly, I will run the Fasttext models with optimal parameters on the test data set.

To put these results into perspective, the accuracy scores will be compared to a majority baseline. A model that has learned nothing could reach a reasonable score by always predicting the most frequent label in the training set. For the sentiment classifier, this would give an accuracy of 0.7, because 70% of the utterances in the training set are labelled as neutral, which is the largest class. Regarding emotion, the most frequent class is ratio, which takes up 60% of the utterances in the training set. Therefore, the majority baseline for the emotion classifier is an accuracy of 0.6.

For both models, I will also plot a confusion matrix of the predicted labels to get a clear view of the classification performance per class. Lastly, in regard to my second research question, I will analyse the most informative features of each class extensively.
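A minimal sketch of this evaluation with scikit-learn; the gold and predicted labels shown here are toy values standing in for the test-set output of the classifiers:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, classification_report,
                             ConfusionMatrixDisplay)

# Toy gold and predicted sentiment labels; in practice these come from
# running the fitted classifier on the held-out test set.
y_test = ["0", "1", "0", "-1", "1"]
y_pred = ["0", "1", "0", "0", "1"]

print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))  # precision, recall, F1 per class
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)        # confusion matrix plot
plt.show()
```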


5

R E S U L T S & D I S C U S S I O N

In this chapter, I will report and discuss the performance of the classifiers on the test data set.

5.1 sentiment classification

Comparison of the sentiment models

Table 14 shows the accuracy scores of the sentiment classifier models with and without label features, the Fasttext model and the majority baseline. All three models outperform the baseline, from which I conclude that the models have learned to distinguish between the different sentiments. The LinearSVC without label features performed slightly better than the baseline. The LinearSVC with label features scores much better, which shows that the emotion and speaker label features have a big impact on the performance. For the model without label features, the accuracy on the test set dropped only slightly compared to the development set. For the model with label features, the accuracy increased. This means there is no sign of overfitting, which is good. Against expectations, the Fasttext model scores lower than both LinearSVC models. A possible explanation for this could be that the training data set was too small for the Fasttext model to properly learn the differences between the sentiments.

Table 14: Accuracy scores for the sentiment classifier.

                                  Development set  Test set
Model                             Accuracy         Accuracy
Majority baseline                 0.700            0.700
LinearSVC without label features  0.781            0.767
LinearSVC with label features     0.874            0.877
Fasttext model                    0.758            0.758

Detailed results of the sentiment LinearSVC with label features

Table 15 shows the precision, recall and F1 scores obtained by the LinearSVC classifier with label features on the test set. The classifier predicted the neutral class really well, with an F1-score of 0.915. The classifier scores lower on the negative and positive classes, which makes sense as these classes are a lot smaller.

Figure 2 shows a confusion matrix of the predicted sentiment labels. The matrix shows that the system has no trouble distinguishing between negative and positive utterances. When the system made a mistake, it was a confusion between negative and neutral sentiment or between positive and neutral sentiment. This reflects the ordinal scale of the labels, where positive is closer to neutral than to negative and negative is closer to neutral than to positive.


Table 15: Precision, recall, F1 and accuracy scores for the sentiment classifier.

Class         Precision  Recall  F1-score
negative      0.895      0.654   0.756
neutral       0.892      0.938   0.915
positive      0.821      0.750   0.784
Macro avg     0.869      0.781   0.818
Weighted avg  0.875      0.877   0.874
Accuracy      0.877


Most informative features of the sentiment LinearSVC with label features

To gain insight into the system, I extracted the most informative features for each class. The LinearSVC classifier assigns a coefficient score to all features for the different classes. These coefficient scores provide transparency about what the model has learned in the training process. They allow you to better understand the system and what it based its decisions on. The features with the highest coefficient scores are the features that the classifier learned to be important indicators for the given class. Table 16 shows the top twenty features per class that were assigned the highest coefficient scores. Because stop words were removed from the utterances, some bigrams among these features can look a bit strange. Possibly, there was an extra function word between the two words in the original text.
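A minimal sketch of how such a ranking can be extracted from a fitted model; the step names "vec" and "clf" are assumptions about how the pipeline is organised, not the thesis code:

```python
import numpy as np

def top_features_per_class(clf, feature_names, n=20):
    """Return the n features with the highest LinearSVC coefficient per class."""
    top = {}
    for i, label in enumerate(clf.classes_):
        best = np.argsort(clf.coef_[i])[::-1][:n]   # indices of the largest coefficients
        top[label] = [feature_names[j] for j in best]
    return top

# Usage, assuming `model` is a fitted Pipeline with a TfidfVectorizer step
# called "vec" and a LinearSVC step called "clf":
# names = model.named_steps["vec"].get_feature_names_out()
# print(top_features_per_class(model.named_steps["clf"], names))
```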

Table 16: Most Informative Features per class for the sentiment classifier.

Negative Neutral Positive

jammer duidelijk dank prima

emotion=sadness emotion=ratio jazeker emotion=anger maken natuurlijk zit ehm emotion=disgust klopt gekregen emotion=joy

mis jawel wel

komt goed bent bekend emotion=love

komt euro wel zeker weten

behoefte ingb leuk

nee doe geweldig heerlijk gebeurd

afgelopen zaterdag dag wensen nieuws

weten app dag

moeilijke vraag wachten mevrouw goed

tegelijk krant tubantia jahoor

helemaal flauw doe nummer oh goed

moeilijke geïnteresseerd actualiteit goed ga

gemaakt wensen das verlaging

lang werk hoor nee vraag meneer

vind fijn jongeman hoor mag

daarna zag oh jongeman geval heel

zag twee helemaal top extra

The top features again show that the emotion label feature is really important for the system. These features are noted as 'emotion=[emotion label]' and appear among the top features of all three sentiment classes. The labels sadness, anger and disgust are clear indicators for negative sentiment. The ratio label is an indicator for neutral sentiment, and the joy and love emotion labels are good indicators for positive sentiment.

Further, we see a number of words that you would expect to be associated with the given class, for instance jammer (unfortunate) and mis (wrong) for the negative class and prima (fine (adj)), leuk (nice) and goed (good) for the positive class. These words clearly express sentiment; they state how someone feels about something. But you need some context to know what it is that invokes these positive or negative feelings. On their own, we cannot draw conclusions from these words.

I also spot some words and bigrams in the positive and negative lists that relate to (un)certainty. For the negative class, the informative features include weten (to know), moeilijke vraag (difficult question) and moeilijke (difficult). These words are indicators of uncertainty. For the positive class, the top features include jazeker (of course) and zeker weten (sure), words that indicate certainty. This gives me the impression that uncertainty is associated with negative sentiment, whereas certainty is associated with positive sentiment.

The top features for the neutral class include doe nummer (do number) and ingb. 'ingb' is formed by the first four letters of bank account numbers from a Dutch bank. 'nummer' could refer to a bank account, a telephone number or a house number. During the phone calls, if a client has accepted the offer, the agent asks the client for their address and payment details. This is the part of the conversation where these features would occur. These features give the impression that the sentiment expressed during this part of the conversation is mostly neutral.

5.2 emotion classification

Comparison of the emotion models

Table 17 shows the accuracy scores of the emotion classifier models with and without label features, the Fasttext model and the majority baseline. Again, all three models outperform the baseline, but not by as much as the sentiment classifiers. This is not a big surprise, as the emotion classification task is a lot more challenging than sentiment classification due to the larger number of classes. The LinearSVC with label features performs a bit better than the model without label features. And just like for sentiment classification, the Fasttext model scores lower than both LinearSVC models.

Table 17: Accuracy scores for the emotion classifier.

                                  Development set  Test set
Model                             Accuracy         Accuracy
Majority baseline                 0.600            0.600
LinearSVC without label features  0.619            0.642
LinearSVC with label features     0.694            0.694
Fasttext model                    0.653            0.619

Detailed analysis of the emotion LinearSVC with label features

In table 18 you can find the precision, recall and F1-scores obtained by the LinearSVC with label features. The performance of the system varies a lot over the different classes. The macro average F1-score is quite low. The big difference between this score and the weighted average F1-score indicates that the system is especially struggling with the less frequent classes. The model actually only performs well on the ratio class, with an F1-score of 0.864. Remarkably, the expectation class has one of the worst performances, even though it is the second biggest class in the data set (see table 4). This indicates that expectation is a difficult emotion for the model to detect. The system also has difficulty with the surprise class, as none of these utterances were correctly labeled as surprise.

Figure 3 shows a confusion matrix of the predicted emotion labels. This gives a better picture of the kind of mistakes that the system makes. The matrix shows that the model often confuses two emotions that are both largely labeled with the same sentiment, as shown in figure 1. As you can see, the majority of utterances with emotion 'expectation' are wrongly predicted as 'ratio'. Also the surprise and shame utterances are often mistaken for ratio. As figure 1 illustrates, these are the emotions that are largely labeled with neutral sentiment in the data set, just like the ratio utterances. Therefore, I think the sentiment label feature contributes to these errors. The same happens with emotions that appear mostly with positive sentiment, like joy and love, and emotions that appear mostly with negative sentiment, like anger, fear and sadness.

Table 18: Precision, recall, F1 and accuracy scores for the emotion classifier

Class         Precision  Recall  F1-score
anger         0.167      0.200   0.182
disgust       1.000      0.333   0.500
expectation   0.429      0.226   0.296
fear          0.500      0.400   0.444
joy           0.490      0.568   0.526
love          0.444      0.400   0.421
ratio         0.806      0.932   0.864
sadness       0.667      0.333   0.444
shame         0.500      0.125   0.200
surprise      0.000      0.000   0.000
trust         0.393      0.324   0.355
Macro avg     0.490      0.349   0.385
Weighted avg  0.658      0.694   0.665
Accuracy      0.694

Most informative features of the emotion LinearSVC with label features

Table 19 shows the top ten features that were assigned the highest coefficient scores per emotion class. The top features for the class 'shame' show two things. First, based on the words gehoord (heard) and verstaan (understand), I suspect that it evokes feelings of shame when someone could not understand the other person during a call. Second, the top features contain a lot of words that are related to apologizing, like sorry (sorry), excuses (apologies), excuus (excuse) and kwalijk (part of the Dutch saying neem mij niet kwalijk, which translates to excuse me). Based on this, I think that feelings of shame often go together with making apologies.

The top features for the 'love' class mostly contain words that are said to thank somebody, like bedankt (thanks), bedanken (to thank), dankjewel (thank you), dank (thanks) and dankuwel (thank you). These words are used at the end of almost every call to wrap up the conversation and to thank the other person. However, I think that in many cases these words are said out of politeness rather than out of genuine feelings of love.

A feature that stands out among the top features for the class ’joy’ is the word koffie (coffee). Apparently, coffee is something that evokes feelings of joy in people.

Besides these small observations, I can actually draw few conclusions from these features. The most informative features for the smallest classes like disgust, anger, fear and sadness occur in very few utterances. For instance, the words jezus (jesus), bezuinigen (economize) and boete (fine (noun)) only occur once in the whole data set. This again indicates that the data set is too small for such a detailed classification. The system will probably not perform well on other data that differs in context. Other features that demonstrate this are the words vrouwen (women) for the anger class and meisjes (girls) for the sadness class. The charity for which donations are requested during these phone calls is committed to women and girls in developing countries who cannot go to school. Normally, you would not expect these words to be associated with anger or sadness, but in the case of these specific conversations they are. So, for these specific calls, the words vrouwen and meisjes are very informative, but this model would not perform well on phone calls with different purposes.

Table 19: Most informative features per class for the emotion classifier.

Anger:        keer, hij, jezus, ehm, gehad, download, weer, vier, vrouwen, rest
Disgust:      hebben, hoor, bezuinigen, se, afschuwelijk, mailtjes, niks, 't, zoveel, interesse
Expectation:  benoemen, helpen, aanklikken, zat, gelegen, mailadres, leesvoer, weekend, betekent, me
Fear:         haalbaar, oeh, opnoemen, we, terug, lukt, regelt, riskeert, boete, lijkt
Joy:          haha, wensen, geweldig, wakker, koffie, allebei, bepalen, groen, actualiteit, dus
Love:         bedankt, bedanken, dankjewel, dank, supermooi, jongeman, allereerst, geweldige, dankuwel, succes
Ratio:        blad, plaatselijke, beant, verlaging, tegoed, probeer, verlengd, spijt, moeilijke, belde
Sadness:      meisjes, effe, vreselijk, opslaan, vanwege, verdwijnen, ze, nog, helaas, heel
Shame:        sorry, excuses, unicef, excuus, twijfelde, gehoord, verstaan, rekening, onderwijsprojecten
Surprise:     oh, zesentwintig, poeh, jeetje, doorgekregen, gestuurd, gebeurd, bekend, kaartje
Trust:        jawel, hoeft, rekenen, klopt, bezorgplicht, verschijnen, telefoon, vrijheid, gelukkig


6 C O N C L U S I O N

In this chapter, I will summarize the key findings that I derive from the results presented in chapter 5. I will restate the research questions and try to answer them. Moreover, I will discuss some of the limitations of this research and provide recommendations for future research.

6.1 answering the research questions

This thesis aimed at performing sentiment analysis on call center phone calls. Next to predicting sentiment and emotion, I focused on analysing the results in order to track down what could be improved. I attempted to answer the following two research questions:

1. How can we predict sentiment and emotion of utterances from transcribed call center conversations using machine learning?

2. Which words or topics are most indicative of positive emotions, negative emotions and more fine-grained emotion categories?

To answer the first question, I built two classifier systems to predict the sentiment and emotion labels of the utterances in the phone calls. For both tasks, the LinearSVC algorithm with default parameters appeared to be the most suitable, based on the most informative features and the obtained accuracy scores. The sentiment classifier performed best with stop words removed, a combination of word uni- and bigrams, and the emotion and speaker labels as features. This model reached an accuracy of 0.877, which exceeds the majority baseline of 0.7. If this approach were used in future cases, to classify new, unlabeled data, the label features would of course not be available. In that case, the model would perform best with stop words removed, word uni- and bigrams, and character 2- to 4-grams as features.
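As an illustration of this setup, the sketch below shows one way such a sentiment classifier could be assembled with scikit-learn. The utterances, labels and the stop-word list are made-up placeholders, and the exact preprocessing in the thesis code may differ.

```python
# Sketch of the best sentiment setup: stop words removed, word uni-/bigrams,
# and the emotion and speaker labels added as one-hot features.
# All data below is a made-up placeholder, not the actual thesis data.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

dutch_stop_words = ["de", "het", "een", "en", "ik", "je"]  # assumed stop list

utterances = ["dat is heel mooi", "nee dat wil ik niet"]
emotions = [["joy"], ["anger"]]          # emotion label per utterance
speakers = [["agent"], ["customer"]]     # speaker label per utterance
sentiments = ["positive", "negative"]

text_vec = TfidfVectorizer(ngram_range=(1, 2), stop_words=dutch_stop_words)
emo_enc = OneHotEncoder(handle_unknown="ignore")
spk_enc = OneHotEncoder(handle_unknown="ignore")

# Stack the n-gram features and the one-hot label features into one matrix.
X = hstack([text_vec.fit_transform(utterances),
            emo_enc.fit_transform(emotions),
            spk_enc.fit_transform(speakers)])

clf = LinearSVC()  # default parameters, as in the thesis
clf.fit(X, sentiments)
```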

The emotion classifier achieved the best results using only word unigrams and the sentiment label as features. This model also exceeded the majority baseline of 0.6, with an accuracy of 0.694. The emotion classifier without label features performed best with word unigrams, character 2- to 5-grams and the emotion lexicon as features.

I also compared these results with the performance of two neural classifiers implemented using Fasttext. For both sentiment and emotion classification, the Fasttext models performed worse than the LinearSVC models. This shows that neural networks do not always outperform traditional classifiers. A possible explanation could be that the training data set was too small for the Fasttext models to properly learn the differences between the classes. I expect that a neural network could perform better on these tasks if it used pre-trained vectors, provided that these vectors are partially trained on call center conversations, so that they are based on similar data. But even if a neural network were to outperform the traditional classifiers, it would still be almost impossible to analyse what the system based its decisions on.
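For reference, a Fasttext supervised classifier can be trained in a few lines. The file name and hyperparameter values below are assumptions for illustration, not the settings used in this thesis.

```python
# Sketch of a Fasttext supervised baseline. The training file is assumed to
# contain one utterance per line, prefixed with a label such as
# "__label__positive dat is heel mooi".
import fasttext

model = fasttext.train_supervised(
    input="sentiment_train.txt",  # assumed file name
    epoch=25,                     # illustrative hyperparameter values
    lr=0.5,
    wordNgrams=2,
)
print(model.predict("dat is heel mooi"))
```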


In order to answer the second research question, I analysed the most informative features per class for both classifiers. These features provide insight into what the model has learned in the training process and what it based its decisions on. The features with the highest coefficient scores are the features that the classifier learned to be important indicators for the given class. By looking at these features, I attempted to get an idea of what topics or words are associated with different sentiments or emotions.
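A minimal sketch of this feature inspection, assuming a fitted vectorizer and a multi-class LinearSVC like the ones sketched earlier, could look as follows; the helper name is hypothetical.

```python
# Sketch: list the top-n features per class, ranked by LinearSVC coefficient.
# Assumes a multi-class classifier (three or more classes). If extra label
# features were stacked after the n-grams, only the first columns of coef_
# correspond to n-gram features, so we slice them off.
import numpy as np

def top_features_per_class(vectorizer, classifier, n=10):
    names = np.asarray(vectorizer.get_feature_names_out())
    for i, label in enumerate(classifier.classes_):
        coefs = classifier.coef_[i][:len(names)]
        best = np.argsort(coefs)[::-1][:n]
        print(label, ", ".join(names[best]))
```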

Unfortunately, besides some small observations, I can draw few conclusions from these features. The top features include many words that clearly express the sentiment or emotion of the given class, like leuk (nice), jammer (unfortunate), afschuwelijk (awful) or geweldig (awesome). However, on their own, these words are not very informative. I need more context to be able to say what topics are associated with these sentiment or emotion classes.

I also noticed that a lot of the top features for the emotion classes occurred in very few utterances. This indicates that the data set is probably too small for such a detailed classification. The model now specializes in classifying phone calls that are very similar to the ones in the training data set. But if the model were run on other data that differs in context, it would probably not perform well.

6.2 limitations and recommendations for future research

The data set that is used forms one of the biggest limitations of this research. It is relatively small and the distribution of the classes is heavily skewed. Because of this, some of the emotion classes in particular contain very few utterances. This forms a major challenge, because for these classes the models have very little information to base their decisions on.

Besides the size, the content of the data set is also limited. It only contains outbound phone calls, whose purpose is selling newspaper subscriptions or collecting donations for charity. Moreover, the charity for which money is raised is the same for all the calls in the data set. So, the models are trained on very specific conversations and will probably not perform well on other calls where a different product is sold or money is raised for a different charity.

Another limitation of the data set is the way the utterances were labeled. Whether an utterance belongs to a certain class is not always clear, and different annotators could have different opinions about what the right label should be. Therefore, when annotating data, it is preferable to have multiple annotators label the same training instances. This way, you are able to compute the inter-annotator agreement (IAA), a statistic that measures agreement between annotators on categorical items. However, creating labeled data is very expensive and time consuming. To keep the costs of the annotation process low, each utterance in our data set was labeled by only one annotator. Therefore, the reliability of the labels cannot be verified.
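As an illustration, with two annotators labeling the same utterances, the IAA could be computed with, for example, Cohen's kappa; the labels below are made up.

```python
# Sketch: inter-annotator agreement (Cohen's kappa) between two annotators
# who labeled the same utterances; the label lists are illustrative only.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["joy", "ratio", "anger", "ratio", "trust"]
annotator_b = ["joy", "ratio", "ratio", "ratio", "trust"]

print(cohen_kappa_score(annotator_a, annotator_b))  # 1.0 means perfect agreement
```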

A first step for future research would be to improve the training data in order to obtain better and more meaningful results. Then, it would hopefully be possible to track down which topics are associated with different emotions or sentiments. As the task of sentiment classification is easier than the more fine-grained emotion classification, it leads to better scores. But when emotion classification is done properly, it can be much more informative than sentiment classification, because utterances that are perceived as negative do not necessarily need to be avoided, depending on the underlying emotion of the utterance and the purpose of the phone call. At first, you might expect sadness to be an emotion to avoid because it is often perceived as negative. But when the purpose of the phone calls is, for example, to raise money for poor children in developing countries, a sad story could be very effective in evoking compassion in the customer. In this situation, the emotion 'sadness' will likely have a positive effect on the number of sales.

This raises another point for further research. If the ultimate goal is to increase the number of sales, by changing the call scripts for instance, knowing which topics are associated with different emotions is not enough. You still have to find out how certain emotions affect the number of sales in different situations.


B I B L I O G R A P H Y

Annett, M. and G. Kondrak (2008). A comparison of sentiment analysis techniques: Polarizing movie blogs. In Conference of the Canadian Society for Computational Studies of Intelligence, pp. 25–35. Springer.

Daelemans, W., D. Fišer, J. Franza, D. Kranjčić, J. Lemmens, N. Ljubešić, I. Markov, and D. Popič (2020). The LiLaH emotion lexicon of Croatian, Dutch and Slovene. Slovenian language resource repository CLARIN.SI.

De Smedt, T. and W. Daelemans (2012). "Vreselijk mooi!" (terribly beautiful): A subjectivity lexicon for Dutch adjectives. In LREC, pp. 3568–3572.

Fernandez, R. (2004). A computational model for the automatic recognition of affect in speech. Ph.D. thesis, Massachusetts Institute of Technology, Dept. of Architecture, Program in Media Arts and Sciences.

Homburg, C. and A. Giering (2001). Personal characteristics as moderators of the relationship between customer satisfaction and loyalty—an empirical analysis. Psychology & Marketing 18(1), 43–66.

Hu, X., J. Tang, H. Gao, and H. Liu (2013). Unsupervised sentiment analysis with emotional signals. In Proceedings of the 22nd International Conference on World Wide Web, pp. 607–618.

Kanchinadam, T., Z. Meng, J. Bockhorst, V. S. Kim, and G. Fung (2021). Graph neural networks to predict customer satisfaction following interactions with a corporate call center.

Katz, G., N. Ofek, and B. Shapira (2015). Consent: Context-based sentiment analysis. Knowledge-Based Systems 84.

Kharde, V., P. Sonawane, et al. (2016). Sentiment analysis of twitter data: a survey of techniques. arXiv preprint arXiv:1601.06971.

Mac Kim, S., A. Valitutti, and R. A. Calvo (2010). Evaluation of unsupervised emotion models to textual affect recognition. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pp. 62–70.

Mohammad, S. M. and P. D. Turney (2013). NRC emotion lexicon. National Research Council, Canada 2.

Munikar, M., S. Shakya, and A. Shrestha (2019). Fine-grained sentiment classification using BERT. In 2019 Artificial Intelligence for Transforming Business and Society (AITB), Volume 1, pp. 1–5. IEEE.

Pang, B., L. Lee, and S. Vaithyanathan (2002). Thumbs up? sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, USA, pp. 79–86. Association for Computational Linguistics.


Park, Y. and S. C. Gates (2009). Towards real-time measurement of customer satisfaction using automatically generated call transcripts. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, New York, NY, USA, pp. 1387–1396. Association for Computing Machinery.

Plutchik, R. (1980). A general psychoevolutionary theory of emotion. In Theories of emotion, pp. 3–33. Elsevier.

Strapparava, C. and R. Mihalcea (2008). Learning to identify emotions in text. In Proceedings of the 2008 ACM symposium on Applied computing, pp. 1556–1560.

Strapparava, C., A. Valitutti, et al. (2004). WordNet Affect: an affective extension of WordNet. In LREC, Volume 4, p. 40. Citeseer.

Turney, P. D. (2002). Thumbs up or thumbs down? semantic orientation applied to unsuper-vised classification of reviews. arXiv preprint cs/0212032.

Ververidis, Kotropoulos, and Pitas (2004). Automatic emotional speech classification. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I–593.

Wang, J. (2015). Predicting yelp star ratings based on text analysis of user reviews.

Wilson, T., J. Wiebe, and P. Hoffmann (2005). Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of human language technology conference and conference on empirical methods in natural language processing, pp. 347–354.

Yaeger, L. (2017). Neural networks pt. 1 terms definitions lecture 5 i400/i590 artificial life as an approach to artificial intelligence.


A D E T A I L E D R E S U L T S

a.1 n-gram feature results

Table 20 shows the accuracy scores obtained with different combinations of word and character n-grams. The n-gram features are noted with two values, which indicate the lower and upper boundary of the range of n-grams to be extracted. For example, an n-gram range of (1,1) means only unigrams, (1,2) means unigrams and bigrams, and (1,3) means unigrams, bigrams and trigrams. For both classifiers, the highest obtained score is marked with an asterisk.

Table 20: Accuracy results with different combinations of word and character n-grams for both classifiers.

N-gram features            Sentiment classifier   Emotion classifier
word (1,1)                 0.747                  0.598
word (1,2)                 0.749                  0.610
word (1,3)                 0.758                  0.607
word (1,4)                 0.760                  0.605
word (1,1), char (2,3)     0.749                  0.607
word (1,1), char (2,4)     0.763                  0.612
word (1,1), char (2,5)     0.758                  0.616*
word (1,1), char (2,6)     0.763                  0.603
word (1,2), char (2,3)     0.767                  0.605
word (1,2), char (2,4)     0.774*                 0.610
word (1,2), char (2,5)     0.772                  0.603
word (1,2), char (2,6)     0.767                  0.610
word (1,3), char (2,3)     0.758                  0.612
word (1,3), char (2,4)     0.765                  0.603
word (1,3), char (2,5)     0.765                  0.607
word (1,3), char (2,6)     0.774*                 0.605
word (1,4), char (2,3)     0.747                  0.607
word (1,4), char (2,4)     0.760                  0.605
word (1,4), char (2,5)     0.767                  0.603
word (1,4), char (2,6)     0.769                  0.605
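To illustrate the notation, a combination such as word (1,2) with char (2,4) corresponds to two vectorizers whose outputs are concatenated. A minimal scikit-learn sketch with illustrative settings (not necessarily the thesis implementation) is shown below.

```python
# Sketch: the row "word (1,2), char (2,4)" as a feature union of a word
# n-gram vectorizer (unigrams and bigrams) and a character n-gram
# vectorizer (2- to 4-grams). Settings are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

features = FeatureUnion([
    ("word_1_2", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char_2_4", CountVectorizer(analyzer="char", ngram_range=(2, 4))),
])

X = features.fit_transform(["dat is heel mooi", "nee dat wil ik niet"])
print(X.shape)
```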

a.2 lexicon & label feature results

Table 21 shows the accuracy scores with the different lexicon and label features for both classifiers. All scores are obtained using word unigrams as a basis, with the given feature added. All scores that are higher than the score obtained with only word unigrams are in bold.
