Name: Gerard Johan Visser Student number: S1068008 Date: 20-08-2018
Supervisor: Dr. P. Haazebroek Second reader: Dr. R. E. de Kleijn
A chatbot with interpersonal
communication recognition: determining
the position on Leary’s Rose after automatic text analysis
Abstract
The research area of chatbots is relatively young, and the goal of this study is therefore to gather more information about text categorization by chatbots. Chatbots are increasingly used in online communication with users, and it is a challenge to let them respond appropriately on an emotional level, in such a way that users experience the answers as a positive interaction. The current study examined the possibility of mapping this interaction using text analysis with the LIWC and classification on Leary’s Rose. Based on Leary’s Rose we predicted the positive experience with the NPS. The study consists of three phases. The first phase is a text analysis to scale sentences on Leary’s Rose. The sentences were scaled by 101 participants on two scales (the “I & We” scale and the “Dominance & Submissive” scale). With these scaled sentences, a classifier (classifier A) was created and trained with the LIWC and a regression analysis. The results of phase one suggest that our database contains mostly “Dominance/I” and “Submissive/We” sentences. Classifier A (80.8%) is 3% better than the random baseline (77.8%). Classifier A was tested in phase two with self-annotated sentences, written by 14 participants chatting on two scenarios. Based on these self-annotated sentences we also created two new classifiers (B1 & B2). The test result of classifier A (59.0%) equals the random baseline (59.0%). The two new classifiers (B1 & B2) created in phase two performed better (B1: 72.3% and B2: 76.6%) than the random baseline (59.1%). In phase three we tried to predict the NPS on Leary’s Rose; based on Kendall’s tau-b correlations and crosstabs we compared the classifiers. The findings suggest that it is possible to predict the NPS based on Leary’s Rose. A possible implication is the need for a multimodal approach to text analysis. Future research should focus on better ways of annotating to prevent skewed, small and noisy databases. More implications and suggestions are presented in the discussion.
Keywords: emotion detection, emotion classification, text analysis, Leary’s Rose,
interpersonal communication, chatbot, Net Promoter Score (NPS)
A chatbot with interpersonal communication recognition:
determining the position on Leary’s Rose after automatic text analysis.
In recent years, more and more chatbots have become available in different areas. Via chatbot software a human is able to interact with a computer in natural language. This software can support daily life; an example is a helpdesk chatbot (Rahman, 2012; Shawar, Atwell & Roberts, 2005) that is able to answer questions from customers: a customer wants information about a problem with a product, asks the chatbot, and the chatbot answers the question. Chatbots are also used in areas such as educational tools (Keshtkar, Burkett, Li & Graesser, 2014; Vaassen & Daelemans, 2010) and e-commerce and business (Chattaraman, Kwon & Gilbert, 2012). Because chatbots are used more and more, improvements should be made, and one of the challenges is to detect the user’s emotion in a chatbot conversation. To return to our helpdesk example: imagine a helpdesk chatbot that can detect someone’s emotion. The chatbot would then be able to adapt its type of communication to the emotion of the user, in a way that makes him or her feel more understood. The chatbot would also be able to recognize when the conversation goes awry and transfer the customer to a human agent. To reach this goal more research is needed. This study focuses on emotion classification of customer conversations with a helpdesk chatbot from a large online retailer.
Chatbots
A chatbot is a computer program designed to communicate with human users via natural language. The chatbot recognizes words or groups of words and gives answers based on this input. This type of chatbot has certain benefits. First of all, a chatbot is always available and handles requests in real time, 24 hours a day. In addition, communication with a chatbot is a dialog, which is more effective than a monolog when dealing with humans (Tatai, Csordás, Kiss, Szaló & Laufer, 2003). Furthermore, the chatbot combines large amounts of information and shows only the information the user asks for. Finally, a chatbot can handle many cases simultaneously, which is cost-saving for the company because fewer employees are needed to answer the users’ questions. However, at this moment there are challenges to deal with due to the complexity of human language and emotions. Computers have difficulty understanding the endless variability of expression in how words are used to communicate meaning (Hill, Ford & Farreras, 2015). Creating a computer program that is capable of interacting with a person at a human level requires the machine to understand human behavior. One of the most important things in a conversation is expressing and understanding emotions and affect (Picard & Picard, 1997; Salovey & Mayer, 1990).
The possibility to test humanized machines was proposed by Alan Turing in his “Turing Test” (Turing, 1950). This test is based on a conversation between a computer and a human judge, and on the ability of a computer program to impersonate a human, with the judge not being able to distinguish between a computer and a human being. One of the first chatbots subjected to the Turing Test was ELIZA, which was created at the Massachusetts Institute of Technology by Weizenbaum. ELIZA is a chatbot that emulates a psychotherapist (Weizenbaum, 1966). After ELIZA, many other chatbots were created for different purposes. Still, none of these chatbots has passed the Turing Test (Saygin, Cicekli & Akman, 2000; Warwick & Shah, 2016).
Emotion and classifying emotions
As described above, emotion is an important factor in humanizing computers. Classifying customers’ emotions is important for companies because emotion affects customer loyalty and satisfaction (DeWitt, Nguyen & Marshall, 2008; Varela-Neira, Vázquez-Casielles & Iglesias-Argüelles, 2008; Yu, White & Xu, 2007). To measure customer satisfaction and loyalty a company can use the Net Promoter Score (NPS). Shaw (2016) introduced a new indicator for emotional value, the Net Emotion Value (NEV), which measures the emotional value towards a company. His work shows that the higher the NEV (positive emotion), the higher the NPS; emotion thus has an effect on the NPS.
Extracting emotion from text in a chat environment is different from extracting emotion from face-to-face interactions between humans. Emotion extraction from text lacks facial expressions, voice intonation and body language (Vaassen, 2014), which complicates the task. Another difference lies in the communication styles in human-chatbot versus human-human chat conversations. Users tend to be more agreeable, open, extrovert, conscientious and self-disclosing when interacting with a human. When humans knowingly interact with a chatbot, they report lower perceived attractiveness, are less goal-driven and use more brutal language than in chats with humans (Mou & Xu, 2017). Hill and others (2015) found that human-chatbot communication, compared to human-human communication, contains more messages, shorter messages, a more limited vocabulary and greater use of profanity. These differences in interaction should be taken into account when classifying emotion from chatbot texts.
Interpersonal interaction
To achieve emotion recognition in a chat conversation, a real-time automatic emotion analysis is needed. In 2010, Vaassen and Daelemans introduced the automatic classification of text according to a framework for interpersonal communication (Vaassen & Daelemans, 2010; Vaassen & Daelemans, 2011; Vaassen, Wauters, van Broeckhoven, van Overveldt, Daelemans & Eneman, 2012; Vaassen, 2014). This approach focuses not only on emotion classification but also on the interaction between a human and a chatbot, which is very helpful.
Several frameworks for interpersonal communication have been developed over the past years (Gurtman, 2009). The first interpersonal communication model was created by the Kaiser Research Group under the name interpersonal circle, better known as “Leary’s Rose” (Leary, 1957). This framework distinguishes two roles, a speaker and a listener. These roles change during the conversation: when someone speaks, he is the speaker, and when someone listens, he is the listener. The graphical representation of Leary’s Rose is a circle that is split vertically into the “I” and “We” sides and horizontally into the “Dominance” and “Submission” sides. The vertical axis determines whether the speaker is dominant or submissive towards the listener. The horizontal axis determines the speaker’s willingness to co-operate. These partitions create four quadrants: “Lead”, “Follow”, “Defend” and “Attack.” Each quadrant can again be divided into two octants, which creates eight octants in total (Figure 1).
One characteristic makes the interpersonal circumplex particularly interesting for interpersonal communication: the circumplex can predict to some extent what the position of the listener will be when he reacts to the speaker (Figure 2) (Dijk, 2013; Dijk & Cremers, 2007; Leary, 1957; Remmerswaal, 2011). “Dominance” will trigger a complementary response, namely “Submission”, and vice versa. “We” or “I” behavior will trigger a similar response: “We” behavior triggers “We” behavior and “I” behavior triggers “I” behavior. The speaker can thus influence the behavior and emotions of the listener by his own conversational actions (Dijk, 2013; Dijk & Cremers, 2007; Leary, 1957; Remmerswaal, 2011). For example, if a colleague is angry at you, he will attack you from the “Dominance/I” quadrant and you will defend yourself from the “Submissive/I” quadrant. In this example “Dominance” triggers a complementary response (“Submissive”) and “I” behavior triggers “I” behavior.
Automatic detection of interpersonal communication
Leary’s Rose was used by Vaassen and Daelemans (2010; 2011) in a serious gaming project named “deLearyous.” The deLearyous project aimed at developing a game in which users can improve their communication skills by interacting with a virtual character. In order to apply this framework, they gathered data from a series of experiments in which the virtual agent was replaced with a human actor (Wizard of Oz setting) (Vaassen & Daelemans, 2010; Vaassen, 2014). After gathering the data, they transcribed, analyzed and annotated the obtained data. Vaassen and Daelemans used several machine learning algorithms to reach 52.5% of correctly classified sentences based on the four quadrants “Lead”, “Follow”, “Defend” and “Attack.” This is higher than the random baseline of 25.15%, which is slightly above 25% because of imbalances in the class distribution, and is a significant improvement (Vaassen & Daelemans, 2010). In a subsequent study they used another classifier and reached F-scores (a measure of classifier accuracy) up to 51% based on the four quadrants and 31% based on the eight octants. This is again a significant improvement over the random baseline (25.4% for the four quadrants, 13.1% for the eight octants). In conclusion, it is possible to beat the random baseline (Vaassen, 2014; Vaassen & Daelemans, 2010; Vaassen & Daelemans, 2011; Vaassen et al., 2012), but Vaassen and Daelemans state that it is extremely difficult to reach an acceptable performance for practical use. The main problem is that human annotators experience difficulty with scaling sentences on Leary’s Rose and will not always agree on the correct quadrant or octant for a sentence. Other problems are the small size of the corpora and noisy datasets due to the annotation problem (Vaassen & Daelemans, 2010; Vaassen & Daelemans, 2011; Vaassen et al., 2012; Vaassen, 2014).

Figure 2. Response prediction according to Leary’s Rose. Dominance will trigger a submissive response; I behavior triggers I behavior and We behavior triggers We behavior.
Another study on the automatic detection of interpersonal communication is from Keshtkar and others (2014). This study used Leary’s Rose to detect the personality of players in the Land Science game (an educational game). They also used machine learning algorithms and concluded that text categorization based on n-grams reached the highest scores, but that a combination method, such as the Linguistic Inquiry and Word Count (LIWC) and subjective lexicons along with n-gram features, can achieve better performance. Keshtkar and others (2014) also report disagreement between the human annotators in their study. In this case the bottleneck is again the categorization by human annotators.
Linguistic Inquiry and Word Count (LIWC)
Keshtkar and others (2014) used the LIWC to divide words into psychologically meaningful categories. First, the program counts the words in a sentence or text; then the LIWC divides the category counts by the total word count and presents the percentage of words in each category. The Dutch version of the 2007 LIWC contains 11,091 words in 66 categories and gives results equivalent to the English LIWC (Boot, Zijlstra & Geenen, 2017).
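As a minimal sketch of this count-and-divide procedure (with a hypothetical toy dictionary; the real LIWC word lists are not reproduced here):

    from collections import Counter

    # Toy two-category dictionary. The real Dutch LIWC 2007 dictionary
    # contains 11,091 words in 66 categories.
    DICTIONARY = {
        "positive emotion": {"blij", "fijn", "goed"},
        "negative emotion": {"boos", "slecht", "stom"},
    }

    def liwc_percentages(sentence):
        """Return the percentage of words per category, LIWC-style."""
        words = sentence.lower().split()
        counts = Counter()
        for word in words:
            for category, vocabulary in DICTIONARY.items():
                if word in vocabulary:
                    counts[category] += 1
        # Divide each category count by the total word count.
        return {cat: 100 * counts[cat] / len(words) for cat in DICTIONARY}

    print(liwc_percentages("de chatbot is stom"))
    # {'positive emotion': 0.0, 'negative emotion': 25.0}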
Current study
The goal of this study is to scale relatively short text input, in a chatbot environment, on Leary’s Rose. Furthermore, in phase two a new annotation process, based on self-annotation, is examined. In phase three we try to find a correlation between the NPS and Leary’s Rose and examine whether we can predict the NPS based on Leary’s Rose.
The study consists of three phases (Figure 3). The first phase is a text analysis whereby participants annotate sentences from a chat conversation on Leary’s Rose. These annotated sentences are the ground truth for the text categorization by classifier A. The classifier was obtained by a logistic regression on the sentences with the participant scores. The first hypothesis: we expect to find a higher overall accuracy than the baseline. This expectation arises from the training on the dataset, which should improve the classifier in such a way that it categorizes sentences better than the baseline.
Figure 3. A systematic flowchart of the phases in this study.
The second phase focuses on testing the classifier from phase one (classifier A). Based on scenarios, another group of participants chatted with the chatbot and, after the conversation, annotated their own typed sentences. These self-annotated sentences will be the ground truth. The sentence position on Leary’s Rose obtained from classifier A will be compared with this ground truth position. The second hypothesis: as in phase one, we expect the overall accuracy to be higher than the baseline, again because the training on the dataset should make the classifier categorize sentences better than the baseline. In the second phase we will also create a new classifier (classifier B), with a logistic regression, based on the sentences annotated by the participants of phase two. In phase three we compare classifier A with classifier B.
In the last phase, phase three, we predict the NPS with the scores on Leary’s Rose, to measure the effectiveness of the classifiers and to determine whether there is a correlation between the mean scores on Leary’s Rose and the NPS per chat. The third hypothesis: we expect to find a correlation between a high NPS and “Submissive/We” scores on Leary’s Rose, and between a low NPS and “Dominance/I” scores. This expectation is based on the findings of Shaw (2016): the higher the NEV (positive emotion), the higher the NPS.
As described above, we will compare the overall accuracy of classifier A with the overall accuracy of classifier B, obtained from correctly scored NPS sentences. The fourth hypothesis: we expect a larger improvement in overall accuracy for classifier B from phase two than for classifier A from phase one. This expectation is based on the fact that classifier B is obtained from sentences that were typed and annotated by the same participant.
The methodology, results and a short discussion of the three phases are presented separately, to keep the structure well-ordered and as simple as possible. In the next sections the methods, results and discussion of each of the three phases are presented. In the final section a general discussion follows and we offer some ideas for future research.
Phase one
Methods
Design. The first phase was a text analysis in which participants scored sentences on Leary’s Rose. Every participant scored 11 random opening sentences and 6 random dialogs. The participant scores determined a sentence’s place on Leary’s Rose. The scores were obtained from two scales: the I (against) versus We (together) scale and the Dominance versus Submissive scale, both measured on a ratio scale from -100 to 100. These scores differed per participant, because their scores on a sentence were subject to moods, emotions and personality. To counterbalance this phenomenon, we asked as many different participants as possible, which leveled out the differences in scores. We also used some control variables (gender, age and education level) to determine the generalizability. After the text analysis by the participants we created, through the LIWC, a classifier (classifier A) to categorize sentences on Leary’s Rose by a formula.
Participants. For the text categorization, 101 participants took part in our study. Each respondent scored only a part of the sentences, to keep the task short and easy. At least two annotators per sentence were needed to compare the annotation scores, and we reached on average five participants per sentence. This procedure differed from other studies on annotating sentences: most studies used up to four participants, who scored all the sentences and were experienced in the practice of Leary’s Rose. Scoring sentences on Leary’s Rose with inexperienced participants had not been done before, but annotation of emotions without training was done by Aman and Szpakowicz (2007). In addition, this was a quick and cost-saving alternative to the text categorization method of other studies.
The participants were recruited via social media (Facebook) and e-mail and met the following criteria: familiar with a computer and the internet, Dutch speaking (first language) and in the age range of 18 to 65. These criteria decreased the chance of mistakes caused by not understanding the task.
Participants who did not finish the task were deleted, and if a participant’s answers differed significantly, the answers were checked manually. Participants who finished their tasks were rewarded with the chance of winning one of five 20 euro gift cards from the online retailer.
All tasks were reviewed by the psychology ethics board of Leiden University and the study complies with all applicable laws and guidelines.
Procedure. The data for the first phase was collected via an online survey. The survey started with an informed consent to inform the participants about their rights (Appendix A), the criteria and the general procedure. After accepting the informed consent, the participant received an explanation of how to score the sentences, followed by an example for further clarification. This example first showed the sentence, then a slider for the Dominance – Submissive scale, followed by a slider for the I – We scale, and ended with an optional question: “Based on which words did you scale the sentence?” This last question gave us insight into the criteria the participants used.
The first task consisted of 11 opening sentences, which were randomly assigned to the participants; after completing these 11 sentences the participant received information about the second task. The second task was similar to the first, but instead of opening sentences the participant received six dialogs between a user and the chatbot. These dialogs contained a few sentences, to give the participant the feeling of reading a real conversation and thereby improve the sentence scores. Every sentence in a dialog was scored separately, in the same way as in the first task: first the sentence, then the two sliders, followed by the “Based on which words did you scale the sentence?” question.
After completing the second task the participant received another questionnaire with some control variables (gender, age and education level) and a few general questions (whether there were any uncertainties and whether they liked the tasks). Finally, the participant could fill in an e-mail address for the gift card draw, followed by the debriefing (Appendix B). Completing all tasks took about 15 minutes.
Apparatus. For the first phase we compiled a dataset of 250 real chatbot conversations on the subject ‘Where is my package from an external seller?’ (from the chatbot of the online retailer). These chat conversations took place between 1 December 2016 and 28 February 2017. After reviewing these 250 conversations we excluded 8 conversations because the users were not serious about the topic and were testing the chatbot with strange sentences. We then separated the opening sentences from the dialogs, because an opening sentence is self-contained and usually contains a lot of information. Not every dialog was usable, because of closed-ended questions from the chatbot; however, a considerable number of dialogs contained a lot of information, and the development of a dialog was very useful (Vaassen & Daelemans, 2011). Based on these findings we used both opening sentences and dialogs: 203 opening sentences and 110 dialogs.
The participants were presented, in a Facebook advertisement or an e-mail, a link to a Qualtrics questionnaire. Qualtrics is an online tool for questionnaires that can be filled in on a computer, tablet or smartphone with internet access. We recommended using a computer because that made filling in the questionnaire easier. The questionnaire was pre-tested by two participants to check time spent and understandability.
To analyze the data we used IBM SPSS 23. To classify words per quadrant we used the LIWC. The LIWC counts the words in a sentence or text and divides them into psychologically meaningful categories. It then divides the number of words in a category by the total word count and presents the percentage of words per sentence in each category. The first LIWC application was developed as part of a study of language and disclosure (Pennebaker, 1993; Tausczik & Pennebaker, 2010) and the LIWC has over the years been translated into several languages. The 2001 LIWC dictionary was the first to be translated into Dutch (Zijlstra, Meerveld, Middendorp, Pennebaker & Geenen, 2004) and Boot, Zijlstra and Geenen (2017) translated the 2007 version. The Dutch version of the 2007 LIWC contains 11,091 words in 66 categories and gives results equivalent to the English LIWC, except for a small number of categories; this is because of differences in word use, homonyms, or a less suitable test corpus for some of the categories (Boot et al., 2017).
Analysis. The obtained data was inserted into IBM SPSS 23 and checked for outliers. The participants’ answers were checked on task completion, total time (a boxplot on task duration), descriptives (mean, minimum and maximum), and with a frequency analysis on gender, age, education level and general thoughts about the questionnaire. When a participant’s data differed from the others, the answers were checked manually. When a participant did not complete the task or gave strange answers, their data was deleted and the participant was excluded from the study.
The usable data in SPSS was transposed to analyze the data by sentence rather than by participant. We then computed the mean, standard deviation, minimum and maximum of the annotated sentence scores, as well as a count of respondents per sentence. After these calculations we computed the nominal scores for Leary’s Rose: the ratio scores on the I (< 0) versus We (> 0) scale and the Dominance (> 0) versus Submissive (< 0) scale were transformed into one of the four quadrants. For instance, a sentence with a score of 64 on the I & We scale and a score of -67 on the Dominance & Submissive scale was transformed into the quadrant “Submissive/We.”
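This transformation can be expressed as a small helper function (a sketch; the function name and the treatment of exact zero scores are our assumptions):

    def to_quadrant(i_we, dom_sub):
        """Map mean scale scores (-100 to 100) to a Leary's Rose quadrant.

        i_we: negative = I, positive = We.
        dom_sub: positive = Dominance, negative = Submissive.
        """
        side = "We" if i_we > 0 else "I"
        pole = "Dominance" if dom_sub > 0 else "Submissive"
        return pole + "/" + side

    # The example from the text: 64 on I & We, -67 on Dominance & Submissive.
    print(to_quadrant(64, -67))  # Submissive/We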
Before we inserted the sentences with the participant scores into the LIWC, we created an inclusion criterion based on the mean and standard deviation of the raw scores. The selected sentences had an absolute mean score of 15 or higher and a standard deviation lower than 50 on both scales (I & We and Dominance & Submissive). Sentences that did not meet the inclusion criteria were deleted. This approach was used to separate the strongly scored sentences from the weak ones in order to train the classifier on a stronger database.
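A sketch of this inclusion filter, under our reading of the criteria (an absolute mean of at least 15 and a standard deviation below 50 on both scales; the dictionary keys are hypothetical):

    def meets_inclusion_criteria(stats):
        """Keep only strongly and consistently scored sentences.

        `stats` is assumed to hold the mean and standard deviation of the
        participant annotations per scale for one sentence.
        """
        return (abs(stats["mean_i_we"]) >= 15
                and abs(stats["mean_dom_sub"]) >= 15
                and stats["sd_i_we"] < 50
                and stats["sd_dom_sub"] < 50)

    # sentence_stats: a list of such dicts, one per sentence (hypothetical).
    # selected = [s for s in sentence_stats if meets_inclusion_criteria(s)]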
For each sentence, the scores resulting from the participant annotation were inserted per quadrant into the LIWC, which categorized the sentences into its categories. With a t-test on the percentage of words per quadrant we could find differences between the quadrants. For example, someone typed the sentence “You are stupid” to the chatbot, which has a “Dominance/I” ground truth. The three words were categorized by the LIWC, with “stupid” falling into the category “negative emotion.” After categorizing all the sentences, the LIWC category “negative emotion” turned out to be significantly more common in the “Dominance/I” sentences; this category was therefore used as a predictor for the “Dominance/I” quadrant. These measurements were inserted into a new SPSS datasheet to create the classifier.
Finally, a backward logistic regression was performed on the LIWC output to determine the best predictor word groups for the quadrants of Leary’s Rose. Through this binary logistic regression analysis a classifier (classifier A) was created to rate new sentences on Leary’s Rose. After completing the classifier, the cut-off point was optimized through a ROC curve on the quadrant scores and p-values (Tosteson & Begg, 1988). Optimizing the cut-off point counterbalanced the skew of the classifier and therefore gave a higher accuracy. This skew arose due to imbalances in the number of sentences per quadrant: the classifier is biased towards the quadrant with the most sentences. This classifier was used in phase two to automatically categorize sentences on Leary’s Rose.
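The cut-off optimization was performed in SPSS; a comparable computation in Python, assuming scikit-learn and Youden’s J as the selection criterion (our choice for this sketch, not necessarily the criterion used here), could look like this:

    import numpy as np
    from sklearn.metrics import roc_curve

    def optimal_cutoff(y_true, y_prob):
        """Pick the probability cut-off that maximizes Youden's J (tpr - fpr).

        y_true: 0/1 quadrant labels; y_prob: probabilities from the
        logistic regression.
        """
        fpr, tpr, thresholds = roc_curve(y_true, y_prob)
        return thresholds[np.argmax(tpr - fpr)]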
Results
Participants. The questionnaire was finished by 101 participants, 38 men and 63 women. Sixty-seven percent of the participants were between 18 and 29 years old and more than 86 percent had an education level of HBO (higher vocational education) or higher. We examined all cases on time, descriptives and missing data. More than half of the people who started did not finish the task; these were deleted. Based on time spent, three participants were outliers; these were checked manually and turned out to have filled in the questionnaire normally. We found no outliers in the other data. Besides task completion, we did not need to remove participants.
Sentence selection. The participants scored 537 unique sentences. One sentence did not belong to a chatbot conversation (our mistake) and was deleted. After the deletion, we transposed the data in SPSS and calculated the mean, standard deviation, minimum and maximum score for every sentence on both scales. We used the inclusion criteria to select the most important sentences: the standard deviation should be lower than 50, the absolute mean score 15 or higher, and every sentence should be scored by two or more annotators. With these inclusion criteria we selected 212 sentences (Table 1). Because of the low number of “Submissive/I” (3.3%) and “Dominance/We” (0.9%) sentences we decided to use only the “Dominance/I” (74.5%) and “Submissive/We” (21.2%) quadrants. This resulted in 203 sentences.
Baseline. The baseline for this two-class classification is 77.8 percent. This is higher than a 50% baseline due to imbalances in the class distribution. The baseline is based on the category with the most sentences, in this case the “Dominance/I” category, which has 158 sentences in a database of 203 sentences: 158 divided by 203, multiplied by 100, is 77.8%. Classifier A should place more than 77.8% of the sentences in the correct quadrant of Leary’s Rose to perform better than the random baseline.
LIWC. The 203 sentences with their nominal Leary’s Rose scores were inserted into the LIWC. Every sentence was categorized separately, based on the LIWC dictionary, and the category scores were inserted into SPSS. An independent-samples t-test determined which categorizations differed significantly between the two quadrants. First we checked whether the data fits the assumptions of an independent-samples t-test: (1) independence: all observations should be independent, which is the case here; (2) normality: the data must follow a normal distribution in the population when the samples are smaller than 25 units; our samples are larger than 25 units, so we do not violate this assumption; (3) homogeneity: the standard deviation should be fairly equal in both populations; in this study some of the variables violated this assumption, but we proceeded with the t-test because of the large number of units.
The following categories scored significant or almost significant (Table 2): dictionary cover, we, past, number, affect, positive emotion, negative emotion, cause, relative, time, work and assent.
Table 1
Sentence selection after inclusion

Quadrant   Number of cases   Percentage
Dom/I      158               74.5
Dom/We     2                 0.9
Sub/I      7                 3.3
Sub/We     45                21.2
Total      212               100
Logistic Regression. To obtain a classifier we conducted a binary logistic regression analysis, which produces a formula; this formula is classifier A. The LIWC categories described above were used in this logistic regression. The quadrant scores are the outcome variable and the LIWC categories are the independent variables, which should not correlate highly with each other. Another assumption is that the number of covariates (the LIWC categories) should be as low as possible without decreasing the overall percentage of correctly categorized sentences.
With a backward selection procedure, we removed the least significant category in each new logistic regression analysis until the model fitted the rules described above. After the backward selection procedure we reached an overall score of 80.8 percent, based on the “we”, “positive emotion” and “relative” variables. All variables contribute significantly to the classification formula and the Omnibus Test of Model Coefficients is also significant. This means that our new model (80.8%) is better than the baseline (77.8%) by three percent.
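The backward selection was performed in SPSS; a sketch of the same procedure with statsmodels, assuming the stopping rule “all remaining predictors significant at .05” (our reading) and a pandas DataFrame of LIWC percentages:

    import statsmodels.api as sm

    def backward_select(X, y, alpha=0.05):
        """Backward selection for a binary logistic regression.

        X: DataFrame of LIWC category percentages; y: 0/1 quadrant labels.
        Repeatedly drops the least significant category until every
        remaining predictor is significant at `alpha`.
        """
        columns = list(X.columns)
        while columns:
            model = sm.Logit(y, sm.add_constant(X[columns])).fit(disp=0)
            p_values = model.pvalues.drop("const")
            worst = p_values.idxmax()
            if p_values[worst] <= alpha:
                return model, columns
            columns.remove(worst)  # drop the least significant category
        return None, []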
To achieve a higher overall percentage of correctly categorized sentences, we optimized the cut value of the binary logistic regression. Classifier A reached its highest score with a cut value of .65, found via the ROC curve. The percentage of correctly classified sentences increased to 82.3 percent, an increase of 4.5 percent. The classifier (classifier A) created by this binary logistic regression is:

P = e^x / (1 + e^x), with x = 1.027 - .368 * we - .029 * positive emotion + .038 * relativity

Table 2
t-tests on LIWC categories per quadrant

Category            Sub/We (N = 45)     Dom/I (N = 158)     t       df       p
                    M        SD         M        SD
Dictionary cover    89.27    12.66      81.98    21.67      2.85    123.84   .005**
We                  0.54     1.84       0.09     0.69       2.56    201      .011*
Past                1.10     2.93       2.65     5.91       -2.42   149.18   .017*
Number              0.38     2.16       2.18     6.15       -3.07   193.03   .002**
Affect              13.98    26.94      4.28     15.12      2.32    52.14    .025*
Positive emotion    13.54    27.05      2.00     11.60      2.79    48.69    .008**
Negative emotion    0.44     1.72       2.19     9.91       -2.10   184.70   .037*
Cause               0.20     1.36       1.92     6.37       -3.15   193.44   .002**
Relative            9.13     10.19      17.45    17.44      -4.05   123.97   .000***
Time                3.62     6.92       10.34    14.82      -4.29   158.27   .000***
Work                0.29     1.12       2.90     10.06      -3.19   169.72   .002**
Assent              9.18     20.85      3.34     14.86      1.76    57.31    .084

Sub/We = Submissive/We, Dom/I = Dominance/I, N = number of cases, M = mean, SD = standard deviation, df = degrees of freedom. * p < .05, ** p < .01, *** p < .001.
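To illustrate how classifier A can be applied in code, the sketch below implements the formula together with the .65 cut value. The quadrant orientation (high P means “Dominance/I”) is our reading of the coefficient signs, since “we” and “positive emotion”, both “Submissive/We” markers in Table 2, lower P:

    import math

    def classifier_a(we, positive_emotion, relativity):
        """Score a sentence with classifier A from its LIWC percentages."""
        x = 1.027 - 0.368 * we - 0.029 * positive_emotion + 0.038 * relativity
        p = math.exp(x) / (1 + math.exp(x))  # logistic function
        # Optimized cut value found with the ROC curve; the orientation
        # of the two quadrants is our assumption (see the lead-in above).
        return "Dominance/I" if p >= 0.65 else "Submissive/We"

    print(classifier_a(we=0.0, positive_emotion=0.0, relativity=20.0))
    # Dominance/I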
Discussion
In the first phase we conducted a text analysis to scale sentences from a chat conversation on Leary’s Rose. Participants annotated sentences, and these annotations were the ground truth for training classifier A. The hypothesis was: “We expect to find a higher overall accuracy than the baseline.” The conclusion for this hypothesis is that it is possible to increase the accuracy of the categorization on Leary’s Rose; the improvement is 4.5 percent.
First of all, it is noteworthy that, after close examination, our chatbot conversations were mostly found in two quadrants of Leary’s Rose. This could be due to the types of conversation between the chatbot and the user. The first type is the “it goes well / I got my answer” conversation: the user is happy with the answer from the chatbot and the problem is solved. The second type is the “this is not what I want” conversation: the chatbot does not understand the user, or the user is unhappy with the answer given by the chatbot. Conversations of the first type belong to the “Submissive/We” quadrant and conversations of the second type to the “Dominance/I” quadrant. The other quadrants do not seem to fit chatbot conversations on the subject “Where is my package from an external seller?” Another explanation is the difference in communication styles between human-human and human-chatbot conversations (Hill et al., 2015; Mou & Xu, 2017).
If we compare our results with the studies of Vaassen and Daelemans (2010; 2011) and Vaassen et al. (2012), we find similar results. Vaassen and Daelemans also found an improvement in categorizing sentences with a classifier, and their improvement was even larger than ours. Our smaller classification improvement may be due to the shorter text input and a different approach. For our research we used the chatbot of an online retailer, which was a rule-based chatbot; partly for this reason we also chose a rule-based bag-of-words approach for our classification framework rather than a machine learning approach such as that of Vaassen and Daelemans (2010; 2011) and Vaassen et al. (2012). In addition, Vaassen and Daelemans used a training set of more than 1000 sentences, which is bigger than ours, and they used annotators who were trained in the usage of Leary’s Rose.
Another important factor in the slight improvement is the annotation problem described by Vaassen (2014). According to Vaassen (2014) the problem starts with the data collection and manual annotation: human annotators often disagree about the position on Leary’s Rose, which results in small, noisy and low-agreement datasets. This challenge was also noticeable in our annotation process and is reflected in the high standard deviations of the scores.
A possible solution to this problem is a follow-up study in which the sentences are annotated by the writers of those sentences. Participants should be better at annotating their own sentences, because they know the purpose and meaning of their sentences. This solution was tested in phase two.
Phase two
Methods
Design. The second phase consisted of a control part and a follow-up part. With the control part we tested the validity of classifier A and with the follow-up part we created a new classifier (classifier B). In the control part, the participants were asked to hold two different conversations with the chatbot, based on two scenarios: a “Dominance/I” scenario and a “Submissive/We” scenario (Appendix C). In the follow-up part these conversations were annotated by the same participants, who rated their own sentences on one ratio scale, “Dominance/I” versus “Submissive/We.” The sentences annotated by the participants and the predefined nominal scenario scores (“Dominance/I” or “Submissive/We”) were the ground truth for testing classifier A. After testing classifier A, we used the annotated sentences, as ground truth, to create a new classifier (classifier B).
The annotated sentence scores differed per participant even though every participant received the same two scenarios. This supported external validity, because every participant’s scores were subject to their own emotions and personality. To obtain a well-balanced mean score, we asked different respondents, which leveled out the differences in scores. Participants could also have learned from the first scenario; to counterbalance this issue we randomized the order of the scenarios.
Finally, we checked all data on time spent, task completion and descriptives (mean, minimum and maximum of the annotated sentence scores). Odd answers were checked manually and removed as outliers where necessary.
Participants. We recruited 29 participants, through Facebook and e-mail, as in the first phase. The criteria for the participants were: familiar with a computer and the internet, an age range of 18 to 65, Dutch speaking (first language) and no participation in the first phase. Participants were also excluded when they did not finish the questionnaire or the follow-up questionnaire, and all answers were checked manually for odd responses. The reward for participating in this study was the chance to win a 20 euro gift card from the online retailer.
None of the earlier studies on text analysis and Leary’s Rose used participants to validate their own outcomes. However, using 20 or more participants is in line with the studies of Settanni and Marengo (2015) and Georgaca and Avdi (2011). Settanni and Marengo (2015) used 20 participants to analyze emotions in Facebook posts, and based on this study we also aimed for 20 participants. Georgaca and Avdi (2011) confirmed the number of 20 or more participants for validating outcomes in their guide “Discourse Analysis.”
This phase was reviewed by the psychology ethics board of Leiden University and the study complies with all applicable laws and guidelines.
Procedure. The chatbot was used to hold conversations with the participants. These conversations followed two scenarios on the subject ‘Where is my package from an external seller?’: one scenario focused on the “Dominance/I” quadrant and the other on the “Submissive/We” quadrant (see Appendix C). The participants were asked to empathize with the tasks in the scenario. Imitating behavior and emotions has proven possible in a variety of tests (Grubb & McDaniel, 2007; Keen, 2006; McFarland, Ryan & Ellis, 2002) and based on these studies we expected that the participants could act out the intended behavior.
The second phase started with data collection via an online questionnaire, reachable through a link on Facebook or in an e-mail. The questionnaire started with an informed consent (Appendix D) to inform the participants about the criteria, the general procedure and their rights. The informed consent was followed by more information about the task: some basic information about chatbots and a more extensive explanation of the procedure. Then the participant received one of the two scenarios, in randomized order. After reading the scenario the participant started the conversation through a link. The chatbot finished the conversation when the participant had asked everything they needed to know. During the chatbot conversation the participant was asked to fill in their e-mail address. E-mail addresses were used to connect the questionnaire data with the chatbot data, to send the follow-up questionnaire and to assign the gift card to the winner.
When the conversation was finished the participant received the other scenario and the procedure was repeated as described above. After the second conversation the participant was navigated back to the questionnaire, where they answered some demographic questions (gender, age and education level) and a few general questions (whether there were any uncertainties and whether they liked the tasks). Finally, the participant filled in their e-mail address, followed by the debriefing (Appendix E). The total time to complete the whole task was about 15 minutes.
In the second part of phase two the respondents were asked to fill in a follow-up questionnaire, which was sent to their e-mail address. In this questionnaire they annotated their own sentences on a Dominance/I – Submissive/We scale. Participants could only take part in this follow-up when the first questionnaire had been filled in completely.
The questionnaire started with the informed consent (Appendix F), followed by information about the task. This task was almost the same as in phase one; the only difference was that there was now a single scale (Dominance/I – Submissive/We), ranging from -100 to 100. The instructions were otherwise exactly the same as in phase one. After annotating their own sentences the participant was asked to fill in a few general questions (whether there were any uncertainties and whether they liked the tasks) and to give their e-mail address. Finally, the participant could read the debriefing (Appendix G). This questionnaire took about 5 minutes.
Apparatus. The questionnaire was an online Qualtrics questionnaire. Qualtrics is an online tool for questionnaires that can be filled in on a computer, tablet or smartphone with internet access. We recommended using a computer because that made filling in the questionnaire easier. The first task of phase two was pre-tested by one participant to check time spent and understandability. The second task was not pre-tested because it was almost the same as the task in phase one.
Analysis. In phase one we trained classifier A; this classifier was tested in the current phase. The test set contained sentences self-annotated by participants. The sentences written by the participants were scored by classifier A and compared with the ground truth. The ground truth was the predefined scenario (Dominance/I or Submissive/We) or the self-annotated scores from the participants obtained in the follow-up study; to clarify, the scores of classifier A were compared with both the predefined scenarios and the self-annotated scores.
First, we checked all participant data on task completion, time spent and some descriptives (mean, minimum and maximum). If a participant differed from the other participants, their answers were checked manually and deleted when needed.
Second, we wrote two Python programs (Appendix H) to analyze the sentences with the formula created in phase one. The first program applied classifier A without a range; the second applied classifier A with a range. A range could improve the accuracy of the classifier because close to the cut-off point there may be a mix of “Dominance/I” and “Submissive/We” sentences (Jones, 2016; Lord, 1961). For example, with a cut-off point of .65 we can add a range from .60 to .70, meaning that scores between .60 and .70 are not classified, in order to avoid mismatches.
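A sketch of classification with such a ranged cut-off, where scores inside the range are left unclassified (the function name and the quadrant orientation are our assumptions):

    def classify_with_range(p, low=0.60, high=0.70):
        """Classify a probability, abstaining inside the uncertainty range.

        Scores within [low, high] are not classified, mirroring the
        example range around the .65 cut-off point.
        """
        if low <= p <= high:
            return None  # too close to the cut-off: do not classify
        return "Dominance/I" if p > high else "Submissive/We"

    print(classify_with_range(0.64))  # None: inside the range
    print(classify_with_range(0.82))  # Dominance/I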
The data from Python, the two scenarios and the follow-up questionnaire were inserted into SPSS 23. Outliers were checked manually and all sentences with missing participant scores were deleted. Sentences with a zero score from classifier A were also deleted, since these sentences had no matching words in the wordlist. Thereafter we computed the frequencies of the sentences as scored by classifier A and by the participants. We then computed which sentences were scored identically by classifier A, the scenarios and the participant, followed by another frequency analysis on the participant sentence scores after selecting only the identically scored sentences. Based on the frequency tables we computed the percentage of sentences correctly scored by the classifier. This analysis was repeated with the data from the Python program with ranges.
In the second part of phase two we created a new formula (classifier B) by repeating part of the analysis from phase one. The sentences from the follow-up study were analyzed with the LIWC and from these results we created classifier B with a logistic regression analysis. A ROC curve determined the cut-off point and range. After this analysis we had a new classifier for phase three.
Results
Participants. Twenty-nine participants took part in the questionnaire of phase two. We examined the 29 cases on time and missing data. Fifteen cases had to be removed because the participants did not finish the questionnaire or the follow-up study. From the remaining 14 cases we did not have to remove any participants. Five of the participants were men and nine were women. Except for two, all participants were between 18 and 29 years old. Eleven participants had an education level of HBO (higher vocational education) or higher; the other participants’ education level was lower than HBO.
Sentence selection. The conversations with the chatbot yielded 454 sentences (Table 3): 235 sentences from the “Submissive/We” scenario and 219 from the “Dominance/I” scenario. In the follow-up study, 145 sentences were scored on Leary’s Rose by the participants. The sentences that were not scaled were short ones containing answers like “yes” or “no.” Comparing the predefined scenario quadrant scores with the participant quadrant scores, almost 30% did not match. After a closer review we decided to use only the follow-up quadrant scores, because many participants scored sentences in the “Submissive/We” scenario as “Dominance/I,” which affected the reliability of the predefined scenarios (Table 4). This decision is based on a two-sided Fisher’s exact test (p < .001), which indicates that the distributions differ. We chose Fisher’s test because two cells had an expected count of less than five.
The 309 sentences that were not scored in the follow-up study were deleted. Of the remaining 145 sentences, 57 were scored “Submissive/We,” 81 “Dominance/I,” and 7 neither “Submissive/We” nor “Dominance/I”; these 7 sentences were scored as zero on both scales (Table 4).
Table 3
A crosstab of the sentence selection by predefined scenarios and follow-up

                        Follow-up
Scenario         None    Sub/We   Dom/I   Not scaled   Total
Sub/We      N    3       46       24      162          235
            %    0.7%    10.1%    5.3%    35.7%        51.8%
Dom/I       N    4       11       57      147          219
            %    0.9%    2.4%     12.6%   32.4%        48.2%
Total       N    7       57       81      309          454
            %    1.5%    12.6%    17.8%   68.1%        100%
Table 5 gives an overview of the sentences from the follow-up study and the sentences classified by classifier A, after deletion of the not scaled sentences. The 145 follow-up sentences from the participants were inserted into a Python program to score them with classifier A. The formula classified 105 sentences; the 40 unclassified sentences had no words in any of the dictionaries and were deleted, since we focused only on classified sentences. These 105 sentences were the basis for the control part and the improvement part. Table 6 displays the sentence selection after deletion of the sentences not scaled by the classifier.
Baseline control study. The statistical baseline for this classification problem is 59.0 percent. The baseline is based on the category with the most sentences (Table 6), in this case the “Dominance/I” category (59.0%). This is higher than a 50% baseline due to imbalances in the class distribution. Classifier A should place more than 59.0% of the sentences in the correct quadrant of Leary’s Rose to perform better than this baseline.

Table 4
A crosstab of the sentence selection by predefined scenarios and follow-up after deletion of the not scaled sentences

                        Follow-up
Scenario         None    Sub/We   Dom/I   Total
Sub/We      N    3       46       24      73
            %    2.1%    31.7%    16.6%   50.3%
Dom/I       N    4       11       57      72
            %    2.8%    7.6%     39.3%   49.7%
Total       N    7       57       81      145
            %    4.8%    39.3%    55.9%   100%

Table 5
A crosstab: sentence selection and classified sentences after deletion of not scaled sentences

                        Follow-up
Classifier       None    Sub/We   Dom/I   Total
None        N    4       17       19      40
            %    2.8%    11.7%    13.1%   27.6%
Sub/We      N    0       4        4       8
            %    0.0%    2.8%     2.8%    5.5%
Dom/I       N    3       36       58      97
            %    2.1%    24.8%    40.0%   66.9%
Total       N    7       57       81      145
            %    4.8%    39.3%    55.9%   100%
Control study. After analyzing the wrongly scored sentences (Table 6), we argued that the single cut-off point is not precise enough: around the .65 cut-off point there are wrongly classified sentences, so a range could improve the accuracy of the classifier. If we add a range around the cut-off point from .55 to .75, we reach a classification score of 60.8%, an improvement of almost 2%.
A new classification formula. The 145 sentences from the follow-up study were used as ground truth to create a new classification formula (classifier B). Of the 145 sentences, 57 were scored “Submissive/We,” 81 “Dominance/I,” and 7 neither; the 7 sentences with a zero score on both quadrants were removed.
The statistical baseline is 59.1%. This baseline is higher than a 50% baseline because of imbalances in the class distribution. Classifier B should score higher than 59.1% to improve on this classification problem.
All 138 sentences were inserted into the LIWC. Every sentence was categorized separately and the category scores were inserted into SPSS. An independent-samples t-test determined which categorizations differed significantly between the two quadrants. The following categories scored significant or almost significant (Table 7): six letter words, affect, positive emotion, negative emotion, insight, relativity, time and money.
Table 6
A crosstab: sentence selection and classified sentences after deletion of non-scaled sentences

                        Follow-up
Classifier       None    Sub/We   Dom/I   Total
Sub/We      N    0       4        4       8
            %    0.0%    3.8%     3.8%    7.6%
Dom/I       N    3       36       58      97
            %    2.9%    34.4%    55.2%   92.4%
Total       N    3       40       62      105
            %    2.9%    38.1%    59.0%   100%
Logistic regression for a new classifier. The LIWC categories described above were used in a logistic regression. The quadrant scores were the outcome variable and the LIWC categories were the covariates, which should not correlate highly with each other. Another assumption was that the number of covariates should be as low as possible without decreasing the overall percentage of correctly categorized sentences. The regression analysis met all the assumptions.
With a backward selection procedure, we removed the least significant category in each new logistic regression until the model fitted the assumptions described above. After the backward selection procedure, and with a cut-off value of .536, we reached an overall score of 72.3% with the positive emotion, negative emotion, relativity and insight variables in the equation (classifier B1).
One variable, money, increased the overall percentage of the classification by more than 4%. Therefore we decided to create a second classifier (classifier B2) with the variable money in the equation. Classifier B2 reached an overall percentage of 76.6% with a cut-off value of .51.
Table 7
t-tests on LIWC categories per quadrant

Category            Sub/We (N = 56)     Dom/I (N = 81)      t       df       p
                    M        SD         M        SD
Six letter words    21.01    14.63      13.78    15.48      2.77    122.56   .006**
Affect              6.84     13.73      2.40     6.83       2.50    135      .014*
Positive emotion    6.33     13.78      0.36     2.81       3.79    135      .000***
Negative emotion    0.51     2.30       1.74     5.00       -1.72   135      .088
Insight             4.15     7.23       1.59     4.25       2.60    135      .010*
Relativity          11.78    11.84      20.42    15.80      -3.47   135      .001**
Time                5.95     8.41       11.38    13.05      -2.74   135      .007**
Money               2.80     6.17       0.69     2.33       2.80    135      .006**

Sub/We = Submissive/We, Dom/I = Dominance/I, N = number of cases, M = mean, SD = standard deviation, df = degrees of freedom. * p < .05, ** p < .01, *** p < .001.

All variables in both regression analyses were significant and the Omnibus Tests of Model Coefficients were also significant. This means that both classifiers, B1 and B2 (without and with money), are better than the baseline, by 13.2% and 17.5% respectively. The classifiers created by the binary logistic regressions are:
Classifier B1:

P = e^x / (1 + e^x), with x = 0.146 - .126 * positive emotion + .071 * negative emotion + .033 * relativity - .057 * insight

Classifier B2:

P = e^x / (1 + e^x), with x = 0.609 - .136 * positive emotion + .056 * negative emotion + .024 * relativity - .075 * insight - .151 * money

Discussion
In phase two we tested classifier A and created two new classifiers, B1 and B2, by creating two new formulas. The ground truth for testing classifier A could be established with two methods. The first method used the follow-up study, in which the participants who typed the sentences also annotated their own sentences. The second method used the predefined scenarios. These predefined scenarios turned out not to be reliable, based on a Fisher’s exact test: the participants used many “Dominance/I” sentences in the “Submissive/We” scenarios. After a close review we concluded that when the chatbot did not answer as expected, the participant used “Dominance/I” sentences, even in the “Submissive/We” scenarios. Due to the unreliable predefined scenario scores, classifier A was only tested with the self-annotated sentence scores from the participants.
The hypothesis we tested in phase two was: “The overall accuracy of classifier A will be higher than the baseline.” Contrary to our expectations, classifier A showed no improvement in categorizing the sentences in the test part of phase two: the classifier reached the same percentage of correctly scored sentences as the baseline (59%). Notable are the small number of sentences categorized in the “Submissive/We” quadrant and the large number categorized in the “Dominance/I” quadrant, probably due to the training set from phase one that created classifier A. The data in the training set was unbalanced, noisy and small. This is the reason we cannot accept the hypothesis.
We argued that a ranged cut-off point could improve the results of the classifiers (Jones, 2016; Lord, 1961) and that a single cut-off point was possibly not precise enough, because close to the cut-off point there could be a mix of “Dominance/I” and “Submissive/We” sentences. After adding a ranged cut-off point the categorization improved slightly (2%). We therefore conclude that a ranged cut-off point alone is not the way to reach large improvements.
After testing classifier A we created two new classifiers (B1 and B2). Classifier B2 had one extra variable, money. This variable showed a significant improvement (4.3%) in the binary logistic regression, but we expect that it was an important factor only for this dataset, because money words can be used in both the “Dominance/I” and “Submissive/We” quadrants. In phase three we test these new classifiers with the NPS, and we expect the new classifiers to improve the results, because participants with no training on Leary’s Rose should have more problems rating sentences than participants rating their own sentences.
Phase three
Methods
Design. The last phase, phase three, is a study to predict the NPS score of a conversation from the classifier score of that conversation. The conversations we used came from the database of the chatbot of the online retailer and were real conversations between customers and the chatbot. All these conversations have an NPS score, and with Python we computed the quadrant score. The NPS score and the quadrant score should show a correlation, and this correlation should support the effectiveness of the classifiers.
The NPS is a method to indicate customer satisfaction. It is based on one simple question: “How likely is it that you would recommend this company to a friend or colleague?”, answered on an eleven-point scale. Scores from zero to six are the detractors, who are unhappy customers. Scores seven and eight are the passives: satisfied but unenthusiastic customers. The remaining scores, nine and ten, are the promoters: happy and enthusiastic customers who recommend the company to friends and colleagues. The total NPS score of a company is the percentage of promoters minus the percentage of detractors (Mattrox II, 2013; Reichheld, 2003).
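As a worked example of this computation (a sketch, assuming a plain list of 0-10 answers):

    def net_promoter_score(answers):
        """NPS = percentage of promoters (9-10) minus percentage of detractors (0-6)."""
        promoters = sum(1 for a in answers if a >= 9)
        detractors = sum(1 for a in answers if a <= 6)
        return 100 * (promoters - detractors) / len(answers)

    # Two promoters, one passive, one detractor: 50% - 25% = 25.
    print(net_promoter_score([10, 9, 7, 3]))  # 25.0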
We expected emotion to be an important factor connecting the NPS score and the Leary’s Rose quadrants. The “Dominance/I” scale should correlate negatively with the NPS score and the “Submissive/We” scale should correlate positively. Our expectation is based on the findings of Shaw (2016), who described that emotion has a moderating effect on the NPS, while emotion is also an important factor in Leary’s Rose. Based on this, we expect that we can predict the NPS from the Leary scores.
Participants. In this phase we did not need participants, because we used existing anonymized conversations from a database, and these conversations already had NPS scores. The conversations came from a real-world setting and were written by real customers. We selected 303 anonymous conversations on the subject “Where is my package from an external seller.”
Procedure. A database from the chatbot was selected with almost equal numbers per NPS group: detractors with an NPS score from zero to six, passives with scores of seven and eight, and promoters with scores of nine and ten. The Python programs (Appendix H) scaled both the whole conversation and the separate sentences per conversation. Every conversation thus had an NPS score, a quadrant classification based on the whole conversation, and a quadrant classification per sentence. The whole-conversation scores were computed by the Python program, which took all sentences of a conversation and scaled them as one. The separate sentence scores were measured individually, after which the mean of the sentence scores was computed per conversation.
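The two scoring variants can be sketched as follows, where `classify` stands in for any of the classifier formulas and is assumed to return a numeric score such as the probability P:

    def score_whole_conversation(sentences, classify):
        """Scale all sentences of a conversation as one text."""
        return classify(" ".join(sentences))

    def mean_sentence_score(sentences, classify):
        """Scale each sentence separately, then average per conversation."""
        scores = [classify(s) for s in sentences]
        return sum(scores) / len(scores)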
Apparatus. In this experiment, Python was used to classify conversations and sentences with the classifiers, and SPSS 23 was used for the analysis.
Analysis. The data from the Python programs and the NPS data from the database were inserted into SPSS datasets. For every Python program (Table 8) we used a different dataset, and the whole-conversation scores and the separate sentence scores were also inserted into different SPSS datasets. Conversations with no quadrant score were marked as missing. A frequency analysis created a quick overview of the data: the number of conversations, the number of cases per quadrant and the number of cases per NPS group. Afterwards we computed a Kendall’s tau-b correlation between the quadrant score and the NPS. As explained before, detractors should correlate highly with the “Dominance/I” quadrant and promoters with the “Submissive/We” quadrant. A crosstab of quadrant and NPS group counted the cases of each NPS group per quadrant. The passive NPS group was not important in our study and was not taken into account, because this group could not be predicted by the classifiers. With the crosstab we computed the percentage of correctly scored quadrants per NPS group: a detractor with quadrant “Dominance/I” and a promoter with quadrant “Submissive/We” counted as correct. This analysis was repeated for every classifier (see Table 8), after which we compared the scores of the three classifiers.
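A sketch of the Kendall’s tau-b correlation, assuming scipy and our own (hypothetical) numeric codings for the groups:

    from scipy.stats import kendalltau

    # Hypothetical codings per conversation: NPS group (0 = detractor,
    # 2 = promoter; passives excluded) and quadrant (0 = Submissive/We,
    # 1 = Dominance/I).
    nps_group = [0, 0, 0, 2, 2, 2]
    quadrant = [1, 1, 0, 0, 0, 1]

    tau, p_value = kendalltau(nps_group, quadrant)  # tau-b, which handles ties
    print(tau, p_value)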
Results
Chat conversations. In total, 303 chat conversations with NPS scores were separated into the three NPS groups: the detractors with 102 conversations, the passives with 101 conversations and the promoters with 100 conversations. These conversations were scored by the Python programs on whole-conversation scores and on separate sentence scores for all three classifiers A, B1 and B2 (Table 8). The subsequent part discusses the results of classifier A, followed by the results of classifier B1 and finally the results of classifier B2.
Classifier A. For explorative purposes, we used every classifier in four different ways: the Python program with the classifier scaled the whole conversation with a range, the whole conversation without a range, and the sentences in the conversation with and without a range. Table 9 lists the most important outcomes of classifier A.
We used a Kendall’s tau-b correlation because our data was skewed and we therefore could not perform a Pearson correlation. The assumptions for Kendall’s tau-b: variables
Table 8
The classifiers with their specifications and their corresponding Python codes.

Classifier   Content        Fitting method   Fitting configuration   Python code
A            conversation   range            .55 - .75               1
                            no range         .65
             sentence       range            .55 - .75               2
                            no range         .65
B1           conversation   range            .45 - .55               3
                            no range         .54
             sentence       range            .45 - .55               4
                            no range         .54
B2           conversation   range            .40 - .60               5
                            no range         .51
             sentence       range            .40 - .60               6
                            no range         .51

Variables per classifier: A: we, positive emotion, relativity; B1: positive and negative emotion, relativity, insight; B2: positive and negative emotion, relativity, insight, money.