
Name: Gerard Johan Visser Student number: S1068008 Date: 20-08-2018

Supervisor: Dr. P. Haazebroek Second reader: Dr. R. E. de Kleijn

A chatbot with interpersonal communication recognition: determine the position on Leary's Rose after automatic text analysis

Abstract

The research area of chatbots is relatively young and the goal of this study is therefore to gather more information about text categorization by chatbots. Chatbots are increasingly used in online communication with users, and it is a challenge to let them respond appropriately on an emotional level, in such a way that users experience the answers as a positive interaction. The current study examined the possibility of mapping this interaction using text analysis with the LIWC and classification on Leary's Rose. Based on Leary's Rose we predicted the positive experience with the Net Promoter Score (NPS). The study consists of three phases. The first phase is a text analysis to scale sentences on Leary's Rose. The sentences were scaled by 101 participants on two scales (the "I & We" scale and the "Dominance & Submissive" scale). With these scaled sentences, a classifier (classifier A) was created and trained with the LIWC and a regression analysis. The results of phase one suggest that our database contains mostly "Dominance/I" and "Submissive/We" sentences. Classifier A (80.8%) is 3% better than the random baseline (77.8%). Classifier A was tested in phase two with self-annotated sentences from 14 participants on two scenarios. Based on these self-annotated sentences we also created two new classifiers (B1 & B2). The test result of classifier A (59.0%) equals the random baseline (59.0%), whereas the two new classifiers created in phase two performed better (B1: 72.3% and B2: 76.6%) than their random baseline (59.1%). In phase three we tried to predict NPS from Leary's Rose. We compared the classifiers with Kendall's tau-b correlations and crosstabs. The findings suggest that it is possible to predict NPS based on Leary's Rose. A possible implication is the need for a multimodal approach to text analysis. Future research should focus on better ways of annotation to prevent skewed, small and noisy databases. More implications and suggestions are presented in the discussion.

Keywords: emotion detection, emotion classification, text analysis, Leary's Rose, interpersonal communication, chatbot, Net Promoter Score (NPS)


A chatbot with interpersonal communication recognition: determine the position on Leary's Rose after automatic text analysis

In recent years, more and more chatbots have become available in different areas. Via chatbot software a human is able to interact with a computer in natural language. This software can support daily life, for example a helpdesk chatbot (Rahman, 2012; Shawar, Atwell & Roberts, 2005) that is able to answer questions from customers: a customer who wants information about a problem with a product asks the chatbot, and the chatbot answers the question. Chatbots are also used in areas such as educational tools (Keshtkar, Burkett, Li & Graesser, 2014; Vaassen & Daelemans, 2010) and e-commerce and business (Chattaraman, Kwon & Gilbert, 2012). Because chatbots are used more and more, improvements should be made, and one of the challenges is to detect emotion from the user in a chatbot conversation. To return to our helpdesk example: imagine a helpdesk chatbot that can detect someone's emotion. The chatbot is then able to change its type of communication based on the emotion of the user, in a way that makes him or her feel more understood. The chatbot is also able to recognize when the conversation goes sideways and to transfer the customer to a real human being. To reach this goal more research is needed. This study focuses on emotion classification of customer conversations with a helpdesk chatbot from a large online retailer.

Chatbots

A chatbot is a computer program designed to communicate with human users via natural language. The chatbot recognizes words or groups of words and gives answers based on this data. This type of chatbot has certain benefits. First of all, a chatbot is always present and handles real-time events 24 hours a day. In addition, communication with a chatbot is a dialog, which is more effective than a monolog when dealing with humans (Tatai, Csordás, Kiss, Szaló & Laufer, 2003). Furthermore, the chatbot combines large amounts of information and only shows the information that is asked for by the user. Finally, a chatbot can handle many cases simultaneously, which is cost-saving for the company because fewer employees are needed to answer the questions of the users. However, there are still challenges to deal with due to the complexity of human language and emotions. Computers have difficulties understanding the endless variability in how words are used to communicate meaning (Hill, Ford & Farreras, 2015). Creating a computer program that is capable of interacting with a person at a human level requires the machine to understand human behavior. One of the most important things in a conversation is expressing and understanding emotions and affects (Picard & Picard, 1997; Salovey & Mayer, 1990).

The possibility of testing humanized machines was proposed by Alan Turing in his "Turing Test" (Turing, 1950). This test is based on a conversation between a computer and a human judge, and on the ability of the computer program to impersonate a human such that the judge is not able to distinguish between the computer and a human being. One of the first chatbots subjected to the Turing Test was ELIZA, created at the Massachusetts Institute of Technology by Weizenbaum. ELIZA is a chatbot that emulates a psychotherapist (Weizenbaum, 1966). After ELIZA, many other chatbots were created with different purposes. Still, none of these chatbots has passed the Turing Test (Saygin, Cicekli & Akman, 2000; Warwick & Shah, 2016).

Emotion and classifying emotions

As described above, emotion is an important factor in humanizing computers. Classifying customers' emotions is important for companies because emotion has an effect on customer loyalty and satisfaction (DeWitt, Nguyen & Marshall, 2008; Varela-Neira, Vázquez-Casielles & Iglesias-Argüelles, 2008; Yu, White & Xu, 2007). To measure customer satisfaction and loyalty a company can use the Net Promoter Score (NPS). Shaw (2016) introduced a new indicator for emotional value, the Net Emotion Value (NEV), which measures the emotional value towards a company. His work shows that the higher the NEV (positive emotion), the higher the NPS, and thus emotion has an effect on the NPS.

Extracting emotion from text in a chat environment is different from extracting emotion from face-to-face interactions between humans. Emotion extraction from text misses facial expressions, intonation of voice and body language (Vaassen, 2014), which complicates the task. Another difference is the difference in communication styles in a human-chatbot versus a human-human chat conversation. Users tend to be more agreeable, open, extravert, conscientious, and self-disclosing when interacting with a human. When humans consciously interact with a chatbot they report lower perceived attractiveness, use less goal-driven language and use more brutal language than in chats with humans (Mou & Xu, 2017). Hill and others (2015) found further differences between human-human and human-chatbot communication: more messages, shorter message lengths, a more limited vocabulary and greater use of profanity. These differences in interaction should be taken into account when classifying emotion from chatbot texts.

Interpersonal interaction

To achieve emotion recognition in a chat conversation, a real-time automatic emotion analysis is needed. In 2010, Vaassen and Daelemans introduced the automatic classification of text according to a framework for interpersonal communication (Vaassen & Daelemans, 2010; Vaassen & Daelemans, 2011; Vaassen, Wauters, van Broeckhoven, van Overveldt, Daelemans & Eneman, 2012; Vaassen, 2014). This approach focuses not only on emotion classification but also on the interaction between a human and a chatbot, which is very helpful.

Several different frameworks for interpersonal communication have been developed over the past years (Gurtman, 2009). The first interpersonal communication model was created by the Kaiser Research Group under the name interpersonal circle, better known as "Leary's Rose" (Leary, 1957). This framework defines two roles, a speaker and a listener. These two roles change during the conversation: when someone speaks, he is the speaker, and when someone listens, he is the listener. The graphical representation of Leary's Rose is a circle which is split vertically into the "I" and "We" sides and horizontally into the "Dominance" and "Submission" sides. The vertical axis determines whether the speaker is dominant or submissive towards the listener; the horizontal axis determines the speaker's willingness to co-operate. These partitions create four quadrants: "Lead", "Follow", "Defend" and "Attack." Each quadrant can again be divided into two octants, which creates in total eight octants (Figure 1).

One characteristic makes the interpersonal circumplex particularly interesting for interpersonal communication: the circumplex can predict to some extent what the position of the listener will be when he reacts to the speaker (Figure 2) (Dijk, 2013; Dijk & Cremers, 2007; Leary, 1957; Remmerswaal, 2011). "Dominance" will trigger a complementary response, namely "Submission", and vice versa. "We" or "I" behavior will trigger a similar response: "We" behavior will trigger "We" behavior and vice versa. The speaker can thus influence the behavior and emotions of the listener by his own conversational actions (Dijk, 2013; Dijk & Cremers, 2007; Leary, 1957; Remmerswaal, 2011). For example, if a colleague is angry at you, he will attack you from the "Dominance/I" quadrant and you will defend yourself from the "Submissive/I" quadrant. In this example "Dominance" triggers a complementary response ("Submissive") and "I" behavior triggers "I" behavior.
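To make this response-prediction rule concrete, a minimal sketch in Python is given below; the function and the string representation of the quadrants are our own illustration and are not part of Leary's framework or of the software used in this study.

# Illustrative sketch of the response-prediction rule described above.
# Dominance triggers the complementary pole (Submission) and vice versa,
# while the I/We side triggers similar behavior on the same side.

def predicted_response(speaker_quadrant: str) -> str:
    pole, side = speaker_quadrant.split("/")          # e.g. "Dominance/I"
    complementary = "Submissive" if pole == "Dominance" else "Dominance"
    return f"{complementary}/{side}"                  # flip the pole, keep the side

print(predicted_response("Dominance/I"))              # -> "Submissive/I" (attack -> defend)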

Automatic detection of interpersonal communication

Leary's Rose was used by Vaassen and Daelemans (2010; 2011) in a serious gaming project named "deLearyous." The deLearyous project aimed at developing a game in which users can improve their communication skills by interacting with a virtual character. In order to apply this framework, they gathered data from a series of experiments in which the virtual agent was replaced with a human actor (Wizard of Oz setting) (Vaassen & Daelemans, 2010; Vaassen, 2014). After gathering the data, they transcribed, analysed and annotated the obtained data.

Figure 2. Response prediction according to Leary's Rose. Dominance will trigger a submissive response; I behavior triggers I behavior and We behavior triggers We behavior.

Vaassen and Daelemans used several machine learning algorithms to reach 52.5% of correctly classified sentences based on the four quadrants: "Lead", "Follow",

"Defend" and "Attack." This is higher than the random baseline of 25.15%, which is slightly above 25% because of imbalances in the class distribution, and is a significant improvement (Vaassen & Daelemans, 2010). In a subsequent study they used another classifier and reached F-scores (a measure of classifier accuracy) of up to 51% based on the four quadrants and 31% based on the eight octants. This is again a significant improvement over the random baseline (25.4% for the four quadrants, 13.1% for the eight octants). In conclusion, it is possible to beat the random baseline (Vaassen, 2014; Vaassen & Daelemans, 2010; Vaassen & Daelemans, 2011; Vaassen et al., 2012), but Vaassen and Daelemans state that it is extremely difficult to reach an acceptable performance for practical use. The main problem is the human annotators, who experience difficulty with scaling the sentences on Leary's Rose and will not always agree on the correct quadrant or octant for a sentence. Other problems are the small size of the corpora and noisy datasets due to the annotation problem (Vaassen & Daelemans, 2010; Vaassen & Daelemans, 2011; Vaassen et al., 2012; Vaassen, 2014).

Another study on the automatic detection of interpersonal communication is from Keshtkar and others (2014). This study used Leary's Rose to detect the personality of players in the Land Science game (an educational game). They also used machine learning algorithms and concluded that text categorization based on n-grams reached the highest scores, but that a combination method such as the Linguistic Inquiry and Word Count (LIWC) and Subjective Lexicons along with n-gram features can achieve better performance. Keshtkar and others (2014) also report disagreement between the human annotators in their study. In this case the bottleneck is again the categorization by human annotators.

Linguistic Inquiry and Word Count (LIWC)

Keshtkar and others (2014) used the LIWC to divide words into psychologically meaningful categories. First, the program counts the words of a sentence or text; after that, the LIWC divides the number of words in each category by the total word count and presents the percentage of words in a category. The Dutch version of the 2007 LIWC contains 11,091 words in 66 categories and gives equivalent results compared to the English LIWC (Boot, Zijlstra & Geenen, 2017).
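The LIWC itself is a proprietary program, but the counting step described above can be illustrated with a small sketch; the two-category toy dictionary below is invented for illustration and is not part of the real (Dutch) LIWC 2007 dictionary.

# Toy illustration of the LIWC scoring step described above: count how many
# words of a sentence fall in each category and divide by the total word count.
# The two categories and their word lists are invented; the real Dutch LIWC
# 2007 dictionary contains 11,091 words in 66 categories.

TOY_DICTIONARY = {
    "positive emotion": {"fijn", "bedankt", "top"},
    "negative emotion": {"slecht", "boos", "stom"},
}

def liwc_like_percentages(sentence: str) -> dict:
    words = sentence.lower().split()
    return {
        category: 100.0 * sum(word in vocabulary for word in words) / len(words)
        for category, vocabulary in TOY_DICTIONARY.items()
    }

print(liwc_like_percentages("bedankt dat was top"))
# -> {'positive emotion': 50.0, 'negative emotion': 0.0}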

Current study

The goal of this study is to scale relatively short text input, in a chatbot environment, on Leary's Rose. Furthermore, in phase two a new annotation process, based on self-annotation, is examined. In phase three we try to find a correlation between NPS and Leary's Rose and examine whether we can predict NPS based on Leary's Rose.

The study consists of three phases (Figure 3). The first phase is a text analysis in which participants annotate sentences from a chat conversation on Leary's Rose. These annotated sentences are the ground truth for the text categorization by classifier A. The classifier was obtained by a logistic regression on the sentences with the participant scores. The first hypothesis is that we expect to find a higher overall accuracy than the baseline. This expectation arises from training on the dataset, which should improve the classifier in such a way that it categorizes sentences better than the baseline.

Figure 3. A systematic flowchart of the phases in this study.

The second phase focuses on testing the classifier of phase one (classifier A). Based on scenarios, another group of participants chatted with the chatbot and, after the conversation, annotated their own typed sentences. The sentences annotated by the participant are the ground truth for those sentences. The sentence position on Leary's Rose, obtained from classifier A, is compared with this ground truth annotated position score. The second hypothesis is that, as in phase one, the overall accuracy will be higher than the baseline. We expect the classifier to remain above the random baseline because of the training on the dataset, which should improve the classifier in such a way that it categorizes sentences better than the baseline. In the second phase we also create a new classifier (classifier B), based on the sentences annotated by the participants of phase two, with a logistic regression. In phase three we compare classifier A with classifier B.

In the last phase, phase three, we predict the NPS with the scores on Leary's Rose to measure the effectiveness of the classifiers and to determine whether there is a correlation between the mean scores on Leary's Rose and the NPS per chat. The third hypothesis is that we expect to find a correlation between a high NPS and "Submissive/We" scores on Leary's Rose, and between a low NPS and "Dominance/I" scores. This expectation is based on the findings of Shaw (2016): the higher the NEV (positive emotion), the higher the NPS.

As described above, we compare the overall accuracy of classifier A with the overall accuracy of classifier B. This overall accuracy score is obtained from correctly scored NPS sentences. The fourth hypothesis is that we expect a larger improvement in overall accuracy from classifier B (phase two) than from classifier A (phase one). This expectation is based on the fact that classifier B is trained on sentences which are typed and annotated by the same participant.

The methodology, results and a short discussion of the three phases are presented separately to keep the structure well ordered and as simple as possible. In the next sections the methods, results and discussion of each of the three phases are presented. In the final section we give a general discussion and offer some ideas for future research.

Phase one

Methods

Design. The first phase was a text analysis in which participants scored sentences on Leary's Rose. Every participant received 11 random opening sentences and 6 random dialogs to score. The participant scores determined the sentence's place on Leary's Rose. The scores were obtained from two scales: the I (against) versus We (together) scale and the Dominance versus Submissive scale, both measured on a ratio scale from -100 to 100. These scores differed per participant because their scores on a sentence were subject to moods, emotions and personality. To counterbalance this phenomenon, we asked as many different participants as possible, which leveled out the differences in scores. We also used some control variables (gender, age and education level) to determine the generalizability. After the text analysis by participants we created, through the LIWC, a classifier (classifier A) to categorize sentences on Leary's Rose by a formula.

Participants. For the text categorization, 101 participants participated in our study. These 101 respondents each scored only a part of the sentences to keep the task short and easy. At least two annotators per sentence were needed to compare the annotation scores, and we reached on average five participants per sentence. This procedure differed from other studies in annotating sentences: most studies used up to four participants, who scored all the sentences and were experienced in the practice of Leary's Rose. Scoring sentences on Leary's Rose with inexperienced participants has not been done before, but annotation of emotions without training was done by Aman and Szpakowicz (2007). In addition, this was a quick and cost-saving option compared to the text categorization method of other studies.

The participants were recruited via social media (Facebook) and through e-mail and met the following criteria: familiar with a computer and the internet, Dutch speaking (first language) and in the age range of 18 to 65. These criteria decreased the chance of mistakes made by not understanding the task.

Participants who did not finish the task were removed, and if a participant's answers differed significantly, the answers were checked manually. When the participants finished their tasks, they were rewarded with the possibility of winning one of five 20-euro gift cards from the online retailer.

All the tasks were reviewed by the psychology ethics board of Leiden University and the study complies with all applicable laws and guidelines.

Procedure. The data of the first phase was collected via an online survey. The survey started with an informed consent form to inform the participants about their rights (Appendix A), the criteria and the general procedure. After accepting the informed consent, the participant got an explanation of how to score the sentences. The explanation was followed by an example to give the participant some more clarification about the tasks. This example first showed the sentence, then a slider for the Dominance – Submissive scale, followed by a slider for the I – We scale, and ended with an optional question: "Based on which words did you scale the sentences?" This last question gave us some insight into the criteria the participants used.

The first task consisted of 11 opening sentences which were randomly assigned to the participants; after completion of these 11 sentences the participant got information about the second task. The second task was similar to the first, but instead of opening sentences the participant got six dialogs between the user and the chatbot. These dialogs contained a few sentences to give the participant the feeling of reading a real conversation, in order to improve the scores on the sentences. Every sentence in the dialog was scored separately, in the same way as in the first task: first the sentence, then two sliders, followed by the "Based on which words did you scale the sentences?" question.


After completion of the second task the participant got another questionnaire with some control variables (gender, age and education level) and a few general questions (whether there were any uncertainties and whether they liked the tasks). Finally, the participant could fill in their e-mail address for the draw of the gift cards, followed by the debriefing (Appendix B). Completing all tasks took about 15 minutes.

Apparatus. In the first phase we compiled a dataset of 250 real chatbot conversations on the subject 'Where is my package from an external seller?' (from the chatbot of the online retailer). These chat conversations took place between 1 December 2016 and 28 February 2017. After reviewing these 250 conversations we excluded 8 conversations because these users were not serious about the topic and were testing the chatbot with strange sentences. Then we separated the opening sentences and the dialogs, because an opening sentence is self-contained and most of the time contains a lot of information. Not every dialog was usable, because of closed-ended questions from the chatbot. However, a notable number of dialogs contained a lot of information and the development of a dialog was very useful (Vaassen & Daelemans, 2011). Based on these findings we used both opening sentences and dialogs: 203 opening sentences and 110 dialogs.

The participants were presented a link, in a Facebook advertisement or an e-mail, to a Qualtrics questionnaire. Qualtrics is an online tool for questionnaires and can be filled in on a computer, tablet or smartphone with internet access. We recommended using a computer because it is easier to fill in the questionnaire. The questionnaire was pre-tested by two participants to check time spent and understandability.

To analyze the data, we used IBM SPSS 23. To classify words per quadrant, we used the LIWC. The LIWC counts the words of a sentence or text and divides the words into psychologically meaningful categories. Afterwards it divides the number of words in a category by the total word count and presents the percentage of words per sentence in a category. The first LIWC application was developed as part of a study of language and disclosure (Pennebaker, 1993; Tausczik & Pennebaker, 2010) and the LIWC has over the years been translated into several different languages. The 2001 LIWC dictionary was the first to be translated into Dutch (Zijlstra, Meerveld, Middendorp, Pennebaker & Geenen, 2004) and Boot, Zijlstra and Geenen (2017) translated the 2007 version. The Dutch version of the 2007 LIWC contains 11,091 words in 66 categories and gives equivalent results compared to the English LIWC, except for a small number of categories. This is because of differences in word use, homonyms, or a less suitable test corpus for some of the categories (Boot et al., 2017).

Analysis. The obtained data was inserted in IBM SPSS 23 and checked for outliers. The participants' answers were checked on task completion, total time (boxplot on task duration), descriptives (mean, minimum and maximum) and with a frequency analysis on gender, age, education level and general thoughts about the questionnaire. When data differed from the others, the answers were checked manually. When a participant did not complete the task or gave strange answers, their data was deleted and the participant was excluded from the study.

The usable data in SPSS was transposed to analyze the data by sentence instead of by participant. Then we computed the mean, standard deviation, minimum and maximum of the annotated sentence scores, as well as a count of respondents per sentence. After these calculations we computed the nominal scores for Leary's Rose. The ratio scores on the I (< 0) versus We (> 0) scale and the Dominance (> 0) versus Submissive (< 0) scale were transformed into one of the four quadrants; for instance, a sentence with a score of 64 on the I & We scale and a score of -67 on the Dominance & Submissive scale was transformed into the quadrant "Submissive/We."
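As an illustration of this transformation, the mapping from the two mean ratio scores to a quadrant label can be written as a small Python function; the function itself is ours, as the thesis performed this step in SPSS.

# Sketch of the quadrant transformation described above (performed in SPSS in
# the thesis): the mean scores on the I (< 0) / We (> 0) scale and the
# Submissive (< 0) / Dominance (> 0) scale determine the quadrant.

def to_quadrant(i_we_score: float, dominance_score: float) -> str:
    side = "We" if i_we_score > 0 else "I"
    pole = "Dominance" if dominance_score > 0 else "Submissive"
    return f"{pole}/{side}"

# The example from the text: 64 on the I & We scale, -67 on Dominance & Submissive.
print(to_quadrant(64, -67))   # -> "Submissive/We"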

Before we inserted the sentences with the participant scores into the LIWC, we created an inclusion criterion based on the mean and standard deviation of the raw scores. The sentences selected for the analysis scored higher than 15 on both scales (I & We and Dominance & Submissive) and had a standard deviation lower than 50 on both scales. Sentences which did not meet the inclusion criteria were deleted. This approach was used to separate the strongly scored sentences from the weak sentences in order to train the classifier on a stronger database.

For each sentence, the scores resulting from the annotation by the participants were inserted per quadrant into the LIWC, which categorized the sentences into its categories. With a t-test on the percentage of words per quadrant we could find differences between quadrants. For example, someone typed the sentence "You are stupid" to the chatbot, which has a "Dominance/I" ground truth. The three words were categorized by the LIWC, whereby "stupid" falls into the category "negative emotion." After categorizing all the sentences, the LIWC category "negative emotion" is significantly more common in the "Dominance/I"

sentences. This category is then used as a predictor for the "Dominance/I" quadrant. These measurements were inserted in a new SPSS datasheet to create the classifier.

Finally, a backwards logistic regression was performed on the LIWC output to determine the best predictor word groups for the quadrants of Leary's Rose. Through a binary logistic regression analysis, a classifier (classifier A) was created to rate new sentences on Leary's Rose. After the completion of the classifier, the cut-off point was optimized through a ROC curve on quadrant scores and p-values (Tosteson & Begg, 1988). Optimizing the cut-off point counterbalanced the bias of the classifier and therefore the classifier reached a higher accuracy. This bias arose due to imbalances in the number of sentences per quadrant: the classifier is skewed towards the quadrant with the most sentences. This classifier is used in phase two to automatically categorize sentences on Leary's Rose.
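The cut-off optimization itself was done in SPSS; an equivalent step in Python could look like the sketch below, which uses scikit-learn (our library choice, not the thesis software) to take the candidate thresholds from the ROC curve and pick the one with the highest overall accuracy on the training labels.

# Sketch of the ROC-based cut-off optimization described above, using
# scikit-learn instead of SPSS (library choice is ours). y_true holds the
# ground-truth quadrant labels (1 = Dominance/I, 0 = Submissive/We) and
# y_prob the predicted probabilities of the logistic regression.

import numpy as np
from sklearn.metrics import roc_curve

def best_cutoff(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    _, _, thresholds = roc_curve(y_true, y_prob)
    accuracies = [np.mean((y_prob >= t).astype(int) == y_true) for t in thresholds]
    return float(thresholds[int(np.argmax(accuracies))])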

Results

Participants. The questionnaire was finished by 101 participants, 38 males and 63 females. 67 percent of the participants were between 18 and 29 years old and more than 86 percent of the participants had an education level of HBO (higher vocational education) or higher. We examined all cases on time, descriptives and missing data. More than half of the people who started the questionnaire did not finish the task; they were removed. Based on time spent, three participants were outliers; their answers were checked manually and turned out to be filled in normally. In the other data we did not find outliers. Apart from task completion, we did not need to remove participants.

Sentence selection. The participants scored 537 unique sentences. One sentence did not belong to a chatbot conversation, by our mistake, and was deleted. After the deletion, we transposed the data in SPSS and calculated the mean, standard deviation, minimum and maximum for every sentence score on both scales. We used the inclusion criteria to select the most important sentences: the standard deviation should be lower than 50, the mean score 15 or higher, and every sentence should be scored by two or more annotators. With these inclusion criteria we selected 212 sentences (Table 1). Because of the low number of "Submissive/I" (3.3%) and "Dominance/We" (0.9%) sentences, we decided to only use the "Dominance/I" (74.5%) and "Submissive/We" (21.2%) quadrants. This resulted in 203 sentences.


Baseline. The baseline for this two-class classification is 77.8 percent. This is higher than a 50% baseline due to imbalances in the class distribution. The baseline is based on the category with the most sentences, in this case the "Dominance/I" category, which has 158 sentences in a database of 203 sentences: 158 divided by 203, multiplied by 100, is 77.8%. Classifier A should place more than 77.8% of the sentences in the correct quadrant of Leary's Rose to perform better than the random baseline.

LIWC. The 203 sentences with Leary's Rose nominal scores were inserted in the LIWC. Every sentence was categorized separately, based on the LIWC dictionary, and the category scores were inserted in SPSS. An independent samples t-test determined which categorizations differed significantly between the two quadrants. First, we checked whether the data fits the assumptions of an independent samples t-test: (1) all observations should be independent, which is the case in this test; (2) normality: the data must follow a normal distribution in the population when the samples are smaller than 25 units; our samples are bigger than 25 units and therefore we do not violate this assumption; (3) homogeneity: the standard deviation should be fairly equal in both populations; in this study some of the variables violated this assumption, but we still proceeded with the t-test because of the large number of units.
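The comparison was run in SPSS; a comparable check per LIWC category in Python could look like the sketch below, where the two input arrays are an assumption of ours (they would hold the LIWC percentage of one category for the sentences of each quadrant).

# Sketch of the per-category comparison described above, using SciPy instead
# of SPSS (library choice is ours). sub_we and dom_i hold the LIWC percentage
# of one category (e.g. "negative emotion") for the sentences of each quadrant.
# equal_var=False gives Welch's t-test, which does not assume equal variances.

from scipy import stats

def compare_category(sub_we, dom_i):
    t_statistic, p_value = stats.ttest_ind(sub_we, dom_i, equal_var=False)
    return t_statistic, p_value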

The following categories scored significant or almost significant (Table 2): dictionary cover, we, past, number, affect, positive emotion, negative emotion, cause, relative, time, work and assent.

Table 1
Sentence selection after inclusion

Quadrant    Number of cases    Percentage
I/Dom       158                74.5
We/Dom        2                 0.9
I/Sub         7                 3.3
We/Sub       45                21.2


Logistic Regression. To obtain a classifier we conducted a binary logistic regression analysis, which creates a formula; this formula is classifier A. The LIWC categories described above were used in this logistic regression. The quadrant scores are the outcome variable and the LIWC categories are the independent variables, which should not correlate highly with each other. Another assumption is that the number of covariates (the LIWC categories) should be as low as possible without decreasing the overall percentage of correctly categorized sentences.

With a backward selection procedure, we removed in every new logistic regression analysis the least significant category until the binary logistic regression satisfied the rules described above. After the backward selection procedure, we reached an overall score of 80.8 percent based on the "we", "positive emotion" and "relative" variables. All the variables have a significant contribution in the classification formula and the Omnibus Test of Model Coefficients is also significant. This means that our new model (80.8%) is better than the baseline (77.8%) by three percent.
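The backward selection itself was carried out in SPSS; a comparable procedure in Python with statsmodels (our choice of library, not the thesis software) could look like the sketch below, which keeps dropping the least significant LIWC category until every remaining predictor is significant.

# Sketch of the backward selection described above, using statsmodels instead
# of SPSS (library choice is ours). X is a DataFrame with one column per LIWC
# category (word percentages) and y holds the 0/1 quadrant labels.

import statsmodels.api as sm

def backward_logistic(X, y, alpha=0.05):
    predictors = list(X.columns)
    while predictors:
        model = sm.Logit(y, sm.add_constant(X[predictors])).fit(disp=0)
        p_values = model.pvalues.drop("const")
        worst = p_values.idxmax()                 # least significant remaining category
        if p_values[worst] <= alpha:              # all predictors significant: stop
            return model, predictors
        predictors.remove(worst)
    return None, []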

Table 2
t-test on LIWC categories per quadrant

                      Sub/We (N = 45)      Dom/I (N = 158)
Category              M        SD          M        SD         t-test    df        p-value
Dictionary cover      89,27    12,66       81,98    21,67       2,85     123,84    ,005**
We                      ,54     1,84         ,09      ,69       2,56     201       ,011*
Past                   1,10     2,93        2,65     5,91      -2,42     149,18    ,017*
Number                  ,38     2,16        2,18     6,15      -3,07     193,03    ,002**
Affect                13,98    26,94        4,28    15,12       2,32      52,14    ,025*
Positive emotion      13,54    27,05        2,00    11,60       2,79      48,69    ,008**
Negative emotion        ,44     1,72        2,19     9,91      -2,10     184,70    ,037*
Cause                   ,20     1,36        1,92     6,37      -3,15     193,44    ,002**
Relative               9,13    10,19       17,45    17,44      -4,05     123,97    ,000***
Time                   3,62     6,92       10,34    14,82      -4,29     158,27    ,000***
Work                    ,29     1,12        2,90    10,06      -3,19     169,72    ,002**
Assent                 9,18    20,85        3,34    14,86       1,76      57,31    ,084

Sub/We = Submissive/We, Dom/I = Dominance/I, N = number of cases, M = mean, SD = standard deviation, df = degrees of freedom

To achieve a higher overall percentage of correctly categorized sentences we optimized the cut value of the binary logistic regression. Classifier A reached its highest score with a cut value of .65, found by the ROC curve. The percentage of correctly classified sentences increased to 82.3 percent, which is an increase of 4.5 percent. The classifier (classifier A) created by this

binary logistic regression is:

P = e^(1.027 - .368 * we - .029 * positive emotion + .038 * relativity) / (1 + e^(1.027 - .368 * we - .029 * positive emotion + .038 * relativity))
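Written out in Python, classifier A looks as follows. The inputs are the LIWC percentages of the "we", "positive emotion" and "relativity" categories for one sentence, and the .65 cut value is the ROC-optimized cut-off reported above; the assumption that a probability above the cut value corresponds to the "Dominance/I" quadrant follows from the signs of the coefficients and Table 2 and is not stated explicitly in the formula itself.

# Classifier A written out in Python. The inputs are the LIWC percentages of
# the "we", "positive emotion" and "relativity" categories for one sentence.
# Assumption: a probability above the ROC-optimized cut value of .65 is read
# as "Dominance/I" (consistent with the coefficient signs and Table 2).
import math

def classifier_a(we: float, positive_emotion: float, relativity: float,
                 cut_value: float = 0.65) -> str:
    logit = 1.027 - 0.368 * we - 0.029 * positive_emotion + 0.038 * relativity
    p = math.exp(logit) / (1 + math.exp(logit))
    return "Dominance/I" if p >= cut_value else "Submissive/We"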

Discussion

In the first phase we conducted a text analysis to scale sentences from a chat conversation on Leary's Rose. Participants annotated sentences, and these annotations were the ground truth for training classifier A. The hypothesis was: "We expect to find a higher overall accuracy than the baseline." The conclusion for this hypothesis is that it is possible to increase the accuracy of the categorization on Leary's Rose; the improvement is 4.5 percent.

First of all, it is noteworthy that, after close examination, our chatbot conversations were found mostly in two quadrants of Leary's Rose. This could be due to the types of conversation between the chatbot and the user. The first type is the "it goes well / I got my answer" conversation: the user is happy with the answer from the chatbot and their problem is solved. The second type is the "this is not what I want" conversation: the chatbot does not understand the user or the user is unhappy with the answer given by the chatbot. Conversations of the first type belong to the "Submissive/We" quadrant and conversations of the second type belong to the "Dominance/I" quadrant. The other quadrants do not seem to fit the chatbot conversations on the subject "Where is my package from an external seller." Another explanation is the difference in communication styles between human-human and human-chatbot conversations (Hill et al., 2015; Mou & Xu, 2017).

If we compare our results with the studies of Vaassen and Daelemans (2010; 2011) and Vaassen et al. (2012), we find similar results. Vaassen and Daelemans also found an improvement in categorizing sentences with a classifier, and this improvement was even higher than the improvement we found. Our lower classification improvement can be due to the shorter text input and a different approach. For our research we used the chatbot of an online retailer, which was a rule-based chatbot. Because of this we also chose a rule-based bag-of-words approach for our classification framework, and not a machine learning approach such as that of Vaassen and Daelemans (2010; 2011) and Vaassen et al. (2012). In addition, Vaassen and Daelemans used a training set with more than 1000 sentences, which is bigger than ours, and they used annotators who were trained in the use of Leary's Rose.

Another important factor in the small improvement is the annotation problem described by Vaassen (2014). According to Vaassen (2014) the problem starts with the data collection and manual annotation: human annotators often disagree about the position on Leary's Rose. This results in small, noisy and low-agreement datasets. This challenge was also noticed in our annotation process and is reflected in the high standard deviation scores.

A possible solution for this problem is a follow-up study in which the sentences are annotated by the writers of those sentences. The participants should be better at annotating their own sentences, because they know the purpose and meaning of their sentences. This solution was tested in phase two.

Phase two

Methods

Design. The second phase consisted of a control part and a follow-up part. With the control part we tested the validity of classifier A and with the follow-up part we created a new classifier (classifier B). In the control part, the participants were asked to hold two different conversations with the chatbot, based on two scenarios, namely a "Dominance/I" scenario and a "Submissive/We" scenario (Appendix C). In the follow-up part the conversations were annotated by the same participants: they rated their own sentences on one ratio scale, the "Dominance/I" versus "Submissive/We" scale. The sentences annotated by the participants and the predefined nominal scenario scores ("Dominance/I" or "Submissive/We") were the ground truth for testing classifier A. After testing classifier A, we used the annotated sentences, as ground truth, to create a new classifier (classifier B).

The annotated sentence scores differed per participant even though every participant received the same two scenarios. This ensured external validity because every participant was subject to their own emotions and personality. To obtain a well-balanced mean score, we asked different respondents, which leveled out the differences in scores. Also, a participant could have learned from the first scenario; to counterbalance this issue we randomized the order of the scenarios.


Finally, we checked all data on time spent, task completion and descriptives (mean, minimum and maximum of the annotated sentence scores). Odd answers were checked manually and removed as outliers where necessary.

Participants. We recruited 29 participants through Facebook and e-mail, as in the first phase. The criteria for the participants were: familiar with a computer and the internet, in the age range of 18 to 65, Dutch speaking (first language), and not having participated in the first phase. Participants were also excluded when they did not finish the questionnaire or the follow-up questionnaire, and all answers were checked manually for odd responses. The reward for participating in this study was the chance to win a 20-euro gift card from an online retailer.

None of the earlier studies on text analysis and Leary's Rose used participants to validate their own outcomes. However, using 20 participants or more is in line with the studies of Settanni and Marengo (2015) and Georgaca and Avdi (2011). Settanni and Marengo (2015) used 20 participants to analyze emotions in Facebook posts, and based on this study we also aimed for 20 participants. Georgaca and Avdi (2011) confirm the number of 20 participants or more, to validate outcomes, in their guide "Discourse Analysis."

This phase was reviewed by the psychology ethics board of Leiden University and the study complies with all applicable laws and guidelines.

Procedure. The chatbot was used to hold conversations with the participants. These conversations followed two scenarios on the subject 'Where is my package from an external seller?' One scenario focused on the Dominance/I quadrant and the other scenario was based on the Submissive/We quadrant (see Appendix C). The participants were asked to empathize with the tasks in the scenario. Imitating behavior and emotions has been shown to be possible in a diversity of tests (Grubb & McDaniel, 2007; Keen, 2006; McFarland, Ryan & Ellis, 2002), and based on these studies we expected that the participants could act out the scenario behavior.

The second phase started with data collection via an online questionnaire, reachable through a link on Facebook or in an e-mail. The questionnaire started with an informed consent form (Appendix D) to inform the participants about the criteria, the general procedure and their rights. The informed consent was followed by more information about the task: some basic information about a chatbot and a more extensive explanation of the procedure. Then the participant received one of the two scenarios; the order was randomized. After reading the scenario the participant started the conversation through a link. The chatbot finished the conversation when the participant had asked everything they needed to know. In the chatbot conversation the participant was asked to fill in their e-mail address. E-mail addresses were used to connect the questionnaire data with the chatbot data, to send the follow-up questionnaire, and to assign the gift card to the winner.

When the conversation was finished the participant received the other scenario and the procedure was repeated in the same way as described above. After the second conversation the participant was navigated back to the questionnaire, where they were asked to answer some demographic questions (gender, age and education level) and a few general questions (whether there were any uncertainties and whether they liked the tasks). Finally, the participant filled in their e-mail address, followed by the debriefing (Appendix E). The total time to complete the whole task was about 15 minutes.

In the second part of phase two the respondents were asked to fill in a follow-up questionnaire which was sent to their e-mail address. In this questionnaire they annotated their own sentences on a Dominance/I – Submissive/We scale. Participants could only take part in this follow-up when the first questionnaire was completely filled in.

The questionnaire started with the informed consent (Appendix F) and was followed by information about the task. This task is almost the same as the task in phase one; the only difference is the scale, which is now a single scale (Dominance/I – Submissive/We) ranging from -100 to 100. The instruction is exactly the same as in phase one. After annotating their own sentences the participant was asked to fill in a few general questions (whether there were any uncertainties and whether they liked the tasks) and to give their e-mail address. Finally the participant could read the debriefing (Appendix G). This questionnaire took about 5 minutes.

Apparatus. The questionnaire was an online Qualtrics questionnaire. Qualtrics is an online tool for questionnaires and can be filled in on a computer, tablet or smartphone with internet access. We recommended using a computer because it was easier to fill in the questionnaire. The first task of phase two was pre-tested by one participant to check time spent and understandability. The second task was not pre-tested because it is almost the same as the task in phase one.

Analysis. In phase one we trained classifier A, and this classifier was tested in the current phase. The test set contained sentences self-annotated by the participants. The sentences written by the participants were scored by classifier A and compared with the ground truth. The ground truth was the predefined scenario (Dominance/I or Submissive/We) or the self-annotated scores from the participants obtained in the follow-up study. To clarify: the scores from classifier A were compared with both the predefined scenarios and the self-annotated scores from the participants.

First, we checked all data from the participants on task completion, time spent and some descriptives (mean, minimum and maximum). If a participant differed from other participants, their answers were checked manually and deleted when needed.

Second, we wrote two Python scripts (Appendix H) to score the sentences with the formula created in phase one. The first script applies classifier A without a range and the second script applies classifier A with a range. A range could improve the accuracy of the classifier because close to the cut-off point there could be a mix of "Dominance/I" and "Submissive/We" sentences (Jones, 2016; Lord, 1961). For example, with a cut-off point of .65 we can add a range from .60 to .70, which means that scores between .60 and .70 are not classified, in order to avoid mismatches.
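A minimal sketch of this ranged cut-off is given below, assuming, as before, that a probability above the range corresponds to the "Dominance/I" quadrant; the function itself is illustrative and is not the script from Appendix H.

# Sketch of the ranged cut-off described above: probabilities inside the range
# around the cut-off point (.60-.70 in the example) are left unclassified to
# avoid mismatches close to the decision boundary. Assumption: a probability
# above the range corresponds to "Dominance/I".

def classify_with_range(p: float, lower: float = 0.60, upper: float = 0.70):
    if lower <= p <= upper:
        return None                       # too close to the cut-off: not classified
    return "Dominance/I" if p > upper else "Submissive/We"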

The data from Python, the two scenarios and the follow-up questionnaire were inserted into SPSS 23. Outliers were checked manually and all sentences with missing participant scores were deleted. Sentences with a zero score, as calculated by classifier A, were also deleted since these sentences did not have matching words in the wordlist. Thereafter we computed the frequencies of the sentences scored by classifier A and by the participants. Afterwards we determined the sentences which were scored equally by classifier A, the scenarios and the participant. This was followed by another frequency analysis on the participant sentence scores after selecting only the equally scored sentences. Based on these frequency tables we computed the percentage of sentences scored correctly by the classifier. This analysis was repeated with the data from the Python program with ranges.

In the second part of phase two we created a new formula (classifier B) by repeating part of the analysis from phase one. The sentences from the follow-up study were analyzed with the LIWC, and with these results we created classifier B through a logistic regression analysis. A ROC curve determined the cut-off point and range. After this analysis we had a new classifier for phase three.

Results

Participants. In the questionnaire of phase two, 29 participants took part. We examined the 29 cases on time and missing data. 15 cases had to be removed because the participants did not finish the questionnaire or the follow-up study. From the remaining 14 cases we did not have to remove any other participants. Five of our participants were men and nine were women. Except for two, all participants were between 18 and 29 years old. Eleven participants had an education level of HBO (higher vocational education) or higher; the other participants had an education level lower than HBO.

Sentence selection. The conversations with the chatbot yielded 454 sentences (Table 3): 235 sentences from the "Submissive/We" scenario and 219 sentences from the "Dominance/I" scenario. In the follow-up study 145 sentences were scored on Leary's Rose by the participants. The sentences that were not scaled were short sentences containing answers like "yes" or "no." Comparing the predefined scenario quadrant score with the participant quadrant score, almost 30% did not match. After a closer review we decided to use only the follow-up quadrant scores, because many participants scored sentences in the "Submissive/We" scenario as "Dominance/I", which affected the reliability of the predefined scenarios (Table 4). This decision is based on a two-sided Fisher's exact test (p = .000, FET), which indicates that the distributions are different. We chose the Fisher's exact test since two cells had an expected count of less than five.
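The test itself was run in SPSS; the same comparison with SciPy (our library choice), on the 2x2 table of counts taken from Table 4 while leaving out the seven sentences scored as neither quadrant, would look like this:

# Illustration of the Fisher's exact test described above, using SciPy instead
# of SPSS (library choice is ours). The 2x2 table holds the counts from
# Table 4, leaving out the sentences that were scored as neither quadrant.

from scipy.stats import fisher_exact

table = [[46, 24],   # "Submissive/We" scenario: scored Sub/We, scored Dom/I
         [11, 57]]   # "Dominance/I" scenario:   scored Sub/We, scored Dom/I
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)       # far below .001, consistent with the p = .000 reported above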

The 309 sentences which were not scored in the follow-up study were deleted. This left 145 sentences: 57 scored as "Submissive/We", 81 as "Dominance/I" and 7 as neither "Submissive/We" nor "Dominance/I." These 7 sentences were scored as zero on both "Submissive/We" and "Dominance/I" (Table 4).

Table 3
A crosstab of the sentence selection by predefined scenarios and follow-up

                          Follow-up
Scenario           None      Sub/We     Dom/I      Not scaled   Total
Sub/We      N        3         46         24          162         235
            %       0.7%      10.1%       5.3%        35.7%       51.8%
Dom/I       N        4         11         57          147         219
            %       0.9%       2.4%      12.6%        32.4%       48.2%
Total       N        7         57         81          309         454
            %       1.5%      12.6%      17.8%        68.1%       100%


Table 5 gives an overview of the sentences from the follow-up study and the sentences classified by classifier A, after deletion of the sentences that were not scaled. The 145 follow-up sentences from the participants were inserted in a Python program to score them with classifier A. The formula classified 105 sentences; the 40 sentences that were not classified had no words in any of the dictionaries. These sentences were also deleted, since we focused only on classified sentences. These 105 sentences were the basis for the control part and the improvement part. Table 6 displays the sentence selection after deletion of the sentences not scaled by the classifier.

Baseline control study. The statistical baseline for this classification problem is 59.0 percent. The baseline is based on the category with the most sentences (Table 6), in this case the "Dominance/I" category (59.0%). This is higher than a 50% baseline due to imbalances in the class distribution. Classifier A should place more than 59.0% of the sentences in the correct quadrant of Leary's Rose to perform better than this baseline.

Table 4
A crosstab of the sentence selection by predefined scenarios and follow-up after deletion of the not scaled sentences

                          Follow-up
Scenario           None      Sub/We     Dom/I      Total
Sub/We      N        3         46         24          73
            %       2.1%      31.7%      16.6%       50.3%
Dom/I       N        4         11         57          72
            %       2.8%       7.6%      39.9%       49.7%
Total       N        7         57         81         145
            %       4.8%      39.3%      55.9%       100%

Table 5
A crosstab: Sentence selection and classified sentences after deletion of not scaled sentences

                          Follow-up
Classifier         None      Sub/We     Dom/I      Total
None        N        4         17         19          40
            %       2.8%      11.7%      13.1%       27.6%
Sub/We      N        0          4          4           8
            %        .0%       2.8%       2.8%        5.5%
Dom/I       N        3         36         58          97
            %       2.8%       7.6%      39.9%       66.9%
Total       N        7         57         81         145
            %       4.8%      39.3%      55.9%       100%

Control study. After analyzing the wrongly scored sentences (Table 6), we argued that a single cut-off point is not precise enough: a range could improve the accuracy of the classifier because the wrongly classified sentences lie around the .65 cut-off point. If we add a range around the cut-off point from .55 to .75, we reach a classification score of 60.8%, which is an improvement of almost 2%.

A new classification formula. The 145 sentences from the follow-up study were used as ground truth to create a new classification formula (classifier B). Of the 145 sentences, 57 were scored as "Submissive/We", 81 as "Dominance/I" and 7 as neither "Submissive/We" nor "Dominance/I." The 7 sentences were removed because they have a zero score on both quadrants.

The statistical baseline is 59.1%. This baseline is higher than a 50% baseline because of imbalances in the class distribution. Classifier B should score higher than 59.1% to improve on this classification problem.

All 138 sentences were inserted in the LIWC. Every sentence was categorized separately and the category scores were inserted in SPSS. An independent samples t-test determined which categorizations differed significantly between the two quadrants. The categories six-letter words, affect, positive emotion, negative emotion, insight, relativity, time and money scored significant or almost significant (Table 7).

Table 6
A crosstab: Sentence selection and classified sentences after deletion of non-scaled sentences

                          Follow-up
Classifier         None      Sub/We     Dom/I      Total
Sub/We      N        0          4          4           8
            %        .0%       3.8%       3.8%        7.6%
Dom/I       N        3         36         58          97
            %       2.9%      34.4%      55.2%       92.4%
Total       N        3         40         62         105
            %       2.9%      38.1%      59.0%       100%


Logistic Regression for a new classifier. The LIWC categories described above were used in a logistic regression. The quadrant scores were the outcome variable and the LIWC categories were the covariates, which should not correlate highly with each other. Another assumption was that the number of covariates should be as low as possible without decreasing the overall percentage of correctly categorized sentences. The regression analysis met all the assumptions.

With a backward selection procedure, we removed in every new logistic regression the least significant category until the binary logistic regression satisfied the assumptions described above. After the backward selection procedure and with a cut-off value of .536, we reached an overall score of 72.3% with the positive emotion, negative emotion, relativity and insight variables in the equation (classifier B1).

One variable, namely money, increased the overall percentage of the classification by more than 4%. Therefore, we decided to create a second classifier (classifier B2) with the variable money in the equation. Classifier B2 reached an overall percentage of 76.6% with a cut-off value of .51.

Table 7
t-test on LIWC categories per quadrant

                      Sub/We (N = 56)      Dom/I (N = 81)
Category              M        SD          M        SD         t-test    df        p-value
Six-letter words      21.01    14.63       13.78    15.48       2,77     122,56    ,006**
Affect                 6,84    13,73        2,40     6,83       2,50     135       ,014*
Positive emotion       6,33    13,78        0,36     2,81       3,79     135       ,000***
Negative emotion       0,51     2,30        1,74     5,00      -1,72     135       ,088
Insight                4,15     7,23        1,59     4,25       2,60     135       ,010*
Relativity            11,78    11,84       20,42    15,80      -3,47     135       ,001**
Time                   5,95     8,41       11,38    13,05      -2,74     135       ,007**
Money                  2,80     6,17        0,69     2,33       2,80     135       ,006**

Sub/We = Submissive/We, Dom/I = Dominance/I, N = number of cases, M = mean, SD = standard deviation, df = degrees of freedom

All the variables in both regression analyses were significant and the Omnibus Test of Model Coefficients was also significant. This means that both of our classifiers, B1 and B2 (without and with money), are better than the baseline, by 13.2% and 17.5% respectively. The classifiers created by the

binary logistic regressions are:

Classifier B1:

P = e^(0.146 - .126 * positive emotion + .071 * negative emotion + .033 * relativity - .057 * insight) / (1 + e^(0.146 - .126 * positive emotion + .071 * negative emotion + .033 * relativity - .057 * insight))

Classifier B2:

P = e^(0.609 - .136 * positive emotion + .056 * negative emotion + .024 * relativity - .075 * insight - .151 * money) / (1 + e^(0.609 - .136 * positive emotion + .056 * negative emotion + .024 * relativity - .075 * insight - .151 * money))

Discussion

In phase two we tested classifier A and created two new classifiers, B1 and B2, by constructing two new formulas. The ground truth for testing classifier A could be established by two methods. The first method used the follow-up study, in which the participants who typed the sentences also annotated their own sentences. The second method used the predefined scenarios. These predefined scenarios turned out not to be reliable (Fisher's exact test): the participants used many "Dominance/I" sentences in the "Submissive/We" scenarios. After a close review we concluded that if the chatbot did not answer as expected, the participant used "Dominance/I" sentences, even in the "Submissive/We" scenarios. Due to the unreliable predefined scenario scores, classifier A was tested only with the self-annotated sentence scores from the participants.

The hypothesis we tested in phase two was: "The overall accuracy of classifier A will be higher than the baseline." Contrary to our expectations, classifier A showed no improvement in categorizing the sentences in the test part of phase two: the classifier reached the same percentage of correctly scored sentences as the baseline (59%). Notable are the small number of sentences categorized in the "Submissive/We" quadrant and the large number of sentences categorized in the "Dominance/I" quadrant. This is probably due to the training set from phase one, which created classifier A: the data in the training set was unbalanced, noisy and small. This is the reason we cannot accept the hypothesis.

We argued that a ranged cut-off point could improve the results of the classifiers (Jones, 2016; Lord, 1961) and that a single cut-off point was possibly not precise enough, because close to the cut-off point there could be a mix of "Dominance/I" and "Submissive/We" sentences. After adding a ranged cut-off point the categorization improved slightly (2%). We therefore do not expect a ranged cut-off point to be the solution for reaching large improvements.

After testing classifier A we created two new classifiers (classifiers B1 and B2). Classifier B2 had an extra variable, namely money. This variable showed a significant improvement (4.3%) in the binary logistic regression, but we expect that it was an important factor only for the categorization of this dataset, because money words can be used in both the "Dominance/I" and the "Submissive/We" quadrant. In phase three we test these new classifiers on NPS, and we expect to improve the results with the new classifiers because participants with no training on Leary's Rose should have more problems rating sentences than participants rating their own sentences.

Phase three

Methods

Design. The last phase, phase three, is a study to predict the NPS score of a conversation from the classifier score of that particular conversation. The conversations we used came from the database of the online retailer’s chatbot and were real conversations between customers and the chatbot. All these conversations have an NPS score, and with Python we computed the quadrant score. In this study the NPS score and the quadrant score should show a correlation, and this correlation should support the effectiveness of the classifiers.

The NPS is a method to indicate customer satisfaction. It is based on one simple question: “How likely is it that you would recommend this company to a friend or colleague?” The question is answered on an eleven-point scale. Scores from zero to six are the detractors, who are unhappy customers. Scores of seven and eight are the passives, who are satisfied but unenthusiastic customers. The remaining scores, nine and ten, are the promoters, who are happy and enthusiastic customers and recommend the company to friends and colleagues. The total NPS score for a company is the percentage of promoters minus the percentage of detractors (Mattrox II, 2013; Reichheld, 2003).
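A small example of this computation, under the assumption that the individual answers are available as a list of 0-10 scores:

def nps(scores):
    """Compute the Net Promoter Score from a list of 0-10 answers.

    Detractors score 0-6, passives 7-8, promoters 9-10; the NPS is the
    percentage of promoters minus the percentage of detractors.
    """
    detractors = sum(1 for s in scores if s <= 6)
    promoters = sum(1 for s in scores if s >= 9)
    return 100.0 * (promoters - detractors) / len(scores)

# Example: 5 detractors, 3 passives and 2 promoters -> NPS = 20% - 50% = -30.
print(nps([2, 4, 5, 6, 6, 7, 8, 8, 9, 10]))  # -30.0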

We expected emotion to be an important factor linking the NPS score and Leary’s Rose quadrants. The “Dominant/I” scale should correlate negatively with the NPS score and the “Submissive/We” scale should correlate positively with the NPS score. Our expectation is based on the findings of Shaw (2016), who described that emotion has a moderating effect on the NPS, and emotion is likewise an important factor in Leary’s Rose. Based on this we expect that we can predict the NPS from the Leary scores.

Participants. In this phase we did not need participants, because we used existing anonymized conversations from a database and these conversations already had NPS scores. The conversations came from a real-world setting and were written by real customers. We selected 303 anonymous conversations on the subject “Where is my package from an external seller.”

Procedure. A database from the chatbot was selected with almost equal numbers per NPS group: detractors with an NPS score from zero to six, passives with a score of seven or eight, and promoters with an NPS score of nine or ten. The Python programs (Appendix H) scaled the whole conversation and the separate sentences per conversation. Every conversation had an NPS score, a quadrant classification based on the whole conversation, and a quadrant classification per sentence. The whole-conversation scores were computed by the Python program, which took all the sentences and scaled them as one. The separate sentence scores were measured individually, after which the mean of the sentence scores was computed per conversation.
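The actual programs are in Appendix H; the sketch below only illustrates the two scoring routes, assuming a hypothetical liwc_scores(text) helper that returns the LIWC category percentages of a text and reusing the classify_b1 function from the earlier sketch. Averaging the per-sentence probabilities before applying the cut-off is one possible reading of how the separate sentence scores were combined.

def whole_conversation_score(sentences):
    """Whole-conversation score: join all sentences and scale them as one text."""
    return classify_b1(liwc_scores(" ".join(sentences)))

def mean_sentence_score(sentences, cutoff=0.54):
    """Separate-sentence score: scale each sentence and average the probabilities."""
    probs = [classify_b1(liwc_scores(s))[0] for s in sentences]
    mean_p = sum(probs) / len(probs)
    quadrant = "Dominance/I" if mean_p >= cutoff else "Submissive/We"
    return mean_p, quadrant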

Apparatus. In this experiment Python was used to classify conversations and sentences with the classifiers, and SPSS 23 was used for the analysis.

Analysis. The output of the Python programs and the NPS data from the database were inserted into an SPSS dataset. For every Python program (Table 8) we used a different dataset, and the whole-conversation scores and the separate sentence scores were also inserted into different SPSS datasets. Conversations with no quadrant score were marked as missing. A frequency analysis gave a quick overview of the data: the number of conversations, the number of cases per quadrant and the number of cases per NPS group. Afterwards we computed a Kendall’s tau-b correlation between the classifier’s probability score and the NPS. As explained before, detractors should correlate strongly with the “Dominant/I” quadrant and promoters should correlate strongly with the “Submissive/We” quadrant. A crosstab of quadrant by NPS group counted the cases of each NPS group per quadrant. The passive NPS group was not important in our study and was not taken into account, because this group could not be predicted by the classifiers. With the crosstab we computed the percentage of correctly scored quadrants per NPS group: detractors with quadrant “I/Dominance” and promoters with quadrant “We/Submissive”. This analysis was repeated for every classifier (see Table 8). Afterwards we compared the scores of the three classifiers.
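The analysis itself was done in SPSS; purely as an illustration of the same steps, the sketch below uses pandas and scipy on a hypothetical export with one row per conversation (the column names nps, nps_group, quadrant and p are assumptions).

import pandas as pd
from scipy.stats import kendalltau

# Hypothetical export of the Python program; conversations without a
# quadrant score are missing and are dropped here.
df = pd.read_csv("classifier_output.csv").dropna(subset=["p"])

# Kendall's tau-b between the classifier probability and the NPS score.
tau, sig = kendalltau(df["p"], df["nps"])
print(f"tau-b = {tau:.3f}, p = {sig:.3f}")

# Crosstab of quadrant by NPS group; passives are dropped because the
# classifiers cannot predict this middle group.
subset = df[df["nps_group"] != "passive"]
print(pd.crosstab(subset["quadrant"], subset["nps_group"]))

# Percentage correctly scored: detractors should land in "Dominance/I",
# promoters in "Submissive/We".
correct = ((subset["nps_group"] == "detractor") & (subset["quadrant"] == "Dominance/I")) | (
    (subset["nps_group"] == "promoter") & (subset["quadrant"] == "Submissive/We")
)
print(f"correctly scored: {correct.mean():.1%}")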

Table 8

The classifiers with their specifications and their corresponding Python codes.

Classifier   Content        Fitting method   Fitting configuration   Python code
A            conversation   range            .55 - .75               1
                            no range         .65
             sentence       range            .55 - .75               2
                            no range         .65
B1           conversation   range            .45 - .55               3
                            no range         .54
             sentence       range            .45 - .55               4
                            no range         .54
B2           conversation   range            .40 - .60               5
                            no range         .51
             sentence       range            .40 - .60               6
                            no range         .51

Note. Variables in classifier A: we, positive emotion, relativity; B1: positive & negative emotion, relativity, insight; B2: positive & negative emotion, relativity, insight, money.

Results

Chat conversations. In total, 303 chat conversations with NPS scores were divided into the three NPS groups: the detractors with 102 conversations, the passives with 101 conversations and the promoters with 100 conversations. These conversations were scored by the Python programs on the whole-conversation scores and on the separate sentence scores for all three classifiers A, B1 and B2 (Table 8). The following part discusses the results of classifier A, followed by the results of classifier B1 and finally the results of classifier B2.

Classifier A. For explorative purposes, we used every classifier in four different ways. The Python program with the classifier scaled the whole conversation with a range, the whole conversation without a range, and the sentences in the conversation with and without a range. Table 9 lists the most important outcomes of classifier A.

We used a Kendall’s tau-b correlation because our data was skewed and therefore we could not perform a Pearson correlation. The assumptions for Kendall’s tau-b are that the variables should be at least ordinal and should have a monotonic relationship. The latter assumption is not very strict, because it can usually be assessed by inspection. The data did not fail the assumptions for Kendall’s tau-b.

Classifier A shows a slight improvement in three approaches. The correlation between the NPS and the p value of the classifier score is significant for ‘classifier A on conversations’ (rτ = -.094, p = .023) and ‘classifier A on sentences’ (rτ = -.157, p < .001). Most of the “Dominance/I” sentences are scored correctly, but the “Submissive/We” sentences are mostly scored wrong. This could explain the low total classification scores and therefore the low improvement relative to the baseline classification.

The sentence-based classification has the best improvement, with a mean improvement of 4.3%. The classifier without a range showed a better improvement (a mean of 3.4%) than the classifier with a range. Classifier A based on sentences without a range scored an improvement of 4.9%, which is the highest improvement of the different approaches of classifier A. The mean improvement over the different ways of measurement of classifier A is 2.6%.

Classifier B1. As in phase one, we used every classifier in four different ways, for explorative purposes. Three approaches of classifier B1 showed a significant improvement (17.5%, 15.9% & 15.0%; Table 10). The other approach measured a decrease of -3.4% relative to the baseline classification. The data did not fail the assumptions for Kendall’s tau-b and showed significant correlations for the classifiers (rτ = -.170, p < .001) and

Table 9

Outcomes of classifier A

Classifier                 Correlation (R) (sig)   Scored conversations   I/Dom total   I/Dom correct   We/Sub total   We/Sub correct   Baseline classification   Total classification   Improvement
conversation, with range   -.094 (.023)            199                    74            74              50             0                59.7%                     59.7%                  0.0%
conversation, no range     -.094 (.023)            297                    101           101             98             5                50.8%                     52.6%                  1.8%
sentence, with range       -.157 (.000)            208                    79            72              59             12               57.2%                     60.9%                  3.7%
sentence, no range         -.157 (.000)            297                    101           89              98             33               50.8%                     55.7%                  4.9%
