
Sentiment Analysis on Dutch Phone Call Conversations

Delano de Ruiter

DelanodeRuiter@hotmail.com
Master Information Studies: Data Science

Faculty of Science University of Amsterdam

Internal Supervisor: Dr Bert Bredeweg, UvA, FNWI, IvI, B.Bredeweg@uva.nl
External Supervisor: Wout Hoeve, Zonneplan, Wout@zonneplan.nl

ABSTRACT

In research and business there is a need for capturing subjective information in speech. When sentiment analysis is successfully applied to speech, this can be done quickly and automatically. In this study speech recognition is used to transcribe Dutch phone calls. Classification is performed on the transcripts, resulting in a positive or negative prediction of a sale. This study finds a significant difference in word frequency between the 'sale' and 'no sale' classes, which makes it possible to predict the class of a conversation correctly. It is expected that this approach can be applied to other domains and languages.

KEYWORDS

Automatic Speech Recognition, Sentiment Analysis, Sentiment Classification, Corpus Comparison

1 INTRODUCTION

Companies striving for the best customer satisfaction may benefit from customer sentiment analysis. Customer satisfaction results in ongoing loyalty, increased word-of-mouth, greater brand value, and is correlated with higher earnings [16].

Traditionally, reviews and surveys are used to measure satisfaction. These methods can be slow and limited to a small group, and customers might see surveys as a nuisance. The advent of social media has changed communication between company and customer. This communication is mostly unstructured but still contains a lot of information about the customer's sentiment. Sentiment classification has been performed successfully on social media [21], but this is still only a small part of the total communication between customer and business.

Nowadays, almost every bit of communication with the customer is stored. For many companies, communication goes mostly by phone. This research tries to predict customer conversion through sentiment analysis on phone call transcripts obtained with Dutch speech recognition. The objective of this study is to determine a difference in word frequency between phone call conversations that do and do not lead to a sale in the domain of solar panels.

Sentiment analysis on phone calls helps in tailoring to a customer's specific needs and wishes. It also provides insights into word usage in positive and negative calls, the optimum length of a conversation, recent shifts in sentiment, product-specific sentiment, and more [16].

In recent years the field of automatic speech recognition (ASR) has made great progress and has been adopted in all kinds of applications [14]. Sentiment analysis (SA) also has an extensive body of research and is widely used in marketing [15]. The combination of these two fields has, however, not been studied extensively, even though it has great potential [3, 19].

This paper is organized as follows. Section 2 describes related literature. In section 3 the approach of this study is explained. Section 4 shows the results. In section 5 these results are discussed. A conclusion is given in section 6. The last section presents some possible future work.

2 RELATED LITERATURE

This paper combines two fields of research: automatic speech recognition (ASR) and sentiment analysis (SA). These two fields are reviewed first. After that, research on speech sentiment analysis is described, followed by the benefits of visualizing sentiment. Lastly, suitable statistical tests for textual differences between corpora are given.


2.1 Automatic Speech Recognition

ASR systems are commonly used in everyday applications [14]. These systems allow natural communication between humans and computers. There are a number of commercial systems, such as those by Google, Amazon, and Microsoft, and two open-source systems: CMU Sphinx and Kaldi [14]. Most systems use a Hidden Markov Model (HMM). In recent years the results of ASR systems have improved dramatically with the addition of deep learning [14]. A HMM describes the probabilistic relationship between different states. Speech has a temporal structure and can be encoded as a sequence of audio frequencies. At each time step the HMM is able to make a transition to another state [5]. On entering a HMM state, an acoustic feature vector is generated. A word is made up of a combination of these states. A single unit of sound is called a phoneme. For example, in English the phoneme /th/ is a single sound and is often followed by /ə/ to make the word 'the'.

HMMs are the basis of most speech systems; however, improvements can be made with additions: "Refinements include feature projection, improved covariance modelling, discriminative parameter estimation, adaptation and normalisation, noise compensation and multi-pass system combination." [5]

On top of the probabilistic HMM, three models are used to make the match. Single units of sound are combined to make words. For any combination of sounds the most likely word is composed. This word is then seen in the context of its predecessors, since some words follow other words more often. The three models used to make the match are:

– Acoustic model: Finds the most likely phoneme from acoustic properties. The acoustic properties are taken from a small piece of audio.

– Phonetic dictionary: Contains a mapping from words to phonemes. Some words can be pronounced in different ways, therefore there can be multiple phoneme mappings for one word. The words in this dictionary are the words that can be recognized.

– Language model: Restricts the word search by taking previous words into consideration. Some words follow other words more often.

All these components work together to go from audio to text. The acoustic model maps the feature vector to a phoneme. The most likely word for a combination of phonemes is found in the phonetic dictionary. This word is placed in the context of the previously found words by the language model [5].
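To make the interaction between the three components concrete, the toy sketch below scores candidate words by combining made-up acoustic phoneme probabilities, a small pronunciation dictionary, and a bigram language model. It is purely conceptual: the probabilities, words, and the scoring are illustrative assumptions, not the decoder used by CMU Sphinx or Kaldi.

```python
# Conceptual sketch of how acoustic model, dictionary and language model interact.
# All numbers and entries are made up for illustration.
import math

# Acoustic model: probability of each phoneme per audio frame (toy values).
acoustic_probs = [{"dh": 0.7, "d": 0.3}, {"ah": 0.6, "eh": 0.4}]

# Phonetic dictionary: word -> possible phoneme sequences.
dictionary = {"the": [["dh", "ah"], ["dh", "iy"]], "duh": [["d", "ah"]]}

# Bigram language model: probability of a word given the previous word (toy values).
bigram_lm = {("said", "the"): 0.2, ("said", "duh"): 0.001}

def score(word, previous_word):
    """Combine acoustic, dictionary and language-model evidence in the log domain."""
    best_acoustic = float("-inf")
    for phonemes in dictionary[word]:
        if len(phonemes) != len(acoustic_probs):
            continue
        s = sum(math.log(frame.get(p, 1e-6)) for frame, p in zip(acoustic_probs, phonemes))
        best_acoustic = max(best_acoustic, s)
    lm = math.log(bigram_lm.get((previous_word, word), 1e-6))
    return best_acoustic + lm

candidates = ["the", "duh"]
print(max(candidates, key=lambda w: score(w, "said")))  # -> 'the'
```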

Since individual speech systems work differently, they produce different results. The amount of processing that needs to be done also varies per model, and the adaptability of models differs among systems [2].

The method of deployment affects the adaptability and should therefore be taken into account. Below, three methods of deployment for a speech system are given with their advantages and disadvantages:

– Cloud solution: A cloud solution abstracts the model from the user. The Google Speech API is one that has a Dutch model available [14]. A few parameters can be changed, but options are limited. This makes it easy to use but limited in what can be adapted.

– Server solution: This method makes it possible to tune a model and then deploy it on a server. For example, the Kaldi toolkit is available only on Linux but can be used in a container and then run on any server. A Dutch model is available for Kaldi.

– Local solution: A local solution does processing on the device performing the recognition. Since speech recognition uses substantial processing power, it might not be suitable for every scenario. With this method adaptation of models is possible [17]. CMU Sphinx is suitable for local training and is available on Mac and Windows [17].

Creating or adapting a new model is only possible in open-source systems. Two of those systems are Kaldi and CMU Sphinx [2, 17]. Creating or adapting a model for a specific domain or acoustic environment might provide better performance than an existing model.

All speech models depend on data. Training data consists of audio speech accompanied by a transcription. This data might be hard to come by, especially in a language with fewer speakers and a specific domain.

Speaker diarization is the process of partitioning speech by speaker identity. The LIUM tool is able to perform such a task [11]. It must be trained for a specific language, and no Dutch model is available. Speakers can be distinguished and passed as a feature with the transcript [19]. Diarization can also be solved for phone calls when the incoming and outgoing channels are split.
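A minimal sketch of this channel-based shortcut is shown below, assuming the calls are recorded in stereo with one speaker per channel; pydub is one possible tool for the split, and the file names are hypothetical.

```python
# Sketch: poor man's diarization for phone calls recorded in stereo, assuming the
# incoming and outgoing audio sit on separate channels.
from pydub import AudioSegment

call = AudioSegment.from_wav("call.wav")       # hypothetical stereo recording
if call.channels == 2:
    customer, advisor = call.split_to_mono()   # channel 0 and channel 1
    customer.export("call_customer.wav", format="wav")
    advisor.export("call_advisor.wav", format="wav")
```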

Validation of ASR is commonly done by word error rate (WER). The equation is as follows:

WER = (I + D + S) / N × 100%    (1)

where I is the number of inserted words, D the number of deleted words, S the number of substituted words, and N the number of words in the reference transcript [14].

The word error rate gives a score to compare speech recognition systems to each other. In addition to this, the Jaccard index is used to measure similarity between two sets, namely the recognized transcript and the human-validated transcript.

The Jaccard index measures the overlap of two sets. This measure states the percentage of words captured in the transcript in comparison to the words actually spoken. The equation is:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)    (2)

The speed of speech recognition systems may differ between models. Speed is commonly measured as the ratio of recognition time to the duration of the speech. When this ratio is 1, recognition takes as long as the audio itself; different systems can be faster or slower than this.
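The following minimal sketch implements equations (1) and (2): WER via a word-level edit distance and the Jaccard index over the sets of words used. The example sentences are made up.

```python
# Minimal sketch of equations (1) and (2).
def wer(reference, hypothesis):
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance; d[i][j] = edits to turn r[:i] into h[:j].
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r) * 100       # (I + D + S) / N * 100%

def jaccard(reference, hypothesis):
    a, b = set(reference.split()), set(hypothesis.split())
    return len(a & b) / len(a | b)

ref = "goedemiddag u spreekt met de adviseur van zonneplan"
hyp = "goedemiddag spreekt met de adviseur van het plan"
print(round(wer(ref, hyp), 1), round(jaccard(ref, hyp), 2))  # 37.5 0.6
```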

Performance of ASR can be influenced by: adjusting or creating a model suitable for the application, matching the acoustic environment, using bigger but slower models, audio quality, audio level and noise, and matching the sample rate and bit rate between training data and test data.^1

^1 https://cmusphinx.github.io/wiki/faq/


2.2 Sentiment Analysis

Sentiment analysis deals with extracting subjective information from language. There are roughly two approaches [22]: the linguistic approach and the machine learning approach. The linguistic approach uses a dictionary of words with a semantic score. Every token of a text is matched to the dictionary and a score is calculated based on the matching words. A dictionary has to be made first for this approach; these can be created manually.

In the machine learning approach a classification algorithm is presented with a series of feature vectors derived from previous data; in the case of supervised classification, labels are attached. Feature extraction is the step taken to go from text to a numerical representation. A machine learning algorithm is trained on this data, and a test set is used to measure its performance. The machine learning approach is usually more adaptable and more accurate [22]. For this study, textual data is available in the form of speech transcripts.

2.3 Speech Sentiment Analysis

Sentiment analysis on speech can be based on emotion detection from acoustic features [7] or on textual sentiment extracted from the audio [3]. Both approaches can be combined to give a single prediction. Textual sentiment has greater predictive power than acoustic features [7, 9].

Another approach is classification on the transcript of speech [3]. A number of classification algorithms have been used, of which a Support Vector Machine (SVM) performed best. That work states that the transcripts represent just 44% of what is originally said and that clustering on sentiment is still successful. A limitation of this work is the use of an artificially generated data set.

Differentiating speakers and performing sentiment analysis on each individual speaker is a viable approach [19]. A shortcoming of this work is also the artificially generated data set. An alternative is keeping the conversational structure and performing sentiment analysis on the whole conversation [3].

Besides the multiple options for transcripts, there are also multiple approaches to sentiment analysis on transcripts. For text, either the linguistic approach or the machine learning approach is used in research [1]. Both approaches are successful in predicting sentiment. The problem with a linguistic method is the manual evaluation and judgment of the lexicon.

In short, some research has been done on speech sentiment analysis, but more should be done using real data and a machine learning classification approach. Difficulties with audio sentiment analysis are that speech is often less structured, contains pauses and breaks, and does not always follow the rules of grammar. Also, the speech recognizer makes errors, which creates noise in the transcripts.

2.4 Sentiment Visualization

Sentiment visualization has matured from pie and bar charts to extensive visualizations and has become a notable topic of research [13]. Frequently used techniques in the literature are clustering and classification; comparison and creating an overview are also mentioned often [13].

An information need is not always present during exploration [6]. Therefore, a visual approach is helpful in the exploration of sentiment. A visual approach can help guide the user and has a positive effect on engagement. The results suggest that users spend more time performing tasks when using scatter plots, and words that are distinctive for their respective class can be found more easily.

2.5 Corpus Comparison

Comparing word frequencies in corpora can be done with statistical significance testing [8]. Commonly used tests are Pearson's χ² test, the log-likelihood ratio test, and the Wilcoxon rank-sum test. However, according to some research the χ² test is not suitable for text [12], because common words defeat the null hypothesis too easily. The log-likelihood ratio test is also problematic, since it assumes independent samples whereas words in text are not independent [8].

Word frequencies in transcripts can be tested for significant differences between ’sale’ and ’no sale’ classes. The Wilcoxon rank-sum test can be applied to the total word frequencies in both classes. Individual words can also be tested on their frequency in multiple conversations.

3 METHODOLOGY

The approach presented in this paper uses automatic speech recognition (ASR) on recorded phone calls. This results in a transcript of all spoken words in the conversation. Three speech models are used to transcribe, and their performance is compared. The models are:

– CMU Sphinx: A model is trained with the use of the open source toolkit CMU Sphinx.

– Google Speech API: A cloud solution with an existing model for Dutch.

– Kaldi-NL: An existing model for Kaldi is used on a server.

Transcriptions are made with the use of speech recognition. Text-based sentiment analysis is performed on the transcripts and results in a 'sale' or 'no sale' prediction. Classification of telephone calls that convert or not is used to find key differences between both classes.

Corpus differences between the 'sale' and 'no sale' classes are visualized and tested for word frequency differences. The visualizations and measures are shown in the results section.

The first subsection describes the creation of a model; after that, the other two models are described. Then the approach to sentiment analysis is reported, followed by the approach to visualization.

3.1 Creating a speech recognition model with CMU Sphinx

Creating or adapting a model for a specific domain or acoustic environment might provide better performance than an existing model. Two of these open-source systems are Kaldi and CMU Sphinx [2, 17]. Kaldi is only available for Linux, while CMU Sphinx is available for Mac and Windows; therefore, CMU Sphinx is chosen for this research. A speech recognition model is trained on data consisting of speech audio with a transcript. The model in this research is Dutch. For the best performance the acoustic environment should match between training and testing, and the sampling rate should also be the same.^2

3.1.1 Data. The Dutch audio used for training the model is the Spoken Wikipedia Corpora.^3 Wikipedia contains articles about individual topics, which makes for a diverse set of words. The Dutch Spoken Wikipedia Corpora contains 79 hours of word-aligned spoken language. The time alignment is not done for every word, but every word in the speech audio needs a transcript. Misalignment can occur when the speech contains utterances that are not in the transcript; the recognizer then hears audio and uses the wrong word for that utterance. This is solved by taking the time stamp of a word near the beginning of a sentence and one near the end of a sentence and taking all words in between. This creates sentence-length audio clips accompanied by the text of the sentence.

3.1.2 Preprocessing Audio. In CMU Sphinx audio clips must be roughly the length of one sentence. The audio clips are stored per article, so these are cut to the appropriate length. The reason for roughly one sentence length is that longer audio clips might get out of sync with the text. When the words are no longer aligned with the speech, the wrong words are recognized for that speech. This can also be a problem when some words are spoken but not in the text.

Audio clips cannot simply be cut at a set interval, because then words are cut in half. The audio clips must be cut on the time stamps of the words spoken in the clip and contain the words between those time stamps. The audio is cut on the time stamp and the words spoken in that audio clip are saved. Audio is cut using FFMPEG.^4

CMU Sphinx uses the '.WAV' audio format. FFMPEG is used to convert audio formats. It is also necessary to have uniformity in the sample rate: other sample rates can be used, but the training data must have the same sample rate as the recognized speech. The audio should also be mono, with a single channel.
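A sketch of this clip extraction step is given below. The file names, timestamps and 16 kHz sample rate are assumptions; the FFMPEG options used (-ss, -to, -ac, -ar) are standard flags for cutting and for forcing mono audio at a given sample rate.

```python
# Sketch: cut one sentence out of an article recording and convert it to mono WAV
# with FFMPEG. Paths and timestamps are made up.
import subprocess

def cut_clip(source_mp3, start_sec, end_sec, target_wav, sample_rate=16000):
    subprocess.run([
        "ffmpeg", "-y",
        "-i", source_mp3,
        "-ss", str(start_sec),       # time stamp of the first word of the sentence
        "-to", str(end_sec),         # time stamp of the last word of the sentence
        "-ac", "1",                  # mono, single channel
        "-ar", str(sample_rate),     # uniform sample rate for training and recognition
        target_wav,
    ], check=True)

cut_clip("article_0001.mp3", 12.4, 17.9, "article_0001_sent_003.wav")
```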

3.1.3 Training the CMU Sphinx speech model. CMU Sphinx has a Dutch dictionary and language model available.^5 The acoustic model is trained on 13 hours of speech, which is usually far too little [20]. However, this model is still useful for adapting with the data discussed above. Adaptation of a model is suitable for increasing the accuracy of the model and adapting it to an acoustic environment. This means no model has to be built from scratch. With the language model, dictionary and training data in place, the model is trained.

The steps taken in training the model are as follows. An acoustic feature file is generated for every individual audio clip. The next step is collecting statistics from the adaptation data. After that, adaptation of the HMM is done. Two methods that are frequently used are Maximum A Posteriori (MAP) and Maximum Likelihood Linear Regression (MLLR); a combination of the two is most successful [18]. Adaptation is therefore done using a combination of the two methods.

^2 https://cmusphinx.github.io/wiki/tutorialadapt/
^3 https://nats.gitlab.io/swc/
^4 https://ffmpeg.org/
^5 https://sourceforge.net/projects/cmusphinx/files/

The trained model is able to transcribe phone call conversations it has not been trained on. The other two existing models can also be used to transcribe the same phone call conversations. These models can then be compared on errors and speed. How the other two models are used to transcribe is discussed below.

3.2 Speech Recognition on a server with Kaldi-NL

A model for Dutch has been created with the Kaldi speech toolkit at the University of Twente.^6 The Kaldi toolkit is only available on Linux; to make it usable on Windows and Mac, a container is set up with Docker. Containerization solves the problem of system dependencies by providing the container with all its dependencies and abstracting it from the operating system. A port is opened on the Docker container to communicate with the host computer. A bind mount serves the purpose of accessing files between host and container. A decode script is called in the container environment with the audio and output directory as arguments.
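A possible invocation is sketched below. The image name, mount paths, port and decode script name are hypothetical; only the Docker flags themselves (-v for a bind mount, -p for a port) are standard.

```python
# Sketch of calling a Kaldi-NL decode script inside a Docker container.
import subprocess

subprocess.run([
    "docker", "run", "--rm",
    "-p", "8080:8080",                         # port opened towards the host
    "-v", "/data/calls:/audio",                # bind mount with the phone call audio
    "-v", "/data/transcripts:/output",         # bind mount for the resulting text
    "kaldi-nl-image",                          # hypothetical image with Kaldi-NL installed
    "./decode.sh", "/audio", "/output",        # decode script with audio and output dirs
], check=True)
```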

Preparation of the transcription involves segmentation. The recognition is done with the use of a neural network. Rescoring is done as a last step to improve the recognition rate. The transcripts are saved as text for every phone call.

3.3 Cloud speech recognition with Google Speech API

The Google Speech API^7 is a cloud solution for speech recognition. A connection is made to the API with a Python script. The audio is sent to the cloud and a transcription is retrieved. A Dutch model is selected for this. The Google Speech API has a synchronous and an asynchronous process. The synchronous process uses local files of at most one minute. The asynchronous process can handle longer audio files, but these must be stored in the Google cloud. Since phone calls are often longer than one minute, the asynchronous process is used to transcribe.
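A sketch of such an asynchronous request with the Google Cloud Python client is given below. The bucket URI, sample rate and audio encoding are assumptions about the call recordings, not values taken from this study.

```python
# Sketch of an asynchronous request to the Google Speech API with the Python client.
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=8000,                     # assumed telephone sample rate
    language_code="nl-NL",                      # Dutch model
)
audio = speech.RecognitionAudio(uri="gs://example-bucket/call_0001.wav")

operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=3600)       # wait for the asynchronous job

transcript = " ".join(r.alternatives[0].transcript for r in response.results)
print(transcript)
```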

Although a Dutch model is available, there is no phone call acoustic model for Dutch. The acoustic environment of a phone call differs from, for example, a desktop microphone. The effect of a mismatch in acoustic environment is worse recognition performance.

3.4 Speech Recognition Comparison

The evaluation of speech recognition is done with the textdistance^8 package in Python. The Jaccard index and WER are calculated for the transcripts of the three speech systems in comparison to a manually adjusted transcript. One hour of phone calls is manually transcribed for evaluation.
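A minimal sketch of this evaluation step is shown below, assuming the automatic and manual transcripts are available as plain text files (the file names are made up) and that textdistance's Jaccard similarity is applied to the two word sequences.

```python
# Sketch of the transcript evaluation with the textdistance package.
import textdistance

with open("kaldi_transcript.txt") as f, open("manual_transcript.txt") as g:
    hypothesis, reference = f.read().split(), g.read().split()

# Jaccard similarity between the words of the recognized and the manual transcript.
print(textdistance.jaccard(reference, hypothesis))
```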

3.5 Sentiment Analysis

The purpose of sentiment analysis on phone calls is creating insights into positive and negative calls. A positive call is a sale and a negative call is no sale. This is a binary classification problem. Three algorithms are used to predict the correct class. The algorithms are capable of finding distinguishing terms between classes. These terms are visualized and a comparison is made between the corpora. The data used for classification is discussed in the next subsection.

^6 https://github.com/opensource-spraakherkenning-nl
^7 https://cloud.google.com/speech-to-text/docs/
^8 https://pypi.org/project/textdistance/


3.5.1 Data. The data for sentiment analysis are phone call conversations between an advisor and a customer. The subject of these phone calls is solar panels, and the language spoken is Dutch. The conversations are about the placement of solar panels on roofs and everything that has to do with it. Conversations do not follow guidelines and can go in many different directions.

Attributes for the phone calls are: sale or no sale, date, duration, customer name, advisor name, and direction of call. Classification is done on the sale or no sale attribute.

Two groups of phone calls are made. The first group consists of sales intakes, which are the first call with a customer. The second group is a random sample, in which calls can be sales intakes or follow-up calls about sales. For both groups a thousand calls are selected.

Two thousand calls are selected and automatically transcribed by the speech recognition system. Calls are roughly ten minutes long, which makes the total duration about 166 hours per group. Table 1 shows the distribution between 'sale' and 'no sale' conversations.

Table 1: Phone call data

Call group           Sale   No Sale   Total   Duration (hours)
First conversation   410    590       1000    166
Random sample        630    370       1000    166

3.6 Classification of phone calls

Classification is done with the use of machine learning. The machine learning approach uses a label for each call. This label is an indicator of customer interest and can be seen as positive or negative. Making a distinction between these groups is binary classification. Classification is done on words: different words are used in positive and negative phone calls. The sale or no sale attribute serves as the label for classification; therefore this is supervised classification.

Suitable supervised classification algorithms for text are Naive Bayes, Support Vector Machine (SVM) and Logistic Regression [4, 23]. These algorithms are tested on the same transcriptions of speech. The performance metrics are shown in the results section. Before words can be used by a classifier, some pre-processing needs to be done. First, special characters are removed and words are turned to lowercase. Words that occur in almost every document or in almost none are also removed, since these are not important for classification. These parameters are tuned for the best classification performance.

A classification algorithm is not able to use words directly, so the words in a document are represented by a vector. The words are given a tf-idf score. Tf stands for term frequency: the number of times the word occurs in the document. Idf stands for inverse document frequency, a score that gets lower when a word occurs in more documents. The vector can consist of single words as terms or of multiple-word terms called n-grams. When, for example, the word "good" has "not" in front of it, it means the opposite of good. A bi-gram can take these negations into account whereas a single term cannot. Uni-grams and bi-grams are tested and their performance is shown in section 4.2.
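A minimal sketch of this setup with scikit-learn is shown below. The toy transcripts, the test split and the vectorizer settings are assumptions for illustration; the actual document-frequency cut-offs and n-gram range are tuned as described above.

```python
# Sketch of tf-idf features plus the three classifiers, on toy stand-in transcripts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

texts = [
    "ja akkoord dat is prima doen we",               # toy stand-ins for transcripts
    "akkoord prima dan plannen we het in",
    "goed akkoord tot zometeen",
    "ik wil eerst meer informatie over het rendement",
    "misschien heb ik interesse stuur maar informatie",
    "geen interesse in zonnepanelen op dit moment",
]
labels = [1, 1, 1, 0, 0, 0]                           # 1 = sale, 0 = no sale

# ngram_range=(1, 2) would add bi-grams; min_df/max_df are tuned on the real corpus.
vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 1))
X = vectorizer.fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.33, random_state=0)

for clf in (LinearSVC(), MultinomialNB(), LogisticRegression(max_iter=1000)):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))
```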

Parameters are tuned for the best classification performance. The ROC curve gives insight into the true positive rate and the false positive rate. The true positive rate is the proportion of positives correctly identified as positive. The false positive rate is the proportion of negatives incorrectly identified as positive. With the use of the ROC curve the best tuned model is selected. The ROC curve for SVM is shown in figure 1.

Figure 1: ROC curve of SVM on sales conversations
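A plot like Figure 1 can be produced as sketched below with scikit-learn and matplotlib; the labels and decision scores are made-up examples standing in for the SVM's decision_function output on the test set.

```python
# Minimal sketch of plotting a ROC curve from classifier scores.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

y_true = [1, 1, 1, 0, 1, 0, 0, 0]                    # 1 = sale, 0 = no sale
scores = [0.9, 0.8, 0.4, 0.35, 0.3, 0.2, 0.6, 0.1]   # e.g. SVM decision_function output

fpr, tpr, _ = roc_curve(y_true, scores)
plt.plot(fpr, tpr, label=f"SVM (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")             # chance level
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```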

Accuracy can be a deceiving measure on skewed data. This is not the case here, since training and test data are balanced for both groups. Therefore, an increase in precision and recall usually corresponds to an increase in accuracy.

3.7 Sentiment Visualization

Classification coefficients indicate words that occur more often in one group than the other. Larger weights contribute more to the classification as positive or negative; these terms have the biggest impact on the classification. Visualizations clarify these coefficients with the purpose of finding terms that are distinctive for a group.

The scatter plot of Figure 2 shows the word frequency in the 'sale' and 'no sale' groups. This scatter plot is created with Python Pyplot. The scatter plot shows the frequency of words but does not clearly show linguistic variation between groups. Since the number of words per corpus is not equal, word occurrence per 25k words is used. The words shown in the scatter plot are chosen because of their coefficients.

Visualizing linguistic variation between groups of text is done with the Scattertext tool.^9 Instead of raw frequency, Scattertext uses a ratio to visualize relative occurrence. Scattertext is based on a scatter plot which displays a number of words in the corpus. The coordinates of a word indicate its frequency and the ratio of its occurrence in both categories [10]. The Scattertext plot is shown in Figure 3.
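A sketch of building such a plot with the Scattertext library is given below. The dataframe column names and the toy transcripts are assumptions; the real corpus would contain the two thousand transcripts with their 'sale'/'no sale' labels.

```python
# Sketch: build an interactive Scattertext plot of 'sale' versus 'no sale' terms.
import pandas as pd
import scattertext as st

df = pd.DataFrame({
    "text": ["ja akkoord prima tot zometeen", "geen interesse stuur maar informatie"],
    "label": ["sale", "no sale"],              # toy stand-ins for the transcripts
})

corpus = st.CorpusFromPandas(
    df, category_col="label", text_col="text",
    nlp=st.whitespace_nlp_with_sentences,      # simple tokenizer instead of spaCy
).build()

html = st.produce_scattertext_explorer(
    corpus, category="sale", category_name="Sale", not_category_name="No sale",
)
with open("scattertext.html", "w") as f:       # open in a browser to explore terms
    f.write(html)
```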

^9 https://github.com/JasonKessler/scattertext


The odds ratio is a measure of association between two groups. The log-odds ratio is calculated by dividing the relative word occurrence in 'sale' by the relative word occurrence in 'no sale' and taking the logarithm.
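A worked example of that calculation is shown below; the counts are made up and the natural logarithm is assumed, since the base is not specified above.

```python
# Worked example of the log-odds ratio: relative occurrence in 'sale' divided by
# relative occurrence in 'no sale', then the (assumed natural) log.
import math

def log_odds_ratio(count_sale, total_sale_words, count_no_sale, total_no_sale_words):
    rel_sale = count_sale / total_sale_words
    rel_no_sale = count_no_sale / total_no_sale_words
    return math.log(rel_sale / rel_no_sale)

# Made-up counts: a word used 80 times in 500k 'sale' words, 25 times in 400k 'no sale' words.
print(log_odds_ratio(80, 500_000, 25, 400_000))   # positive value -> leans towards 'sale'
```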

With Scattertext, terms can also be queried and the transcripts that contain a term can be found. When, for example, a word is predominantly positive or negative, the phone calls containing this word can be retrieved.

3.8 Corpus Comparison

To determine a difference in word frequency between both conversation classes, a statistical test is used. According to the literature, the Wilcoxon rank-sum test is suitable for comparing word frequencies between two corpora [8, 12].

The Wilcoxon rank-sum test serves to test significance of word frequency in ’sale’ and ’no sale’ conversations. The test is performed on words with their total frequency in the ’sale’ or ’no sale’ group. The test is also performed on individual words and their frequency in a thousand individual conversations.

The reason that the test is not done on every word for every conversation is the multiple comparisons problem. This problem is avoided here by not testing all words individually.

Classification delivers a list of 1500 words. For SVM the list contains 750 positive and 750 negative coefficients. The positive and negative words are both tested. The word occurrence for ’sale’ and ’no sale’ is counted and tested. Positive and negative coefficients are split in the test to measure a difference between the two.
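A sketch of the test itself with SciPy is given below. The frequency vectors are made-up placeholders for the per-word counts described above; only the call to scipy.stats.ranksums reflects the actual test used.

```python
# Sketch of the corpus comparison: Wilcoxon rank-sum test on word frequencies.
from scipy.stats import ranksums

# Made-up total frequencies of positive-coefficient words in each corpus.
freq_in_sale = [120, 85, 60, 44, 30, 22, 18, 15]
freq_in_no_sale = [70, 55, 41, 30, 19, 16, 12, 9]

statistic, p_value = ranksums(freq_in_sale, freq_in_no_sale)
print(statistic, p_value)
```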

4 RESULTS

The results section follows this structure: speech recognition evaluation, classification performance, sentiment visualization and corpus comparison. Measurements are given for speech recognition and classification performance. Visualizations are shown in multiple figures. The significance of word differences is given in tables. Findings are stated and problems encountered are discussed.

4.1 Speech Recognition Evaluation

The speech recognition models compared in this paper are: Google Speech API for Dutch, the open source Kaldi-NL speech model, and a model trained in CMU Sphinx. These models work differently and have different attributes. First the performance measures are given. After that other differences are stated.

Different speech recognition models make different errors. The systems are compared on word error rate (WER) and Jaccard Index as mentioned in the related literature. Table 2 gives the performance measures on phone call conversations.

Table 2: ASR performance measures on Dutch phone call conversations

ASR system          WER     Jaccard index
Kaldi-NL            37.6%   55.4%
Google Speech API   74.2%   22.8%
CMU Sphinx          79.3%   15.9%

As can be seen in table 2, Kaldi-NL has the lowest error rate and captures most of the original conversation. Besides these metrics there are other differences between systems. Some differences are: speed of computation, access and ease of use, additions to the model, and format of output. The models are each discussed in the sections below.

4.1.1 CMU Sphinx. A problem encountered during training is that the training data mostly consists of a few speakers. The Spoken Wikipedia Corpus has 145 speakers, but a few top speakers contributed a large part of the corpus. When training on just a few people, the model fits to those speakers. Speaker-independent dictation requires many people; therefore the performance of this system is not optimal.

There is no segmentation to handle speakers talking at the same time. The speed of dictation is also lacking compared to the other systems. For recognition CMU Sphinx uses '.WAV' files. These are uncompressed and can be multiple times larger than the '.mp3' files that the other systems are able to use for recognition. There are other factors that contribute to speed; this is however difficult to compare, since Google Speech API is only available in the cloud and this is not comparable to local recognition.

4.1.2 Kaldi-NL. One limitation of Kaldi is its availability on Linux only. This problem was solved by running Kaldi in a Docker container. A connection can be made to the container from any device that has network capabilities.

One way the performance of Kaldi increases is segmentation. Individual speakers are partitioned into segments before recognition is started. The problem of speakers speaking at the same time is solved this way.

It is difficult to compare the speed of a local system, a server, and a cloud solution because of different hardware. An advantage of using Kaldi in Docker is scalability: a Docker environment can scale to multiple machines in a cluster, so the speed of dictation is easily adjusted by the scale of the cluster.

4.1.3 Google Speech API. Google Speech API is a cloud solution. The advantage is ease of implementation. The drawback of an API connection is the limited possibilities for adjustment. The Dutch model of Google does not have diarization and there is also no phone call acoustic model available. These features are available for the English model.

The output of Google's dictation is significantly shorter than that of the other speech systems. The number of words in the evaluation is just 26% of Kaldi-NL's. However, the words that are in the output capture 22.8% of the spoken text, as can be seen in Table 2. This suggests that Google only outputs words it is confident about above a certain threshold.

4.2 Classification

Classification is done twice: on sales intakes and on a random sample of sales calls. The three classification algorithms used are Naive Bayes, Logistic Regression, and SVM. For each algorithm uni-grams and bi-grams are evaluated. The accuracy, precision, and recall on a random sample of sales calls are shown in Table 3. On the test data SVM has the highest accuracy of 79%. Using bi-grams lowers the classification scores for all algorithms.

(7)

Table 3: Classification on sales transcripts

Algorithm                     Accuracy   Precision   Recall
SVM                           79%        0.79        0.79
Naive Bayes                   77%        0.76        0.77
Logistic Regression           75%        0.76        0.77
Naive Bayes bi-gram           75%        0.76        0.75
SVM bi-gram                   72.5%      0.73        0.72
Logistic Regression bi-gram   69%        0.73        0.73

Classification is also done on the first call with a customer. These classification measures are shown in Table 4. Here the Naive Bayes algorithm performs best. Therefore one classification algorithm does not outperform another in all cases.

Table 4: Classification on sales intakes

Algorithm                     Accuracy   Precision   Recall
Naive Bayes                   67%        0.68        0.67
Naive Bayes bi-gram           65%        0.67        0.65
SVM                           63%        0.62        0.63
Logistic Regression           62%        0.62        0.62
Logistic Regression bi-gram   62%        0.61        0.63
SVM bi-gram                   55%        0.54        0.55

The performance measures indicate that precision and recall are often similar. This indicates that the classification algorithms do not favor one over the other: they balance retrieving all relevant items against retrieving only relevant items.

4.3 Sentiment Visualization

The frequency of words in a ’sale’ or ’no sale’ conversation is shown in the scatter plot of figure 2. The words shown are the top ten words in ’sale’ and the top ten words in ’no sale’.

Figure 2: Frequency/frequency scatterplot of words

The scatter plot shows some words that occur more often in one group than in the other. However, the difference is not easily quantified this way. The Scattertext plot adds to this.

The Scattertext plots the log-odds ratio and the log frequency of words. The 1500 words of the classification are shown. Words that occur in fewer than 5 documents are removed, as are words that occur in more than 70% of the documents. The top positive and top negative words are shown, since not all words fit in one figure. The Scattertext plot is shown in figure 3.

4.4 Corpus Comparison

To test statistical significance in word frequency difference the Wilcoxon rank-sum test is used. The Wilcoxon rank-sum test is applied to total word frequency for positive and negative words in the ’sale’ and ’no sale’ class. The test is also done on some individual words and their word frequency in a thousand transcripts.

The result of the Wilcoxon rank-sum test for positive and negative words is shown in Table 5. The results are highly significant for both groups according to the p-values.

Table 5: Wilcoxon rank-sum test on total word frequency

Group      Frequency ratio   p-value
Positive   1.59764           2.05208e-06
Negative   0.62592           1.40809e-10

Testing both the positive and the negative word frequencies results in significant differences. However, on an individual word level results may vary. The visualizations surfaced words that occur more frequently in one group; some of these words are shown in Table 6 with the p-value of the Wilcoxon rank-sum test.

The words listed in Table 6 are all significant. Not all words are tested for significance, since testing all words individually introduces the multiple comparisons problem.

As can be seen in Table 6, a higher sale/no-sale ratio does not necessarily mean a lower p-value. This is likely due to the distribution of words in the corpus, where words tend to appear in bursts.

Table 6: Wilcoxon rank-sum test on word frequency

Dutch word   English       sale/no-sale ratio   p-value
Akkoord      Agree         3.64                 1.16603e-45
Prima        Fine          1.29                 4.80939e-11
Goed         Good          1.1                  2.50576e-11
Zometeen     Later         1.32                 2.63690e-05
Straks       Soon          1.06                 1.99887e-05
Jij          You           1.13                 7.77938e-06
Misschien    Maybe         0.8                  0.00142
Informatie   Information   0.6                  0.00084
Rendement    Profit        0.52                 0.00545
Interesse    Interest      0.32                 3.19611e-11


Figure 3: Scattertext of positive and negative words

5 DISCUSSION

The research presented in this paper finds a significant difference in word frequency between 'sale' and 'no sale' phone call conversations. There is a significant difference both for individual words and for the groups of positive and negative words.

Classification can be performed successfully because of this difference in word frequency. The classification accuracy of SVM on a random sample of sale conversations is 79%. Classification accuracy for the first conversation is 67% with the Naive Bayes algorithm. This indicates that more diverse conversation content brings about higher classification accuracy. Also, no single classification algorithm is best in every case.

The approach of this paper is generic enough to be applied to other domains and other languages. When introducing another domain, the classification algorithm should be trained for that domain, since words might have different meanings in different contexts. When using this approach for another language, a different speech model is necessary for that language, and the classification algorithm also needs to be trained for that language.

The transcriptions used in this study are created with the Kaldi-NL model. These transcriptions capture more than half of the spoken words, with a word error rate of 37.6%. This is considerably more accurate than the models of Google and CMU Sphinx. The speech recognition systems have distinct limitations and strengths; it might be possible for the other systems to perform better in a different context.

Insights can be gathered on distinctive words. Some findings are given below:

– Words that indicate confirmation occur significantly more often in sale conversations.

– Uncertain words occur significantly more often in no sale conversations.

– The words interested, information and options occur significantly more often in no sale conversations.

– Words that indicate advance occur significantly more often in sale conversations.

– When the conversation is on a first-name basis, it is significantly more often a sale.

It is beyond the scope of this research to conclude why a word appears more often in one group than the other. Other limitations of this study are: no speaker diarization and no acoustic features for sentiment prediction. These additions might be able to improve classification performance.

6 CONCLUSION

This research demonstrates that significant differences in word frequency can be found between ’sale’ and ’no sale’ phone call conversations about solar panels in Dutch. Classification can be successfully performed due to the difference in word frequency. In general, these results suggest that classification can be performed when a significant difference in word frequency is apparent.

The approach of this study uses speech recognition to create a transcript. Even when this transcript contains errors, classification can still be performed. Classification is done with machine learning; there is no algorithm that performs best in all cases.

Words that appear more often in one class than the other are found with visualizations. With the current approach significant distinctive words can be found.

7 FUTURE WORK

The transcripts used in this study contain all spoken words in a conversation; who said what is not taken into account. This work has shown that it is possible to classify accurately without speaker diarization. However, knowing who said what in a conversation might improve classification performance.

In this system speech recognition is done after the conversation. It is also possible to do online recognition, in which case the speech recognition is done during the conversation. Words that carry sentiment can then be highlighted during the conversation and a classification score can be calculated.

Terms that appear more often in one class can be found with the use of classification. The effect that these words have on a conversation is not known. A next step would be finding out why these appear more often in either group and what the effect is of using these words in a conversation.

REFERENCES

[1] Birgitta Ojamaa, Päivi Kristiina Jokinen, and Kadri Muischnek. 2015. Sentiment analysis on conversational texts. NODALIDA (2015).

[2] Christian Gaida, Patrick Lange, Rico Petrick, Patrick Proba, Ahmed Malatawy, and David Suendermann-Oeft. 2014. Comparing Open-Source Speech Recognition Toolkits. DHBW (2014).

[3] Souraya Ezzat, Neamat El Gayar, and Moustafa M. Ghanem. 2010. Investigating Analysis of Speech Content through Text Classification. International Conference of Soft Computing and Pattern Recognition (Dec. 2010), 105–110. https://doi.org/10.1109/SOCPAR.2010.5686000

[4] G V Garje, Apoorva Inamdar, Apeksha Bhansali, Saif Ali Khan, and Harsha Mahajan. 2016. Sentiment Analysis: Classification and Searching Techniques. IRJET 3, 4 (April 2016), 2796–2798.

[5] Mark Gales and Steve Young. 2007. The Application of Hidden Markov Models in Speech Recognition. Foundations and Trends in Signal Processing 1, 3, Article 3 (Jan. 2007), 195–304 pages. https://doi.org/10.1561/2000000004

[6] Eduardo Graells-Garrido, Mounia Lalmas, and Ricardo Baeza-Yates. 2016. Sentiment Visualisation Widgets for Exploratory Search. Social Personalization Workshop (Jan. 2016).

[7] David Griol, José Manuel Molina, and Zoraida Callejas. 2019. Combining speech-based and linguistic classifiers to recognize emotion in user spoken utterances. Neurocomputing 326-327 (Jan. 2019), 132–140. https://doi.org/10.1016/j.neucom.2017.01.120

[8] Jefrey Lijffijt, Terttu Nevalainen, Tanja Säily, Panagiotis Papapetrou, Kai Puolamäki, and Heikki Mannila. 2014. Significance testing of word frequencies in corpora. Literary and Linguistic Computing 31, 2 (Dec. 2014), 374–397. https://doi.org/10.1093/llc/fqu064

[9] Jia Sun, Weiqun Xu, Yonghong Yan, Chaomin Wang, Zhijie Ren, Pengyu Cong, Huixin Wang, and Junlan Feng. 2016. Information Fusion in Automatic User Satisfaction Analysis in Call Center. IHMSC (Aug. 2016). https://doi.org/10.1109/IHMSC.2016.49

[10] Jason Kessler. 2017. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations (April 2017).

[11] Eva Kiktova and Jozef Juhar. 2015. Comparison of Diarization Tools for Building Speaker Database. Information and Communication Technologies and Services 13, 4 (Nov. 2015). https://doi.org/10.15598/aeee.v13i4.1468

[12] Adam Kilgarriff. 2011. Comparing Corpora. International Journal of Corpus Linguistics 6, 1 (Nov. 2011), 97–113. https://doi.org/10.1075/ijcl.6.1.05kil

[13] Kostiantyn Kucher, Carita Paradis, and Andreas Kerren. 2017. The State of the Art in Sentiment Visualization. Computer Graphics Forum 37, 1 (June 2017), 71–96. https://doi.org/10.1111/cgf.13217

[14] Veton Këpuska and Gamal Bohouta. 2017. Comparing Speech Recognition Systems (Microsoft API, Google API And CMU Sphinx). IJERA 7, 3 (March 2017). https://doi.org/10.9790/9622-0703022024

[15] Mika V. Mäntylä, Daniel Graziotin, and Miikka Kuutila. 2018. The Evolution of Sentiment Analysis - A Review of Research Topics, Venues, and Top Cited Papers. Computer Science Review 27 (Feb. 2018), 16–32. https://doi.org/10.1016/j.cosrev.2017.10.002

[16] Don O’Sullivan and John McCallig. 2012. Customer satisfaction, earnings and firm value. European Journal of Marketing 46, 6, Article 2 (March 2012), 20 pages. https://doi.org/10.1108/03090561211214627

[17] Paul Lamere, Philip Kwok, Evandro Gouvêa, Bhiksha Raj, Rita Singh, William Walker, Manfred Warmuth, and Peter Wolf. 2003. The CMU Sphinx-4 Speech Recognition System. ICASSP (Jan. 2003).

[18] T Ramya, S Lilly Christina, P Vijayalakshmi, and T Nagarajan. 2014. Analysis on MAP and MLLR based speaker adaptation techniques in speech recognition. IEEE (March 2014). https://doi.org/10.1109/iccpct.2014.7054938

[19] Maghilnan S and Rajesh Muthu. 2018. Sentiment Analysis on Speaker Specific Speech Data. I2C2 (Feb. 2018). https://doi.org/10.1109/I2C2.2017.8321795

[20] Sameer Bansal, Herman Kamper, Karen Livescu, Adam Lopez, and Sharon Goldwater. 2018. Low-Resource Speech-to-Text Translation. Interspeech (Sept. 2018). https://doi.org/10.21437/Interspeech.2018-1326

[21] Shaha Al-Otaibi, Allulo Alnassar, Asma Alshahrani, Amany Mubarak, Sara Albugami, Nada Almutiri, and Aisha Albugami. 2018. Customer Satisfaction Measurement using Sentiment Analysis. IJACSA 9, 2 (Jan. 2018), 106–117. https://doi.org/10.14569/IJACSA.2018.090216

[22] Harsh Thakkar and Dhiren Patel. 2015. Approaches for Sentiment Analysis on Twitter: A State-of-Art study. (Dec. 2015).

[23] Alexandre Trilla and Francesc Alias. 2013. Sentence-Based Sentiment Analysis for Expressive Text-to-Speech. IEEE 21, 2, Article 2 (Feb. 2013), 223-233 pages. https://doi.org/10.1109/TASL.2012.2217129
