
Visualizing conversations using knowledge graphs

Layout: typeset by the author using LaTeX.

Visualizing conversations using knowledge graphs

Sabijn Perdijk
11864265

Bachelor thesis
Credits: 18 EC
Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
dr. B. Bredeweg
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 26th, 2020


Abstract

Visual representation is one of the most effective methods of summarizing text. This method is not restricted to summarizing textual documents; it can also be used in more general settings such as conversations and discussions. Visualization of discussions stimulates collaborative processes and improves critical thinking. Therefore, in this research, an application has been designed to visually represent conversations. To design this application, the best method to extract and structure keywords from a continuous flow of text is explored.


Contents

1 Introduction
2 Theoretical background
  2.1 Visual representation
    2.1.1 Different methods of visual representation
  2.2 Relevant concepts
  2.3 Method
3 Implementation & design
  3.1 Speech recognition
  3.2 Time-based chart
  3.3 Topic-based knowledge graph
4 Results
5 Evaluation
  5.1 Time-based chart
  5.2 Knowledge graph based on paradigmatic relations
  5.3 Proof-of-concept
6 Discussion
7 Conclusion


Acknowledgments

I would like to thank my supervisor Bert Bredeweg, for providing guidance, ideas and useful critiques throughout this project.

Furthermore, I would like to thank Angelo for reading and re-reading my thesis, for the useful feedback, and for the funny comments.


Chapter 1

Introduction

As the world wide web develops, structuring, summarizing, and evaluating large amounts of information has become increasingly important [21]. One of the most effective methods of summarizing text is visual representation [13]. Not only can visual representation be used to summarize textual documents, but also to summarize conversations and discussions. AI could be helpful by automatically generating visual representations of conversations. Most of the research on visualizing conversations has focused on conversations conducted in English [5, 4, 20, 3]. However, it could also be advantageous to develop this kind of visualization for conversations in other languages. This research therefore focuses on visualizing conversations conducted in Dutch.

The study aims to answer the following question: how can conversations be visualised using graphs? To answer this question, the following sub-questions are formulated. First, how can relevant concepts be extracted from a continuous flow of text? Second, how can the filtered keywords be structured such that the attendees are able to recall the conversation to a greater degree? Third, to what extent can the knowledge graph be interactive?

To answer these research questions, a theoretical background is discussed in chapter Theoretical background. After that, the implementation and the design are discussed in chapter Implementation & design and evaluated in chapters Evaluation and Discussion. Lastly, a conclusion is drawn in chapter Conclusion.


Chapter 2

Theoretical background

To create an overview of automatic summarization and visualization, a theoretical framework is discussed. First, visual representation as a summarizing method is discussed. After that, the scientific definition of important concepts in a text is discussed. Finally, the applied method is discussed.

2.1

Visual representation

Van Garderen [23] describes in his research several advantages of the use of visual representation. In the first place, visual representation can help students manage memory demands, in this case remembering a conversation. Furthermore, visual representation can motivate students, as creating and using a visual representation is experienced as more fun than using a textual representation.

2.1.1

Different methods of visual representation

Burlutskiy et al. [3] explored eight different ways to visualize a conversation. They concluded, based on user evaluation, that the Matrix Chart, the Network Diagram and the Block Histogram outperformed the alternative visualizations. The Matrix Chart and a particular form of Network Diagram are discussed below. The Block Histogram, a chart used to present frequencies in quantitative data, is not suitable for this research, as a Block Histogram does not capture the context in which keywords are used. Therefore, the Block Histogram is not considered further.

A matrix chart is a chart containing multiple rows and columns. The resulting cells contain a graphical representation of the data, which can vary in colour, size and shape [14]. The goal of a matrix chart is to find correlations in the data. In this research, these correlations consist of the most relevant topics in the conversation.

Next, a network diagram, or graph, is a data structure showing relations between entities. The relations are represented by the edges and the entities are represented by the nodes [9]. As with the graphical representation used in the matrix chart, the nodes may vary in colour, size, and shape. In this research, a particular form of network diagram is used to visualize conversations: a knowledge graph (KG). In the scientific community, there is no consensus about the precise definition of a knowledge graph. In their research, Ehrlinger and Wöß [6] examined multiple definitions and, based on these, proposed a new definition of a knowledge graph: "a knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge" [6]. This definition of a knowledge graph is visualized in Figure 2.1.

Figure 2.1: The architecture of a knowledge graph proposed in [6]

Ehrlinger and Wöß [6] stated that, going by this definition, a knowledge graph is in the first place an ontology. An ontology is a conceptualization characterized by the inclusion of semantic information. A knowledge graph is distinguished from an ontology based on two features: a knowledge graph is more extensive than an ontology and it contains additional features.

In addition to using a matrix chart or network diagram to visually represent conversations, it is possible to use colour coding. Colour coding adds an extra dimension and can be used to find a target in a chart in a smaller amount of time [10].

2.2

Relevant concepts

To create a visual representation of a conversation, it is necessary to extract relevant concepts in the form of keywords from the conversation. Saggion et al. [19] define keywords as significant words in a text; those keywords can be filtered using statistics, linguistic features or machine learning. The most relevant statistical and linguistic approaches described by Saggion et al. are pointed out below. Due to the time-consuming nature of machine learning approaches, these approaches are not considered further. A widely used statistical approach is Term Frequency (TF). Using this approach, the filtered keywords consist of the words with the highest term frequency after removing stop words. A disadvantage of this method is that synonyms and coreferential expressions are not taken into account.

The most prevalent linguistic approaches are filtering based on syntactic, semantic, and lexical features. The word syntax refers to the way words are arranged [11]. Syntax divides the sentence into a predicate, or root, and its arguments, the mandatory constituents of the sentence. When syntax is used to filter keywords, words can, for example, be selected because they are tagged as subject or object. Besides syntactic approaches, there are semantic approaches. Semantic analysis creates a relation between a word and knowledge of the world [11]. An application in keyword filtering is the addition of paradigmatic relations: words that can be substituted for each other. Finally, there is lexical semantics, which concerns the meaning of words. An application in keyword filtering is the addition of synonyms or hypernyms to the algorithm.

2.3

Method

In this subsection, the method is summarized to give a concrete picture of the process. In chapter Implementation & design, the method is discussed in more detail. In this research, a Python application has been developed which can be used to visualize conversations using knowledge graphs. This application utilizes the following external libraries: the Google Speech API, spaCy, Open Dutch Wordnet and Gensim. spaCy, Open Dutch Wordnet and Gensim are briefly discussed below.

spaCy is a widely used open-source library for natural language processing (NLP). spaCy can be used in combination with Python to build information extraction or natural language understanding systems, or to pre-process text for deep learning [22]. In this research, spaCy is used to pre-process text before structuring the knowledge graph. Most research uses the Natural Language Toolkit (NLTK) for this purpose because NLTK is specially designed for teaching and research [22]; it therefore offers multiple methods with different performance characteristics for equivalent functionality. However, in this research spaCy is chosen because it processes text faster, it has more features, and it outperforms the default algorithms of NLTK for processing Dutch, the target language of this research.

Open Dutch Wordnet is an open-source Dutch lexical-semantic database based on the Cornetto (Combinatoric and Relational Network for Language Technology) database. Open Dutch Wordnet contains more than a hundred thousand synsets, sets of words which are interchangeable in a context [17]. Finally, Gensim (Generate Similar) is a Python library for topic modelling, document indexing and similarity retrieval with large corpora [18].


To achieve the goal of this research, the process is subdivided into four subprocesses. A visualisation of the process is shown in Figure 2.2.

Figure 2.2: Flow chart of the method used in this research

First, the conversation is transcribed using the Google Speech API. Subse-quently, the relevant concepts are filtered from the obtained transcript using spaCy. After that, the relevant concepts are structured using two different charts. In the first chart, the relevant concepts are structured over time. An example of this chart is shown in Figure 2.3.


In the second chart, the concepts are structured based on topic. This means that the words are structured either based on word similarity or based on entity similarity. In the ideal knowledge graph, syntactic information is added. The chart aimed for is shown in Figure 2.4.

Figure 2.4: Relevant concepts structured based on topic

After visualizing the relevant concepts in both ways, there is the option to change the chart manually. This interactivity is important in the first place because the user can correct small errors and optimize the knowledge graph such that the user can remember the conversation to a greater degree. Likewise, the program should be able to learn from this manual correction such that it does not make the same mistake twice.


Chapter 3

Implementation & design

In this section, the implementation and the design of the application are discussed, in the same order as described in section Method. First, the implementation of the speech recognition is discussed. Subsequently, the implementation of the time-based graph is discussed. After that, the implementation of the topic-based graph is discussed. Lastly, the implementation of the user interaction is discussed.

3.1

Speech recognition

The first subprocess is transcribing the conversation. To perform speech recognition in this research, Google's cloud Speech-to-Text is used. The Google Speech API is able to transcribe audio from recordings or from a microphone. In this research, the conversation is transcribed from a recording, because the free version of the API is only capable of transcribing the first utterance when transcribing from a microphone. The recordings are made using the libraries PyAudio and Wave [16, 7]. Although the Google Speech API is free and easy to use, there are two disadvantages. At the moment of writing, the Google Speech API is capable of speaker diarization and automatic punctuation in English only. Speaker diarization, recognizing multiple speakers in an audio file, could be advantageous in this research because it would make it possible to add speaker information to the final chart. Automatic punctuation is important because part-of-speech tagging works better when punctuation is provided. Unfortunately, the Google Speech API is not capable of speaker diarization or automatic punctuation in Dutch, so these features were not used in this research.

When the transcript from the speech recognition is obtained, relevant concepts are filtered out. The method to filter the relevant concepts depends on the final chart. So first the filtering and the structuring of the time-based chart are discussed. After that, the filtering and the structuring of the topic-based knowledge graph are discussed.

3.2

Time-based chart

To filter the relevant concepts for the time-based chart, TF is used with certain preprocessing steps. The first preprocessing step is to remove stop words. Stop words are the most frequently used words in a language, and most of them have no meaning without context. Therefore, they are removed before applying statistical methods such as TF, such that only meaningful words are selected.

The second preprocessing step is to obtain the lemmas of the words in the text. Lemmatisation is the task of grouping together word forms that belong to the same inflectional morphological paradigm and assigning to each paradigm its corresponding canonical form called lemma [8]. Using the lemma of a word, inflected forms of a word are treated as one word instead of multiple words.

The third preprocessing step is to obtain the nouns and the verbs. These are obtained using part-of-speech tagging (POS-tagging). POS-tagging algorithms parse a sentence and use statistical methods to determine the most likely tag or label of the words in the sentence. Only the nouns and the verbs are filtered because those are most likely to be relevant concepts.

The fourth preprocessing step is to obtain the synonyms of a word. In this way, synonyms are treated as one word instead of multiple words. Performing this step overcomes a disadvantage of the use of TF: as mentioned in chapter Relevant concepts, term frequency does not take synonyms into account. To take synonyms into account, the synsets of Open Dutch Wordnet are used. After this preprocessing step, the most used nouns and verbs are selected and structured in the chart.
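The four preprocessing steps can be sketched as follows. This is a minimal illustration with hand-rolled stand-in tables for the stop words, lemmas, POS tags and synonym sets; the actual application obtains these from spaCy's Dutch pipeline and the synsets of Open Dutch Wordnet.

```python
from collections import Counter

# Stand-in resources for this sketch; the real application uses spaCy
# (stop words, lemmas, POS tags) and Open Dutch Wordnet (synsets).
STOP_WORDS = {"de", "het", "een", "en", "is"}
LEMMAS = {"katten": "kat", "sprak": "spreken", "spraken": "spreken"}
POS = {"kat": "NOUN", "spreken": "VERB", "poes": "NOUN", "muis": "NOUN"}
SYNONYMS = {"poes": "kat"}  # map each synonym to one canonical form

def extract_keywords(tokens, top_n=3):
    """Apply the four preprocessing steps, then rank by term frequency."""
    counts = Counter()
    for token in tokens:
        token = token.lower()
        if token in STOP_WORDS:                      # step 1: remove stop words
            continue
        lemma = LEMMAS.get(token, token)             # step 2: lemmatise
        if POS.get(lemma) not in ("NOUN", "VERB"):   # step 3: keep nouns/verbs
            continue
        lemma = SYNONYMS.get(lemma, lemma)           # step 4: merge synonyms
        counts[lemma] += 1
    return counts.most_common(top_n)

tokens = "de kat en de poes spraken met een muis".split()
print(extract_keywords(tokens))
```

Note how poes and kat are counted as one concept, which is exactly the TF weakness the fourth step is meant to repair.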

As mentioned in section Method, the time-based chart shows the most relevant concepts over time. When a concept is used frequently, it is displayed in a bigger font size and it preserves its colour, such that the concept draws more attention. The time-based chart is able to visualize one or multiple speakers. When there are multiple speakers, colour coding is used to distinguish the speakers.

3.3

Topic-based knowledge graph

The filtering for the topic-based knowledge graph depends on the final way of structuring. In this research, two different methods of structuring based on topic are explored: structuring based on paradigmatic relations and structuring based on named entities. The two options are discussed below.

First, the knowledge graph based on paradigmatic relations is discussed. Paradigmatic relations hold between words which can occur in the same context but do not have the same meaning (unlike synonyms) [15]. To filter the relevant concepts for this knowledge graph, preprocessing is similar to the filtering for the time-based chart. However, there are two additional preprocessing steps. After the filtering of stop words, the lemmatisation and the POS-tagging, the paradigmatic relations of a word are obtained. These paradigmatic relations contribute to calculating the correct word similarity. The word similarity is calculated using Gensim. When the similarity score of two words is significant, the words are placed in the same branch of the knowledge graph.

For the second knowledge graph, named entities are obtained in addition to the filtering of stop words, the lemmatisation and the POS-tagging. Named entities are real-world objects such as persons, companies or countries. The knowledge graph is structured around those entities.
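The similarity-based branch assignment can be sketched as follows. This is a toy illustration: the three-dimensional vectors are invented for the example, whereas the application obtains real word vectors and similarity scores through Gensim; the 0.85 threshold is likewise an assumption standing in for "significant".

```python
from math import sqrt

# Invented toy vectors; the application derives real ones with Gensim.
VECTORS = {
    "krant":   (0.9, 0.1, 0.0),
    "dagblad": (0.8, 0.2, 0.1),
    "keizer":  (0.0, 0.9, 0.4),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def group_by_similarity(words, threshold=0.85):
    """Place a word in an existing branch when its similarity to the
    branch's first word is significant; otherwise start a new branch."""
    branches = []
    for word in words:
        for branch in branches:
            if cosine(VECTORS[word], VECTORS[branch[0]]) >= threshold:
                branch.append(word)
                break
        else:
            branches.append([word])
    return branches

print(group_by_similarity(["krant", "dagblad", "keizer"]))
```

Here krant and dagblad end up in one branch because their vectors point in nearly the same direction, while keizer starts a branch of its own.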


Chapter 4

Results

In this section, different possible charts are presented. The charts are based on the following five podcasts: Klassieke mysteries, De Willem Podcast, Van 2 kanten, a podcast of Pakhuis de Zwijger and De geschiedenis van het Romeinse Rijk. These podcasts are briefly summarized to help interpret the visualizations. After that, the presented charts are briefly explained. The charts are further discussed in chapter Evaluation.

This episode of Klassieke mysteries concerns the life and death of the composer Mozart. De Willem Podcast concerns the life and trial of the Dutch criminal Willem Holleeder. This episode of Van 2 kanten concerns two parents with a handicapped child. The podcast of Pakhuis de Zwijger concerns a conversation with Jim Jansen about the current state of science. Finally, the podcast De geschiedenis van het Romeinse Rijk concerns the history of the Roman empire. These podcasts have been chosen to test the application with different topics of conversation.

In Figure 4.1, the time-based chart built on De Willem Podcast is shown. In Figure 4.2, the same chart as in 4.1 is shown, except for the filtering used: in this chart, only the nouns are filtered out. In Figure 4.3, the time-based chart built on the podcast Klassieke mysteries is shown. In this chart, the relevant concepts are extracted using algorithms for dependency parsing. In Figure 4.4, the time-based chart built on the podcast Van 2 kanten is shown. In this chart, the same method to extract relevant concepts is used as in Figure 4.1. In Figure 4.5, the time-based chart built on the podcast Klassieke mysteries is shown. In this chart, only the most prevalent relevant concepts are shown, instead of both the most prevalent and the less prevalent. In Figure 4.6, the knowledge graph based on the podcast of Pakhuis de Zwijger is shown. This chart is structured using paradigmatic relations. In Figure 4.7, the knowledge graph based on De geschiedenis van het Romeinse Rijk is shown. This graph is structured using named entities.


Figure 4.2: Time-based chart built on De Willem Podcast with only the nouns filtered out

Figure 4.3: Time-based chart built on Klassieke mysteries with the use of algorithms for dependency parsing


Figure 4.4: Time-based chart based on the podcast Van 2 kanten

Figure 4.5: Time-based chart built on De Willem Podcast with only the most frequently used concepts filtered out


Figure 4.6: Knowledge graph structured by paradigmatic relations and clarified with highlighted named entities


Chapter 5

Evaluation

In this section, the presented results are evaluated. They are discussed in the same order as presented in chapter Results. First of all, the evaluation of the time-based chart is discussed. The time-based chart is evaluated by running the application multiple times with different conversations. Subsequently, the evaluation of the topic-based knowledge graph is discussed. The topic-based knowledge graph did not perform adequately from the outset, so these charts are not evaluated in the same way as the time-based chart. Instead, the topic-based knowledge graph is evaluated on sight, and based on these observations a proof-of-concept is proposed.

5.1

Time-based chart

In Figures 4.1 to 4.5, the results of the evaluation of the process using the time-based chart are shown. The results are evaluated against the sub-questions: first in light of filtering relevant concepts from a continuous flow of text, and after that in light of structuring these relevant concepts. As mentioned above, the process is evaluated by testing the application with different conversations. The following five podcasts have been chosen: Haagse zaken, Klassieke mysteries, In het Rijks, De Willem Podcast, and Van 2 kanten. In Figure 4.1, the most important words are clearly shown in a larger font size and in green. There is, however, a significant number of stop words present in both images. For example, gaan ('to go'), zeggen ('to say') and doen ('to do') are three of the largest words in the picture. These words do not contribute to the understanding of the conversation and should have been removed in the process. They were not removed because they are not included in the stop word list of the external library spaCy. This problem could be solved by using a different method of extracting relevant concepts. In Figure 4.2, the result is shown of extracting solely the nouns instead of the nouns and the verbs. After applying this method, the chart contains a significantly lower number of stop words. However, the chart lacks numerous meaningful verbs like aanpakken ('to deal with') and afpersen ('to blackmail'), so this method of extracting relevant concepts is nevertheless undesirable.

To create a chart without stop words while preserving meaningful verbs, a different method of extracting relevant concepts is explored. In Figure 4.3, the chart built on the podcast Klassieke mysteries is shown, obtained with the use of algorithms for dependency parsing. In these charts, solely the predicate, or root, and the arguments of the predicate are extracted, as these parts of the sentence contain nearly all meaningful information. However, as shown in Figure 4.3, the chart of Klassieke mysteries is nearly blank. A number of charts, built on different podcasts, stayed completely blank when algorithms for dependency parsing were used. These blank charts can be explained by the performance of the speech recognition software in combination with the algorithms for dependency parsing. The accuracy of Google's speech recognition software for the Dutch language is approximately 45%, according to the research of Koster [12]. Hence, the majority of the transcribed sentences is not considered to be correct Dutch. However, the algorithms for dependency parsing, trained on the Lassy Universal Dependencies corpus [22], are only able to parse correct Dutch sentences. So, the combination of the poor performance of the speech recognition software and the algorithms for dependency parsing leads to nearly blank charts. In conclusion, this method appears to be promising, but due to the accuracy of the speech recognition software, it does not function properly. Consequently, filtering the verbs and the nouns represents the conversation in the best way possible, even though the charts contain a significant number of stop words.

After discussing the method of filtering relevant concepts, the structuring is discussed. In the time-based chart, the relevant concepts are structured over time. Therefore, the relation between the relevant concepts remains intact, because the words stay in their context. A disadvantage of the chosen structuring is the fact that relevant concepts are shown at the last time of usage. Therefore, the chart is huddled on the right side. This phenomenon is illustrated in Figure 4.4. A solution for this problem could be to remove all grey words so as to preserve only the prevalent concepts. The chart built on this solution is visible in Figure 4.5. The chart becomes less crowded, but loses the relation between the relevant concepts. In conclusion, the time-based chart does represent the conversation correctly, but the right side of the chart is huddled, which makes the chart slightly complicated.


5.2

Knowledge graph based on paradigmatic relations

In Figures 4.6 and 4.7, the final topic-based knowledge graphs are shown. As with the evaluation of the time-based chart, the results are first evaluated in light of filtering relevant concepts from a continuous flow of text and after that in light of structuring these relevant concepts.

As mentioned in the beginning of this section, the knowledge graph based on paradigmatic relations did not perform adequately from the outset, so the knowledge graph is evaluated on sight. The method of filtering is the same as the method used for the time-based chart. One disadvantage of this filtering, in the case of the time-based chart, was that the charts contained a significant number of stop words. However, this disadvantage is less bothersome in this knowledge graph than in the time-based chart, because the stop words are grouped through the structuring based on paradigmatic relations. In this knowledge graph, it is possible to ignore the stop words because they are grouped in the same branch. For example, in Figure 4.6 stop words like kennen ('to know'), snappen ('to get') and doen ('to do') are grouped in the same branch. In conclusion, the extraction of relevant concepts is not optimal, but less problematic than in the case of the time-based chart.

After discussing the method of filtering, the structuring is discussed. In Figure 4.6, it is visible that words with the same paradigmatic relations are grouped. For example, parool ('newspaper of Amsterdam'), dagblad ('daily newspaper'), krant ('newspaper'), hoofdredacteur ('editor-in-chief'), rubriek ('column') and katern ('signature') are grouped. However, it does not become clear from the knowledge graph why these terms are mentioned in the conversation. So, the context vanishes, such that the chart does not help attendees of the conversation remember the conversation to a greater degree. A possible method to uphold the context is to highlight the named entities. In Figure 4.6, the chart based on this possible method is shown. Unfortunately, the named entity recognition tagger (NER-tagger) of the external library spaCy does not perform accurately in combination with the speech recognition software. Therefore, solely week ('week') is recognized as an entity. So neither structuring on paradigmatic relations nor structuring on named entities performs adequately. However, structuring based on named entities has potential when the NER-tagging works adequately. Therefore, a proof-of-concept is developed.

5.3

Proof-of-concept

Based on the evaluation of the time-based chart, the most advantageous chart should contain syntactic information. In addition, based on the evaluation of the topic-based knowledge graph, the most advantageous chart should be structured by named entities. Hence, the most advantageous chart should be structured by named entities, and the relations between the entities should be formed with syntactic relations. This chart is shown in Figure 2.4.

As both the NER-tagger and the algorithms for dependency parsing were not working adequately, a knowledge base was manually created to be able to build the knowledge graph. Due to the time-consuming nature of building a knowledge base, this knowledge base was created solely for the first minutes of the podcast De geschiedenis van het Romeinse Rijk. The knowledge base contains, for every word, the entity and a topic signature: words that are topically related [1]. The topic signatures are added to capture the context of the conversation more completely. The graph built on this knowledge base is visible in Figure 4.7. As noticeable in the figure, entities are grouped, and one word of the topic signature (if present) is shown in brackets. Using the knowledge base, the context in which the concepts are used becomes more visible than in Figure 4.6. However, despite the use of the manually created knowledge base, the graph in Figure 4.7 does not contain syntactic information. During the process, it became clear that adding syntactic information was not possible without completely hard-coding the syntactic relations. So, as stated in section Time-based chart, it is not possible to add syntactic information, due to the performance of the speech recognition in combination with the algorithms for dependency parsing. In conclusion, creating a knowledge graph based on named entities and topic signatures is an improvement. It was, however, not possible to add syntactic information to the knowledge graph.
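The structure of such a knowledge base and the entity-based grouping can be sketched as follows. The entries shown are invented stand-ins: the actual knowledge base was built by hand for the first minutes of the podcast, and its real contents are not reproduced here.

```python
# Invented stand-in entries; the thesis' knowledge base was built by hand
# for the first minutes of De geschiedenis van het Romeinse Rijk.
KNOWLEDGE_BASE = {
    # word: (entity type, topic signature)
    "Rome":    ("PLACE",  ["stad"]),
    "Tiber":   ("PLACE",  ["rivier"]),
    "Romulus": ("PERSON", ["stichter"]),
    "Remus":   ("PERSON", []),
}

def group_by_entity(words):
    """Group words by entity type; attach one topic-signature word in brackets."""
    groups = {}
    for word in words:
        if word not in KNOWLEDGE_BASE:
            continue  # words outside the knowledge base are not placed
        entity, signature = KNOWLEDGE_BASE[word]
        label = f"{word} ({signature[0]})" if signature else word
        groups.setdefault(entity, []).append(label)
    return groups

print(group_by_entity(["Rome", "Romulus", "Remus", "wolf"]))
```

This mirrors the figure: entities are grouped by type, and one topic-signature word (if present) appears in brackets to preserve context.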


Chapter 6

Discussion

In this section, the general process is discussed. First of all, the possibility of interaction with the user is discussed. After that, the general performance of the process is discussed. Subsequently, the social relevance of this research is discussed. Lastly, future work is discussed.

In the charts, it is possible to remove, move and add relevant concepts. Thus, the user is able to eliminate errors and optimize the chart to remember the conversation to a greater degree. For example, it is possible to remove all remaining stop words from the chart to obtain a clearer version of it. In this research, the interactivity is solely used to optimize the final chart and to improve the experience of the user. However, the whole process could benefit from interactivity with the user: the computer could adjust the filtering or structuring when the user corrects the chart. For example, when the user removes the word gaan ('to go') in the final chart, the computer would no longer extract gaan as a relevant concept in further conversations. In conclusion, only the final chart can currently be adjusted through interactivity with the user; it would be beneficial to adjust the filtering and the structuring as well.
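The envisioned feedback loop can be sketched as follows. This is a minimal illustration under the assumption that user deletions are recorded in a persistent blacklist consulted during later extraction runs; the names `record_removal` and `filter_concepts` are invented for the example.

```python
# A blacklist of concepts the user has removed from earlier charts.
# In practice this would be persisted between sessions.
user_blacklist = set()

def record_removal(word):
    """Called when the user deletes a concept from the final chart."""
    user_blacklist.add(word)

def filter_concepts(concepts):
    """Drop concepts the user removed in earlier conversations."""
    return [w for w in concepts if w not in user_blacklist]

record_removal("gaan")  # the user deletes gaan ('to go') from a chart
print(filter_concepts(["gaan", "aanpakken", "afpersen"]))
```

With this loop in place, the correction made once by the user would carry over to every subsequent conversation instead of adjusting only the final chart.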

Even though the application performs correctly, major improvements are possible. First of all, the computer is not able to transcribe the conversation at the same time as processing it. Therefore, the transcript does not contain the whole conversation. To avoid these inconsistencies in the transcript, the program only processes the conversation once it is finished, instead of processing during the conversation. A second improvement concerns the Google Speech API, which is not able to process large audio files (longer than one minute). This problem was also encountered in similar research by Koster, and the solution proposed there, separating audio files into smaller chunks, is used in this research [12]. Both improvements could be implemented using more advanced software for speech recognition.
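The chunking workaround can be sketched as follows. This helper only computes the chunk boundaries (in seconds); actually cutting the audio and submitting each chunk to the API is left out, and the 55-second default is an assumption chosen to stay safely under the one-minute limit.

```python
def chunk_spans(total_seconds, chunk_seconds=55):
    """Split a recording's duration into spans below the API's length limit."""
    spans = []
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans

# A 2-minute recording becomes three sub-minute chunks.
print(chunk_spans(120))
```

Each span can then be cut out of the recording (for example with the Wave library the thesis already uses) and transcribed separately, and the partial transcripts concatenated.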

If these improvements can be implemented, and the most advantageous chart, described in section Proof-of-concept, can be obtained without a manually created knowledge base, then this application could be advantageous in education. Research suggests that visualization of discussions stimulates collaborative processes and improves critical thinking in students [4]. As such, the improved version of this application could aid students by creating graphical representations of conversations and discussions. Moreover, this application could be specifically helpful in online teaching, because online teaching is considered to be more time-consuming than teaching in a classroom [20]. So, to make teaching less time-consuming and more effective for students, this application could visualize the discussion or conversation.

To achieve this social relevance, further research is needed. First, the most advantageous chart should be obtained without a manually created knowledge base. To this end, further research should add syntactic information and use a knowledge base with named entities and topic signatures to create the topic-based knowledge graph. Lastly, interaction with the user should be able to adjust the filtering and the structuring, so that the application keeps improving.
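As an illustration of the proposed named-entity-based structuring, the sketch below groups the entities found in a transcript fragment by entity label, so that each label could serve as a topic node in the graph. The grouping itself is plain Python; the usage in the comments assumes spaCy [22] with the Dutch model nl_core_news_sm, whose entity quality for Dutch is, as argued above, still a limiting factor:

```python
def entities_by_label(doc):
    """Group the named entities of a processed transcript by entity label."""
    grouped = {}
    for ent in doc.ents:
        grouped.setdefault(ent.label_, []).append(ent.text)
    return grouped

# Example usage, assuming spaCy and the Dutch model are installed
# (python -m spacy download nl_core_news_sm):
#   import spacy
#   nlp = spacy.load("nl_core_news_sm")
#   entities_by_label(nlp("Sabijn studeert aan de Universiteit van Amsterdam."))
```

The resulting dictionary maps each entity label to the concepts that could be attached to it in the topic-based knowledge graph.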


Chapter 7

Conclusion

In this research, a start has been made on an application to visualize conversations. This application is the product of the research question: how can conversations be visualized using graphs? The subprocesses of the application were developed by answering the following sub-questions. First, how can relevant concepts be extracted from a continuous flow of text? Second, how can the filtered keywords be structured such that the attendees are able to recall the conversation to a greater degree? Third, to what extent can the knowledge graph be interactive? The most effective method to extract relevant concepts from a continuous flow of text is considered to be filtering the nouns and the verbs, although numerous stop words were not removed. Two methods are considered the most effective for structuring the filtered concepts: the relevant concepts can be structured over time, or they can be structured on the basis of named entities. The latter method can be used on the condition that an advanced NER-tagger and a knowledge base for topic signatures are developed for the Dutch language. Syntactic relations also appear to be a preferable addition during named-entity-based structuring.
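The noun-and-verb filtering named here as the most effective extraction method can be expressed compactly with spaCy part-of-speech tags. The is_stop check is a hypothetical refinement addressing the stop words that were not removed, not part of the evaluated method, and the usage in the comments assumes the Dutch spaCy model is installed:

```python
def filter_nouns_and_verbs(doc):
    """Keep the lemmas of nouns and verbs, dropping recognised stop words."""
    return [tok.lemma_ for tok in doc
            if tok.pos_ in ("NOUN", "VERB") and not tok.is_stop]

# Example usage, assuming the Dutch model is installed
# (python -m spacy download nl_core_news_sm):
#   import spacy
#   nlp = spacy.load("nl_core_news_sm")
#   filter_nouns_and_verbs(nlp("De studenten bespreken het gesprek."))
```

Because the function only inspects each token's part-of-speech tag, lemma and stop-word flag, it works on any token sequence that exposes those attributes.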

Answering the final sub-question: the user is able to add, remove and move relevant concepts in the final charts. These corrections solely adjust the final graph, to improve the user experience.

In conclusion, the presented research has shown that, despite the numerous remaining difficulties, visualizing conversations using knowledge graphs is a promising concept.


Bibliography

[1] E. Agirre and P. Edmonds. Word sense disambiguation: Algorithms and applications, volume 33. Springer Science & Business Media, 2007.

[2] A. Alamsyah, M. Paryasto, F. J. Putra, and R. Himmawan. Network text analysis to summarize online conversations for marketing intelligence efforts in telecommunication industry. In 2016 4th International Conference on Information and Communication Technology (ICoICT), pages 1–5. IEEE, 2016.

[3] N. Burlutskiy, M. Petridis, A. Fish, and N. Ali. How to visualise a conversation: case-based reasoning approach. In 19th UK CBR Workshop, 2014.

[4] A. DeNoyelles and B. Reyes-Foster. Using word clouds in online discussions to support critical thinking and engagement. Online Learning, 19(4), 2015.

[5] J. Donath, K. Karahalios, and F. Viegas. Visualizing conversation. Journal of computer-mediated communication, 4(4):JCMC442, 1999.

[6] L. Ehrlinger and W. Wöß. Towards a definition of knowledge graphs. SEMANTiCS (Posters, Demos, SuCCESS), 48, 2016.

[7] Python Software Foundation. wave, 2001.

[8] A. Gesmundo and T. Samardžić. Lemmatisation as a tagging task. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 368–372. Association for Computational Linguistics, 2012.

[9] T. Grainger, K. AlJadda, M. Korayem, and A. Smith. The semantic knowledge graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 420–429. IEEE, 2016.


[10] B. F. Green and L. K. Anderson. Color coding in a visual search task. Journal of experimental psychology, 51(1):19, 1956.

[11] D. Jurafsky. Speech & language processing. Pearson Education India, 2000.

[12] T. Koster. Dynamic word cloud visualization of real-time verbal discussions. June 2017.

[13] Z. Kulpa. Diagrammatic representation and reasoning. Machine GRAPHICS & VISION, 3(1/2). Citeseer, 1994.

[14] S. Marsh. The interactive matrix chart. ACM SIGCHI Bulletin, 24(4):32–38, 1992.

[15] M. L. Murphy. Semantic relations and the lexicon: Antonymy, synonymy and other paradigms. Cambridge University Press, 2003.

[16] H. Pham. PyAudio project description, 2006.

[17] M. Postma, E. van Miltenburg, R. Segers, A. Schoen, and P. Vossen. Open Dutch WordNet. In Proceedings of the Eighth Global WordNet Conference, Bucharest, Romania, 2016.

[18] R. Řehůřek and P. Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.

[19] H. Saggion and T. Poibeau. Automatic text summarization: Past, present and future. In Multi-source, multilingual information extraction and summarization, pages 3–21. Springer, 2013.

[20] D. Sheridan and S. Witherden. Visualising and inferring LMS discussions. ICT: Providing choices for learners and learning, proceedings ASCILITE 2007, pages 1134–1139, 2007.

[21] SLO. Informatievaardigheden, 2020.

[22] spaCy. spaCy industrial-strength natural language processing, 2020.

[23] D. Van Garderen. Teaching visual representation for mathematics problem solving. Teaching mathematics to middle school students with learning difficulties, 2:72–88, 2006.
