
Tag cloud visualisation of verbal discussions following speech-to-text

Romy Visser
10363106

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor: Dhr. dr. B. Bredeweg

June 26th, 2015


Abstract

Automatic Speech Recognition, Natural Language Processing, and tag cloud visualisation are often used independently, but have not often been combined. To possibly enhance verbal discussions, meetings, or brainstorm sessions in educational settings or workplace environments, this research aimed at combining the aforementioned techniques to capture the content of verbal discussions in tag clouds. By using the CMU Pocketsphinx speech recogniser with American English language and acoustic models, an average word accuracy of approximately 60% and 52% was obtained for native and non-native American speakers respectively. To acquire the content of the discussions, four NLP techniques were used: stopword filtering, singularisation, stemming, and the filtering of words which are not recognised as nouns according to the WordNet noun database. Lastly, the filtered text was visualised in a tag cloud with a circular layout, displaying the words from the centre to the periphery of the cloud, ordered from most frequently mentioned to least frequently mentioned. This combination of tools and techniques has provided a proof of concept for the visualisation of the content of verbal discussions. For future research, it would be interesting to investigate the application possibilities of such a product, see how well it performs on specific tasks or with specific target groups, and, based on those performance measurements, determine how the system can be adapted and improved.


Contents

1 Introduction

2 Literature Review
  2.1 PrimaVista - Meeting Support System
  2.2 Online Written Discussions
  2.3 Visualisation Tools
  2.4 Text Filtering

3 Method and Approach
  3.1 Automatic Speech Recognition
    3.1.1 CMU Pocketsphinx
    3.1.2 Audio Recordings
  3.2 Text Filtering
    3.2.1 Stopword Filtering
    3.2.2 Singularisation
    3.2.3 Stemming
    3.2.4 WordNet
  3.3 Visualisation
    3.3.1 Original tag clouds
    3.3.2 Wordle tag clouds
  3.4 Visualisation per Speaker

4 Evaluation
  4.1 Technical Evaluation
    4.1.1 Automatic Speech Recognition
    4.1.2 Text Filtering
  4.2 User Evaluation
  4.3 Conclusions

5 Conclusion

6 Discussion and Future Work


Acknowledgement

By means of this acknowledgement I wish to express my sincere gratitude to my supervisor Bert Bredeweg, for providing me with guidance, assistance, and encouragement throughout the project.

Furthermore, I would like to thank Jasper Moes, the faculty’s audiovisual technician, for thinking along in finding the best solution for the recording of discussions and providing me with the needed equipment.

I would also like to express my gratitude to the three discussion participants: Janosch Haber, Jeroen Rijnbout, and Lotte Weerts, for taking time out of their busy schedules to conduct discussions.

Lastly, I would like to sincerely thank my family and friends for their continuous support, encouragement, and belief in me, for which I am forever grateful.

1 Introduction

Within educational settings or workplace environments, verbal discussions, meetings, or brainstorm sessions are often held. While these verbal activities are conducted, it is not uncommon to forget what has just been said. However, due to the verbal nature of the discussion, it is not possible to go back a couple of seconds in time to re-listen to what has been said. Spoken words disappear if they are not remembered by the participants. It could therefore be useful to have a supporting tool which visualises the discussion, the meeting, or the brainstorm session. With such a tool, it would be possible to look at what has been said previously during the discussion. Moreover, having such a tool available might enhance the experience of conducting a discussion by providing the participants with a handle on the development of the discussion. While conversing, one might not notice that the conversation is heading in a different direction than was initially intended. With the visualisation tool, this change of direction can perhaps be spotted more easily, and the conversation can in that way be steered (back) in the preferred direction. Furthermore, by recording the discussion, meeting, or brainstorm session with the visualisation tool, one has a source to refer back to once the discussion is over. It is not unthinkable that one might forget what has been said during a meeting, and consequently, a tool which provides the possibility to quickly search an overview of the content of the meeting could be useful.

For such a tool, Automatic Speech Recognition (ASR) software, Natural Language Processing (NLP) tools, and visualisation tools are needed. Ideally, by applying these tools the discussions can be transcribed, the transcriptions can be filtered to obtain the content of the discussion, and the content can then be visualised in such a way that the user will quickly grasp what the discussion is about. Creating a discussion support tool that works in real time would be the ideal objective. However, it is questionable whether the state-of-the-art software and tools are capable of doing so. Nonetheless, it would still be interesting to investigate to what extent this is possible. A discussion support tool which displays the content of discussions per speaker might also be preferred, to be able to, for instance, refer specifically to a point one of the participants made. Overall, the addition of a discussion support tool to discussions, meetings, and brainstorm sessions in educational and workplace environments might enhance the discussions and would provide participants with a resource they can refer back to if memory is lacking.

There are several possibilities for visualising content, for example graphs (McCormick, 2013), concept maps (Eppler, 2006), or tag clouds (Heimonen et al., 2010); see Figure 1 for an example of each. A graph is defined by the online Merriam-Webster encyclopaedia as: “a diagram (as a series of one or more points, lines, line segments, curves, or areas) that represents the variation of a variable in comparison with that of one or more other variables.”1 For the visualisation of discussions, this could mean, for example, that a graph is used to connect words which occur after each other. Concept maps are generally used to represent knowledge by displaying concepts and their relations with each other (Eppler, 2006). A tag cloud, available in several different layouts, is a representation of, usually, single words which differ in size or colour to indicate their importance. It is common to display the most important or most frequent words in larger fonts than the less important or less frequent words, and to assign a colour to each layer of importance or number of occurrences. Eventually, the type of visualisation one chooses depends on the task at hand (Lohmann et al., 2009). A concept map displaying the conceptual relations might be the best option when trying to understand how the subjects discussed are associated with each other. For a quick, yet informative, glance at the main subjects of a discussion, a tag cloud would be worth considering as it draws attention to the important words through colour, size, and layout, without obscuring them with relations, lines, or nodes.

(a) Graph displaying relations between participants in a learning environment (McCormick, 2013)

(b) Concept map displaying the construction of a sentence (Eppler, 2006)

(c) Tag cloud displaying the most commonly used words with regard to the word ‘University’ (http://net.educause.edu)

Figure 1: Three different visualisation tools

Taking the aforementioned into consideration, it is interesting to investigate the construction of a system that can capture and visualise the essence of verbal discussions, preferably per participant. This leads to the following research questions:

1. Can the content of verbal discussions be visualised?
2. How can the content of verbal discussions be visualised?
3. Can the content of verbal discussions be visualised per speaker?
4. How can the content of verbal discussions be visualised per speaker?

These questions are answered in the following sections, where firstly the project is placed in a framework of preceding research. Secondly, the method and approach are elaborated on by explaining the steps taken to achieve automatic speech recognition, filtering of the transcriptions, and visualisation by means of tag clouds. Thirdly, the evaluation of the system is presented. Subsequently, conclusions are drawn, and lastly, the project is discussed and possible future work is mentioned.

2 Literature Review

The aim of the project is to visualise the content of discussions per speaker in semi-real time. A number of studies have been conducted on the visualisation of discussions. One of the differences between this project and preceding research is the shift in focus regarding the data format. The current project focusses on visualising audio from verbal discussions, whereas preceding research has mostly focussed on written discussions, for example from online discussion fora or email correspondence (Yee and Hearst, 2005; Sheridan and Witherden, 2007; Pascual-Cid and Kaltenbrunner, 2009). This section discusses a number of the studies relevant for this project and their similarities and differences in comparison to the current project. Firstly, a study seemingly closely related to the current project is discussed. Secondly, several studies aimed at visualising online written discussions are discussed. Lastly, a number of studies focussed on different forms of visualisation are discussed.

2.1 PrimaVista - Meeting Support System

A research project seemingly closely related to the current project has been conducted by Heimonen et al. (2010) on the visualisation of multi-sensory meeting information. The aim of their research was to provide all members of their research department, who often participate in meetings, with a tool that gives meeting participants an increased awareness of meeting activity. The tool they aimed to develop is called PrimaVista2, which visualises the content of the meeting on an interactive client web application and on a non-interactive visualisation interface presented on a high-resolution display available in the meeting room. A map of the meeting room displays tag clouds at different positions on the map, indicating where what has been said during the meeting. PrimaVista’s interface also shows how many keywords have been said per minute to indicate the meeting’s activity. By doing so, they aim to visualise a timeline of the meeting. This thesis’ project, too, aims for a timeline visualisation of the content of discussions. However, this project aims to represent the course of the content of the discussion in a timeline, whereas Heimonen et al.’s (2010) timeline displays the changes in activity during the meeting, not the content.

Both Heimonen et al.’s (2010) and the current project apply automatic speech recognition (ASR) to audio input from the meetings and discussions, which results in transcriptions, which, in their turn, are visualised in tag clouds. The paper’s results indicate that PrimaVista achieved a semi-real-time program which experienced a latency of one or two minutes while processing the audio input (Heimonen et al., 2010); their ASR component was optimised for accuracy in lieu of speed and it was set to process the most recent audio first. Due to this, the tag cloud visualisations were sometimes delayed and some of the audio was only processed afterwards, which consequently means that not all meeting information was represented in the tag clouds during the meeting itself. Furthermore, since not all data was processed sequentially, it may be that a correct timeline of the meeting cannot be accumulated. This project, too, will produce a program operating in semi-real time. However, the main difference with Heimonen et al.’s (2010) research is the shift in priority of which audio to process. Within this project, every thirty seconds of audio will be buffered, processed, and visualised, followed by the consecutive thirty seconds of audio. Processing and visualisation are time-consuming and are thus also expected to create a latency in this project. However, the latency of this approach is expected to be less than the one or two minutes experienced with PrimaVista. Moreover, it will most likely prevent audio segments from being processed only after the discussion has ended.

2 The software, PrimaVista, as described in their article cannot be bought or downloaded online. The site (http://www.primavista.com/en.php), which provides information about the software suite PRIMAVISTA, a tool for active business management, does not mention the possibility of visualising meetings through speech recognition. This could potentially mean that the speech recognition and visualisation, as described in their paper, were subsequently not useful enough or had no considerable positive impact, which could have caused the exclusion of their product in PRIMAVISTA.

Concluding, the current project of visualising verbal discussions in semi-real time seems to be closely related to the research conducted by Heimonen et al. (2010). However, the current project will process the audio differently and sequentially, and is thereby expected to have a shorter latency, creating a product that is closer to real time. Also, the course of the discussion will not be shown by a timeline indicating the number of words mentioned during a time period, but by a timeline displaying the evolution of the tag clouds.

2.2 Online Written Discussions

As mentioned in the introduction of this section, preceding research has mostly focussed on online written discussions rather than on verbal discussions. The structure of online written discussions differs from that of verbal discussions, and thus possibly requires a somewhat different approach regarding processing and visualising the content of these different types of discussions. Online written discussions usually have a hierarchical structure (Pascual-Cid and Kaltenbrunner, 2009). These hierarchical structures often contain threads and subthreads depicting the different courses the discussions take over time. Pascual-Cid and Kaltenbrunner (2009) used these hierarchical structures to analyse human behaviour. It is imaginable that these hierarchical structures become cluttered and unclear as the discussion progresses and threads and subthreads emerge. To make large, cluttered discussions clear again, Pascual-Cid and Kaltenbrunner (2009) developed a tool which allows readers to navigate through the discussion structure; they built a system which creates an interactive map of the conversation and may by this provide understanding of how the discussion progressed and how controversial it was. This project would however most likely require a different approach, since verbal discussions do not have (visible) threads which could be visualised through a tool such as theirs.

Much research conducted on the visualisation of online written discussions has been user-oriented. The aforementioned research conducted by Pascual-Cid and Kaltenbrunner (2009) is likely to fit that category, since it is aimed at visualising the social aspect of online discussions rather than mapping the content. In this type of research, researchers aim to show who said what to whom. Even though the content of discussions is taken into account, it does not receive the main focus; the main focus is the social aspect of who is interacting with whom and who is more active during discussions than others. An example of such research conducted in an educational setting, with the focus on online discussions and visualising their social aspect, is the research conducted by Sheridan and Witherden (2007). Their research was centred around the visualisation of the involvement of students in discussions taking place within a learning management system (LMS), such as Blackboard. This could aid teachers in acquiring an overview of the students involved in a discussion and its different subthreads, as monitoring this is thought of as a difficult and time-consuming task without additional tools (Sheridan and Witherden, 2007). It may also assist instructors in intervening in the discussion if they feel that a specific student or discussion thread is becoming the centre of attention, if in some threads the activity decreases, or if certain threads seem to separate from the overarching discussion. Just like the research conducted by Pascual-Cid and Kaltenbrunner (2009), the above-mentioned research takes the content of the discussion into account, but focusses primarily on the social aspect of the discussions. In comparison to the latter, the current project focusses primarily on the representation of the content of the verbal discussions.

In conclusion, a number of studies focus on online written discussions, which are useful to consider as they do examine how to visualise discussions. However, verbal discussions differ in structure from online written discussions. Moreover, the current project focusses on the content representation, while most research into online written discussions focusses primarily on the representation of the social aspect of discussions.

2.3 Visualisation Tools

There are different visualisation tools available, each of which can be used for specific tasks and different forms of visualisation. Some of the tools provide semantically richer visualisations, such as concept maps (Eppler, 2006; Villalon and Calvo, 2011), than others, for example tag clouds (Zubiaga et al., 2009; Heimonen et al., 2010) and graphs (McCormick, 2013). The aim of the visualisation part of this project is to provide a support tool for discussion participants through which they can see at a glance what the discussion is about and what has just been mentioned by one of the participants. Research has suggested (Lohmann et al., 2009) that a tag cloud with a circular layout, which displays tags decreasing in popularity from the centre of the cloud towards the periphery, is the best layout for the task of finding the most popular tags. Discussion participants would most likely search for the most popular words in a tag cloud during and after a discussion. Consequently, the layout most suitable for the task of discovering the most popular tags in a tag cloud would be a good choice for the current project.

2.4 Text Filtering

Natural Language Processing (NLP) is the analysis and processing of text by computers. NLP consists of several theories and techniques, such as part-of-speech tagging, tree parsing, and machine translation. Research has been done in this area as early as 1954 (Hutchins, 2005), when a Georgetown-IBM system demonstrated the automatic translation of sixty Russian sentences into English. The current project employs the stemming and singularisation NLP techniques. The first stemmer was written in the late sixties by Julie Beth Lovins (1968). The most commonly used stemmer, however, was written by Martin Porter (1980) and is, up to the moment of writing this thesis, still the standard stemming algorithm for the English language. Porter improved and extended his work on the stemmer and wrote Snowball (2001), a framework for writing stemming algorithms, and used it to implement a new stemming algorithm himself. The latter is the algorithm that has been used within this project. The singularisation technique, which transforms plural nouns into singular nouns, has existed for quite some time now. It was, for instance, applied in a research project conducted by Paul Bowden, Peter Halstead, and Tony Rose (1997) on the singularisation of nouns using a corpus-based list of irregular forms instead of a dictionary.

3 Method and Approach

The method and approach section elaborates on the approach taken to answer the research questions. Firstly, the approach for Automatic Speech Recognition (ASR) is discussed through the use of audio recordings and software. Secondly, the different text filtering techniques applied to the transcriptions are explained. Thirdly, the tag cloud visualisations are discussed, and lastly, the steps taken to visualise the discussions per speaker are elaborated on. For this project, the decision was made to only work with free and mostly open source software, to ensure that the project can be reproduced and similar, if not the same, results can be obtained.

3.1 Automatic Speech Recognition

The discussions focussed on within this project are verbal discussions. To be able to visualise them, it is necessary to extract the words from the sounds. One way would be to manually transcribe the audio, but this is a time-consuming task. A preferable approach is to use ASR. ASR is not capable of providing a 100% accurate transcription, but it does perform the task significantly faster than a human can. Since the goal of the project is a program capable of visualising an ongoing verbal discussion, manual transcription of audio recordings is not a preferred approach and so ASR software is used. Depending on the software one uses, one will get better or worse results; commercial software tends to deliver more accurate audio transcriptions than free and open source software. However, the decision was made to use free and open source ASR software in this project instead of commercial software, to facilitate the possibility of reproducing this project without limitations. When choosing the ASR software, several options were explored: CMU Sphinx3, HTK4, Julius5, ISIP6, SPRAAK7, and Google's browser-based Dictation8. Preferred was software that was as ready to use as possible, that had shown activity on its website within the last two years (to ensure software that is as up to date as possible), that was written in C, C++, Java, or Python, and that would work on a Windows 7, 64-bit machine. With these constraints in mind, after comparing the software, CMU Sphinx was chosen as the ASR software for this project. In the sections below, CMU Sphinx and the audio recordings are elaborated on.

3.1.1 CMU Pocketsphinx

The CMU Sphinx software was considered the most promising option for this project when comparing it to the other aforementioned free and open source software. CMU Sphinx is a speech recognition toolkit with different packages built for multiple applications9:

• Pocketsphinx: lightweight, C-written recogniser library for speed and portability

• Sphinxbase: support library needed by Pocketsphinx

• Sphinx4: flexible, adjustable, and manageable, Java-written recogniser

• Sphinxtrain: tools for the training of acoustic models

3 http://cmusphinx.sourceforge.net/
4 http://htk.eng.cam.ac.uk/
5 http://julius.osdn.jp/en_index.php
6 http://www.isip.piconepress.com/projects/speech/
7 http://www.spraak.org/
8 https://dictation.io/
9 http://cmusphinx.sourceforge.net/wiki/tutorialoverview

The choice of which library to use for the speech recogniser was between Pocketsphinx and Sphinx4. According to the CMU Sphinx website10, the accuracy of both recogniser libraries is quite similar, so accuracy should not be the leading argument when choosing one or the other. It is important for this project that the software is fast, since the program is intended to work in real time. Consequently, Pocketsphinx was chosen over Sphinx4, as it was built for speed.

To get CMU Pocketsphinx running, two packages are required: Pocketsphinx and Sphinxbase11.

After compiling the Sphinxbase projects and subsequently the Pocketsphinx projects via Microsoft Visual C++ 2010 Express, Pocketsphinx was run in the terminal through the following command:

bin\Release\pocketsphinx_continuous.exe -infile [file_name] -hmm model\en-us\en-us -lm model\en-us\en-us.lm.dmp -dict model\en-us\cmudict-en-us.dict

In the above command line call, the first section indicates the program file which runs the speech recognition. The second section, -infile, lets the program know that one wishes to run the speech recognition on an already recorded file; if one wishes to run the speech recognition directly from real-time microphone input, one should use -inmic instead. Furthermore, -hmm, -lm, and -dict tell the program which acoustic model files, language model, and dictionary to use, respectively, and where to find them.

When running the above command line call with an audio file, the program runs and prints the output of the transcription in the terminal, along with information about the process, such as the number of syllables recognised. If one wishes to write the output to an output file, one needs to specify that at the end of the command with >, followed by the name of the file:

bin\Release\pocketsphinx_continuous.exe -infile [file_name] -hmm model\en-us\en-us -lm model\en-us\en-us.lm.dmp -dict model\en-us\cmudict-en-us.dict > [name_of_output_file].txt

If a different output folder is preferred, the (full) path can be added before the name of the output file to place the file there.

A batch file was used to run the speech recognition for multiple files. This accelerated (and semi-automated) the process of obtaining transcriptions by removing the need to repeatedly check whether Pocketsphinx had finished and to manually change the commands to provide it with the correct input and output files. Once the audio files had been run through Pocketsphinx, the transcriptions were available in the output files. These transcriptions can subsequently be used for the creation of the tag clouds, but not before they are filtered. The filtering of the transcriptions is described in section 3.2. Section 3.1.2 describes the approach taken for the audio recordings within Pocketsphinx.
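The original batch file is not reproduced here, but an equivalent loop could also be written in Python. The sketch below assumes the audio files live in an audio folder, that a transcriptions folder may be created next to it, and that the directory layout matches the command shown above; it simply invokes pocketsphinx_continuous once per file through subprocess.

import subprocess
from pathlib import Path

MODEL = r"model\en-us"
OUT_DIR = Path("transcriptions")
OUT_DIR.mkdir(exist_ok=True)

# Run pocketsphinx_continuous on every .wav file and store one transcription per file,
# mirroring the command line call shown above.
for wav in Path("audio").glob("*.wav"):
    with open(OUT_DIR / (wav.stem + ".txt"), "w") as out:
        subprocess.run(
            [r"bin\Release\pocketsphinx_continuous.exe",
             "-infile", str(wav),
             "-hmm", MODEL,
             "-lm", MODEL + r"\en-us.lm.dmp",
             "-dict", MODEL + r"\cmudict-en-us.dict"],
            stdout=out,       # equivalent to the > redirection in the command above
            check=True,
        )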

10 http://cmusphinx.sourceforge.net/wiki/tutorialbeforestart

11 For this project pocketsphinx-5prealpha-win32 and sphinxbase-5prealpha-win32 were downloaded from http://cmusphinx.sourceforge.net/wiki/download. Be sure to rename sphinxbase-5prealpha-win32 to just sphinxbase after downloading and unpacking, or else pocketsphinx will not be able to find the folder.

3.1.2 Audio Recordings

Before the transcription as described above can be conducted, audio recordings need to be obtained. Perhaps in the end one would wish to use the program in any situation under any circumstances, but for now the accuracy of the transcription depends quite strongly on the quality of the audio. Ideally, the audio has no noise and no background sounds or music, features native American speakers who articulate well, and, lastly, has no overlapping audio due to simultaneously speaking participants.

The latter is not an issue when visualising the discussion per speaker; each speaker can have his or her own audio stream and simultaneous speaking will not interfere with the recordings of the other person. Hence, this would not be a requisite for the audio, in contrast to having one audio stream for all speakers.

The clearer the phones, the better the chances are that the speech recogniser correctly identifies a word. The more words are mumbled or strung together, or the more sounds are disrupted by background music or noise, such as wind or tumult, the harder it is for the speech recogniser to distinguish and correctly transcribe the words.

Lastly, each software package most probably has its own requirements regarding the type of audio files which can be used within the program. CMU Sphinx requires the user to supply a mono-channel 16-bit 16000 Hz .wav file. If the audio file is not in the correct format, free online and downloadable programs can convert the files to the correct format12.
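The conversion itself was done with the external tools mentioned in footnote 12, but whether a recording already satisfies Pocketsphinx's requirements can be checked with Python's standard wave module. The helper below is a small sketch of such a check; the function name is illustrative and not part of the project's code.

import wave

def is_pocketsphinx_ready(path):
    """Return True if the .wav file is mono, 16-bit, 16000 Hz, as CMU Sphinx expects."""
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels() == 1 and      # mono
                wav.getsampwidth() == 2 and      # 16 bits = 2 bytes per sample
                wav.getframerate() == 16000)     # 16 kHz sample rate

print(is_pocketsphinx_ready("discussion_speaker1.wav"))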

3.2 Text Filtering

Pocketsphinx outputs raw transcriptions of audio files. Raw in this case means that no preprocessing has been applied yet to the text. These transcriptions can already be used for the creation of tag clouds. However, the tag clouds will contain a number of words which are not useful when attempting to understand the content of discussions, for example: the, and, or, I, you, we. Furthermore, the same words but with different suffixes will not be recognised as the same word and are consequently displayed separately in the tag cloud. To get a quick overview of what the discussion is about, one would most likely want only the content-wise most meaningful words. Moreover, one would also most likely want words with the same stem but with different suffixes to count as the same word and treat them accordingly by displaying them as the same (stem) word in the tag cloud. For this project, the programming language Python was used for the processing of the transcriptions. The filtering techniques applied to the transcriptions are discussed below.

3.2.1 Stopword Filtering

The first filtering technique applied to the raw transcriptions is stopword filtering. The term stopword usually refers to the most common words in a certain language, but in principle a stopword list can be any list of words one wishes to remove from text before processing it.

12 For this project, use was made of Voxengo r8brain and Bigasoft Total Video Converter (to extract audio from video): http://www.voxengo.com/product/r8brain/ and http://www.bigasoft.com/total-video-converter.html respectively.

The Natural Language Toolkit (NLTK) is available for Python. This toolkit is free and open source and can be used for stemming, semantic reasoning, tokenisation, and parsing. The toolkit also provides the possibility to remove stopwords. The list of stopwords for the English language is predefined by the toolkit and contains 127 stopwords. The stopword list facilitates the removal of some of the most common words, such as determiners, pronouns, prepositions, and auxiliary verbs. An example of stopword filtering can be seen in Figure 2. Figure 2a displays a tag cloud of one of the raw transcriptions. As can be seen, the most common English words are mentioned most often during a conversation and are consequently displayed in the centre of the tag cloud. In Figure 2b the stopword filtering has been applied and the words in the centre of the cloud are now more meaningful when trying to determine the content of the conversation.

(a) No filtering (b) Stopword filtering

Figure 2: The same tag cloud section about the topic “undocumented persons” before and after applying stopword filtering
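As a rough sketch of this first filtering step (the project's exact code is not reproduced here), the NLTK stopword list can be applied to a raw transcription as follows; the required corpora are assumed to have been downloaded once with nltk.download.

import nltk
from nltk.corpus import stopwords

# One-time downloads: nltk.download("stopwords"); nltk.download("punkt")
stop_words = set(stopwords.words("english"))   # the predefined English stopword list (127 words at the time of the project)

def remove_stopwords(raw_transcription):
    """Tokenise the raw Pocketsphinx output and drop stopwords and non-alphabetic tokens."""
    tokens = nltk.word_tokenize(raw_transcription.lower())
    return [word for word in tokens if word.isalpha() and word not in stop_words]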

3.2.2 Singularisation

After the most common English words have been filtered out of the text, singularisation is applied. This second technique transforms plural nouns into singular nouns. The function written for this transformation makes use of the Python inflect package: for each word it either returns the singular form, which is added to a new list of words, or returns False, in which case the original word is added to the new list. By doing so, the plural nouns become singular and the singular nouns and non-nouns remain the same. The singularisation is mostly applied to nouns ending in ‘-s’, the most common plural suffix, but the word ‘people’ will also be singularised to the word ‘person’. This technique is applied to ensure that the frequencies of both the singular and the plural version of a word are added together. This helps the word, of which now only the singular form will be displayed, to be more visually present in the tag clouds.
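A minimal sketch of such a function, assuming the inflect package is installed (pip install inflect), could look like this:

import inflect

engine = inflect.engine()

def singularise(words):
    """Replace plural nouns by their singular form; leave all other words unchanged."""
    singularised = []
    for word in words:
        singular = engine.singular_noun(word)   # returns False if the word is not a plural noun
        singularised.append(singular if singular else word)
    return singularised

# e.g. ['discussions', 'people', 'software'] -> ['discussion', 'person', 'software']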


3.2.3 Stemming

The third filtering technique applied to the text, which now no longer contains stopwords or plural nouns, is stemming. Stemming removes suffixes and returns only the stem of words; for example, ‘transferred’ after stemming becomes ‘transfer’. Once more, an NLTK package was used: stem. This package allows one to choose which stemmer to use. The original, most commonly used Porter stemmer is available, but also the Snowball stemmer, which was previously mentioned in section 2.4. The English Snowball stemmer tested better than the Porter stemmer13, as expected given that the Snowball stemmer is an extension of the Porter stemmer. Consequently, the Snowball stemmer for the English language was chosen for this filtering technique. Stemming is applied to the text for the same reason that singularisation is applied: to help a word be more visually present in a tag cloud by adding up the frequencies of the same word with different suffixes. Words such as ‘go’, ‘going’, and ‘goes’ will be stemmed to ‘go’. The, most likely, desirable effect is that only the word ‘go’ will be present in the tag clouds instead of all three forms.

13 http://www.nltk.org/howto/stem.html
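With NLTK's stem package, this step amounts to a few lines; the sketch below applies the English Snowball stemmer to the already singularised word list.

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def stem_words(words):
    """Reduce each word to its stem so that differently suffixed forms are counted together."""
    return [stemmer.stem(word) for word in words]

# e.g. stemmer.stem('transferred') -> 'transfer'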

3.2.4 WordNet

The last filtering technique applied is based on a database of nouns from WordNet. WordNet is “a large lexical database for English”14. It clusters nouns, adjectives, verbs, and adverbs into sets of cognitive synonyms and links them using lexical and conceptual-semantic relations. From the downloadable database, the index of nouns was taken and processed to obtain a list of nouns as recognised by WordNet. For each word in the list of words that have gone through the stopword filtering, singularisation, and stemming, a check is performed to see whether the word is recognised as a noun or not. Recognised nouns remain in the list of words and non-nouns are dismissed. The content of discussions is usually represented by the nouns (and also the verbs) mentioned during the discussions. This filtering technique therefore facilitates the displaying of the more ‘meaningful’ words mentioned during discussions.

14 https://wordnet.princeton.edu/
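The sketch below illustrates this last step. It assumes the index.noun file from the downloadable WordNet database is available locally; the exact file handling in the project may have differed.

def load_wordnet_nouns(index_path="index.noun"):
    """Build a set of nouns from WordNet's noun index file (header lines start with spaces)."""
    nouns = set()
    with open(index_path, encoding="utf-8") as index_file:
        for line in index_file:
            if line.startswith(" "):
                continue                         # skip the licence/header block
            nouns.add(line.split(" ", 1)[0])     # the first field on a data line is the lemma
    return nouns

wordnet_nouns = load_wordnet_nouns()

def keep_nouns(words):
    """Dismiss every word that WordNet does not list as a noun."""
    return [word for word in words if word in wordnet_nouns]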

3.3 Visualisation

Tag clouds are used for the visualisation of the content of the discussions. In the sections below, the usage of the tag clouds is elaborated on, explaining the original tag clouds and the Wordle tag clouds, respectively.

3.3.1 Original tag clouds

There are some online tools, such as Wordle (discussed in section 3.3.2), that create tag clouds. However, it is preferred within this project to have offline tools available, and preferably free and open source software that can be adapted and intertwined with other code. Therefore, free and open source Python code15 has been used. This tag cloud visualisation takes a dictionary of words and their frequencies as input and then builds a tag cloud by placing the word with the highest frequency in the middle of the page. Then, the second highest frequency word is placed in an open spot above, next to, or below the first word. The program then continues to place words with decreasing frequencies in a circular position around the other words, increasing the radius until a spot has been found where the word will not collide with other words. The larger the font size of the word in the tag cloud, the more often the word has been mentioned during a discussion. This type of layout is in accordance with the layout suggested by Lohmann et al. (2009). For the task of finding the most popular tags, which is most probably what a participant is looking for, this would be the preferred layout. Figure 3a and Figure 3b display two examples of the original tag cloud.
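The following is a much simplified sketch of such a circular placement routine, not the project's actual code: words are sorted by frequency, the most frequent is placed at the centre, and every next word is moved outwards along a spiral until its estimated bounding box no longer overlaps a previously placed one. The font sizing rule and the box estimate are assumptions, and rendering is left out.

import math

def overlaps(a, b):
    """Axis-aligned bounding-box overlap test; a box is (left, bottom, right, top)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def layout_tag_cloud(frequencies, radial_step=2.0, angle_step=0.3):
    """Place words on a spiral from the centre outwards, most frequent first."""
    placed = []                                   # list of (word, font_size, box)
    for word, freq in sorted(frequencies.items(), key=lambda item: -item[1]):
        size = 10 + 4 * freq                      # crude font size from frequency (assumption)
        width, height = 0.6 * size * len(word), size
        radius = angle = 0.0
        while True:
            x, y = radius * math.cos(angle), radius * math.sin(angle)
            box = (x - width / 2, y - height / 2, x + width / 2, y + height / 2)
            if not any(overlaps(box, other[2]) for other in placed):
                placed.append((word, size, box))
                break
            angle += angle_step                   # move a little further along the spiral
            radius += radial_step * angle_step
        # drawing the word inside `box` at `size` is left to the rendering library
    return placed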

(a) Tag cloud representing a conversation about undocumented people

(b) Tag cloud representing a discussion on boat refugees

Figure 3: Examples of the original tag clouds

3.3.2 Wordle tag clouds

The original tag clouds serve their purpose and facilitate the displaying of the discussions. However, the cloud can only be changed in colour, font, and size. Some might find this layout clear and good enough, but it would be interesting to see what else could be done for the visualisation. Therefore, Wordle, an online, free, but closed source tag cloud tool, was used. By making use of its different possibilities to change the layout and physical appearance, several Wordle tag clouds have been made. These have been used in the user evaluation to acquire an insight into which tag cloud appearance would be preferred.

Wordle allows a user to insert a piece of text and outputs an aesthetically pleasing tag cloud. The user can then change the number of words, the font, the alignment of the words, the colour scheme, and whether rounder or straighter edges are preferred. For the evaluation, combinations of those settings were used to create tag clouds. To investigate how many words would be preferred within a tag cloud, the maximum number of words was changed from 250, to 100, to 50, to 25. The participants were then asked which of the options they preferred. Other combinations of settings were, for instance, used to obtain insights into which colouring schemes would provide optimal search. Figure 4 shows two of the created Wordle tag clouds.

(a) Wordle tag cloud with default settings

(b) Wordle tag cloud with rounder edges, ‘summer’ colour scheme, and horizontal layout

Figure 4: Examples of Wordle tag clouds

3.4 Visualisation per Speaker

It is useful to see the content of the discussion displayed in a tag cloud to get an idea of what the discussion is about. If one would however like to respond to what a certain participant has said, it is quite difficult to do so, as everything that has been said is displayed in one cloud. To accomplish visualising what is said by only one speaker, two additional steps need to be conducted: one step before executing the speech recognition, and one step after acquiring the tag clouds. To be able to make a distinction between speakers, it is necessary to have separate audio channels when recording the discussions. Each participant should be recorded on his or her own channel to facilitate the transcription of only his or her part of the discussion. Once that has been done, the speech recognition, text filtering, and tag cloud visualisation steps can be executed. To then be able to see what the individual participants are saying, the tag clouds should be visualised together. Within this project, the decision has been made to display the separate tag clouds together in one picture. It could also be possible to display the tag clouds for each participant on individual screens. The tag clouds are updated every thirty seconds by adding the newly spoken words to the previously spoken words. By doing so, the tag clouds keep expanding and evolving. To be able to reflect back on what has been said, the updates of the tag clouds are saved in a video. When creating the video, the audio of the participants is added to the video, to not only provide a way to look back at what the tag clouds displayed, but to also hear what has been said.
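In outline, the per-speaker update cycle could be driven by a loop such as the one below. This is a sketch only: the callables passed in stand for the Pocketsphinx call, the NLP filtering pipeline, the tag cloud layout, the side-by-side rendering, and the stop condition, and are not functions from the project's code; the per-speaker stream objects and their read method are likewise assumptions.

import time
from collections import Counter

UPDATE_INTERVAL = 30  # seconds, the update interval used in this project

def run_per_speaker_visualisation(streams, transcribe, filter_words, layout, draw, is_running):
    """Every thirty seconds, extend each speaker's word counts with the newly
    spoken words and redraw the combined picture of all tag clouds."""
    counts = {speaker: Counter() for speaker in streams}
    while is_running():
        time.sleep(UPDATE_INTERVAL)
        for speaker, stream in streams.items():
            new_words = filter_words(transcribe(stream.read(UPDATE_INTERVAL)))
            counts[speaker].update(new_words)
        draw({speaker: layout(word_counts) for speaker, word_counts in counts.items()})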


(a) Three tag clouds of the participants during the beginning of the discussion

(b) Three tag clouds of the participants near the end of the discussion

4 Evaluation

Within this section, the approach taken to evaluate the different steps of the project and the results obtained are discussed, after which conclusions on the performance of the steps are drawn. Firstly, the technical evaluation is discussed for which the speech recognition and text filtering techniques have been manually investigated and evaluated. Secondly, the user evaluation procedure and outcomes for the tag clouds are discussed in section 4.2.

4.1 Technical Evaluation

As mentioned above, the technical evaluation was done mostly by hand. It was attempted to automate the evaluation procedures as much as possible, but automatically comparing transcriptions turned out to be more difficult than anticipated, and for the filtering techniques it had to be checked manually whether there were any outliers, as one cannot really come up with a list of cases to examine beforehand. Moreover, a built automatic evaluation would have had to be tested thoroughly before one could use it and be confident about the results it returns. In addition, if a list of checks was made for the filtering, it would still have to be examined whether any outliers were overlooked. Manual evaluation thus seemed the best approach for this project.

4.1.1 Automatic Speech Recognition

For the evaluation of the ASR, the obtained transcriptions had to be compared to the correct transcriptions of the audio. Since these were not available, the audio recordings were first manually transcribed by a human, providing transcriptions that are as correct as possible, taking human error into account.

The audio recordings of the native American speakers contained little to no background noise and seemed, at first glance, to have been correctly transcribed; correct in this case means that there did not seem to be too many inserted words which should not be present. Both the manually and automatically generated transcriptions contained a similar number of words. The audio recordings of the non-native American speakers, however, needed some preprocessing first. While recording the live discussions, the microphones of the participants also recorded the voices of the other participants. Even though the volume of those voices was low, Pocketsphinx still tried to transcribe what was being said. This cluttered the transcriptions, and thus the visualisations, of the participants with a large number of unwanted inserted words. Setting a threshold on the audio volumes did not help, and so segments of silence were inserted into the audio recordings when the participant was not talking, to obscure the voices of the other participants in the background. This solved the problem and returned correct transcriptions.

To compare the automatically generated transcription with the manually generated transcription, six types of words were taken into account:

• N - the total number of words
• E - the exact matching words
• C - the close matching words
• S - the substituted words
• D - the deleted words
• I - the inserted words

When comparing the two transcriptions, a manual transcription is marked by replacing the words in the transcription with one of these letters. If a word is present in both transcriptions at roughly the same position, the word is replaced by an E. If the word is close to being an exact match, but is not entirely correct, such as ‘discussion’ instead of ‘discussions’, or ‘to’ instead of ‘too’, the word is annotated with a C. Similar words which could be mistaken for one another, such as ‘accommodate’ and ‘commit’, or ‘disruption’ and ‘destruction’, within a couple of positions from where they are expected to be, are marked with an S. Words which are present in the manual transcription, but do not occur in the automatically generated transcription, are marked with a D. And lastly, words which are present in the automatically generated transcription but not in the manual one are marked with an I.

Subsequently, for each comparison the letters in the marked transcriptions are counted and three different measurements are calculated: the accuracy (Acc) - a common performance measurement, the word error rate (WER) - a common performance measurement for speech recognition which also takes the inserted words into account on top of the substituted and deleted words, and the word accuracy (WAcc) - returning the accuracy with regards to the WER. The equations for these measurements are as follows:

1. Acc = (N - D - S) / N
2. WER = (I + D + S) / N
3. WAcc = 1 - WER
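As a small worked illustration of these formulas (our own, not code from the project), the three measurements can be computed directly from the counted letters, with the choice of counting C as correct (Tables 1 and 3) or as a substitution (Tables 2 and 4) made explicit:

def asr_scores(n, c, s, d, i, close_counts_as_correct=True):
    """Return (accuracy, word error rate, word accuracy) from the annotation counts."""
    if not close_counts_as_correct:
        s = s + c                   # treat close matches as substitutions
    accuracy = (n - d - s) / n
    wer = (i + d + s) / n
    return accuracy, wer, 1 - wer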

For each native and non-native American speaker the measurements were calculated; see Tables 1 to 4. Table 1 and Table 2 display the performance of the native American speakers. For the first table, the number of C's was added to the number of E's, counting them as correct transcriptions. In the second table, the C's were added to the number of substituted words, counting them as incorrect transcriptions.

        Acc.    WER     WAcc.
NA-1    0.662   0.377   0.623
NA-2    0.614   0.418   0.582

Table 1: Accuracy, word error rate, and word accuracy of two native American speakers when adding C to E.

        Acc.    WER     WAcc.
NA-1    0.628   0.411   0.589
NA-2    0.579   0.453   0.547

Table 2: Accuracy, word error rate, and word accuracy of two native American speakers when adding C to S.

In Table 3 and Table 4, the performances of the non-native American speakers are displayed. Again, the first table counts the C's as correct transcriptions and thus adds them to the E's, and the second table counts the C's as incorrect transcriptions, adding them to the S's.

        Acc.    WER     WAcc.
nNA-1   0.411   0.614   0.359
nNA-2   0.718   0.318   0.682
nNA-3   0.570   0.474   0.526

Table 3: Accuracy, word error rate, and word accuracy of three non-native American speakers when adding C to E.

        Acc.    WER     WAcc.
nNA-1   0.390   0.635   0.365
nNA-2   0.704   0.332   0.668
nNA-3   0.545   0.499   0.501

Table 4: Accuracy, word error rate, and word accuracy of three non-native American speakers when adding C to S.

When classifying C's as E's, the native American speakers have an approximate average of 65% accuracy and 60% word accuracy. The non-native American speakers have an approximate average of 57% accuracy and 52% word accuracy. Classifying the C's as incorrect, and thus as S's, the native American speakers have an approximate average of 60% accuracy and 57% word accuracy. The non-native American speakers respectively have an approximate accuracy of 55% and word accuracy of 51%. These accuracies do not differ that much from each other; they never differ by more than 8 percentage points. What is interesting to see is that two of the non-native American speakers perform in similar ranges as the native American speakers, scoring 0.682 and 0.526 on word accuracy in comparison to 0.623 and 0.582 for the native American speakers.

A comparison conducted this year (2015) of different paid-for speech recognition software packages, to determine which belong in the top ten of best ASR software, actually listed two ASR programs with accuracy scores of 64 and 60 per cent16. The current results within this project are not far from these results. It might be too far-fetched to conclude from this that Pocketsphinx could compete with these programs, but it is at least an indicator that the performance of the speech recognition is quite reasonable. On the website of CMU Sphinx it is indicated that each task will have a different performance score, and if the performance is not good enough, one will have to consider modifying the application or training it for specific tasks17. Here, it is also indicated that the news recognition task, for example, scores a 20-25% accuracy. With the previous in mind, one could consider the obtained results as fairly reasonable, especially considering the fact that the models were not trained for any specific task or specific speaker(s).

4.1.2 Text Filtering

For the evaluation of the filtering techniques, it was manually checked whether the techniques were executed correctly, without missing cases or making undesired changes. For the stopword and WordNet filtering, a list of words was printed after the filtering had taken place. This was done to iterate through the list and examine whether any of the words still in the list should have been removed by the filtering technique. For the stemming and singularisation, the before and after words were printed to examine whether the techniques had done their job correctly.

16 http://voice-recognition-software-review.toptenreviews.com/

17

As mentioned in section 3.2.1, the stopword list consists of 127 of the most common English words. Apart from one word in the list, ‘being’, all words in the list are indeed words one would like to filter out, such as ‘i’, ‘can’, and ‘the’. ‘Being’ is filtered out as it is seen as a verb, in which case removal would be desirable. If however a discussion were about different types of beings, ‘being’ would function as a noun, in which case filtering is not desirable. This problem could be resolved by making the stopword list adaptable. By default, the word ‘being’ would be left out of the list, but if participants do want it filtered out, they can add it to the stopword list.

When examining the list of words returned after stopword filtering, most common words have indeed been filtered out and the list already seems clearer. Nonetheless, fillers, such as ‘ah’, ‘uh’, and ‘ehm’, and contractions, such as ‘i’m’ and ‘it’s’, are not removed. A possible solution would be to expand the contractions before applying stopword filtering; the separate terms ‘i’ and ‘am’ are part of the stopword list and will thus be filtered out once ‘i’m’ is expanded into ‘i’ and ‘am’. One could also add an extra filtering list with the most common contractions and apply filtering in the same manner as the stopword filtering. The same applies to the fillers: a filler list can be added to filter out any possible occurrences.
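Either remedy amounts to one more pass over the word list. The sketch below illustrates the extra-list option with a small, hand-picked filler and contraction list; the entries shown are examples, not an exhaustive list from the project.

# Example extra filter lists; in practice these would be adaptable by the user.
FILLERS = {"ah", "uh", "ehm", "uhm"}
CONTRACTIONS = {"i'm", "it's", "don't", "can't"}

def remove_fillers_and_contractions(words):
    """Drop filler sounds and contractions, analogous to the stopword filtering."""
    unwanted = FILLERS | CONTRACTIONS
    return [word for word in words if word not in unwanted]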

Apart from the aforementioned possible issues, the stopword filtering is correctly applied and does its job as it should. However, making the list of words to remove adaptable would allow users to adapt the program to their preferences.

The singularisation technique also performs its task well. All plural nouns are transformed into singular nouns. Moreover, nouns which are uncountable or plural by default remain the way they are; for example, ‘news’ and ‘software’ remain ‘news’ and ‘software’ after singularisation. There are nonetheless some flaws in the singularisation process; words ending in -ss are “singularised” to only one -s, so for instance ‘access’ is returned as ‘acces’. Furthermore, some of the verbs are singularised when they appear in the third person, for example: ‘approves’ becomes ‘approve’. Even though this is not desired, as it is not a correct application of singularisation, the effect is not a negative one. This transformation would otherwise have been conducted by the stemming filter.

In conclusion, the singularisation filtering technique has some outliers, but performs its task well. A possible improvement could be to add extra rules for the application of singularisation, to avoid the outliers.

Just like the singularisation technique, the stemming technique correctly executes what it is supposed to do: it returns the stems of all verbs correctly. However, this technique too has some outliers. The stemming algorithm stems all verbs ending with the common suffixes such as -ing, -er, -ed, and -ant correctly, but it also stems some of the non-verbs ending with those suffixes, such as ‘nothing’, which becomes ‘noth’, and ‘relevant’, which becomes ‘relav’. The latter is not desired and can be considered a negative effect. Furthermore, even though the stemming is done correctly, the returned stems are not always as informative as one would most likely prefer. The stem ‘decid’ would most likely require someone to think a bit longer than usual about what it is supposed to represent, whereas one would most likely immediately know what ‘decide’ means. The added rules for passing over words ending in an -e and transforming stems ending in -i to -y work correctly. There have so far not been any cases in which these rules had a negative effect.

Even though stemming is useful, as it reduces the number of similar words in a tag cloud by allowing them to be added up once they have been stemmed, the technique has some issues with words that should not be stemmed. In addition, the stem form is visually less strong than the full verb; it would most likely take an extra look before understanding that ‘decid’ means ‘decide’ and ‘mingl’ means ‘mingle’, and that these are not typographical errors.

Lastly, the filtering based on the acquired WordNet noun database performs well. The remaining list of words contains predominantly nouns and some verbs; all other words have been filtered out. On average, the length of the list of words was reduced by just under 30% (0.289). This significantly reduces the size of the tag clouds and would most probably aid in displaying solely the content of the discussions.

4.2 User Evaluation

After the live discussions had been recorded, the discussions were processed and visualised by means of a video showing the build-up of the three tag clouds, one for each of the participants, during the discussions. In addition, tag clouds were made using the free online tool Wordle, to be used in the user evaluation. The participants were asked to fill in an evaluation form containing ten questions regarding the representation of the discussions, the applied filtering techniques, and preferred tag cloud layouts. Since there were only three participants, no strong conclusions can be drawn, but the participants did display common opinions and preferences which could at least be considered a starting point for future research.

Two of the questions were about the recorded discussions. The participants were asked how well they thought the tag clouds represented the content of the discussions and what they thought of the update interval of 30 seconds. They all agreed that the main subjects of the discussions were represented in the tag clouds and that it was clear who spoke most often about which subjects, but they also mentioned that any underlying relations could not really be found in the tag clouds. With regards to the update interval, they would all prefer a shorter interval, as this would enhance the feeling of conducting discussions in real time. A different question, aimed at finding out whether or not the applied filtering techniques were preferred, resulted in the unanimous answer that the filtering is indeed preferred. However, if they were to use the program in the future, they would prefer to be able to choose which filtering techniques to apply.

With regards to the layout, they answered unanimously that a tag cloud layout would be preferred where the most important words are displayed clearly. The preference went out to a functional tag cloud with not too many words, so as not to obscure the important ones, with the most important words centred as much as possible and displayed horizontally. The somewhat less important words could be displayed vertically or at an angle to give the cloud a certain playfulness, but only as long as the important words remain clearly visible. With regards to colouring schemes, two of the participants would prefer the tag cloud to have distinctive colours on a contrasting background for the best visibility of the words; the other participant preferred a richer colour scheme to aesthetically enhance the visualisation. In addition, all suggested using meaningful colouring, for example: words of a similar semantic domain would be displayed in the same colour.


Again, it is difficult to draw strong conclusions from a user evaluation with three participants, but the obtained answers do seem to indicate that a circular layout with the words displayed horizontally and the importance of the words decreasing from the centre to the periphery is the best possible layout for a task in which the most important words need to be found, which is also in line with the findings of a research project on the comparison of tag cloud layouts (Lohmann et al., 2009). Furthermore, even though the tag clouds may not be the best visual representation for semantically rich information, they are capable of representing the content of the discussions reasonably well.

4.3 Conclusions

To conclude the evaluation, all three steps appear to perform reasonably well, as expected or better. The ASR software scored an average of 55% word accuracy, which can be considered a fairly reasonable result. The NLP filtering techniques applied to the transcriptions seem to do what they are intended for and perform well overall; however, their performance could be improved if extra rules are added to guide the techniques in their application. Lastly, the tag clouds are considered reasonably good representations of the discussions' subjects. The chosen layout for the original tag cloud is in line with the research findings of Lohmann et al. (2009) and mostly in line with the preferred layout as mentioned by the participants.

5 Conclusion

For the enhancement of verbal discussions, meetings, or brainstorm sessions within educational settings or workplace environments, it could be useful to investigate whether it is possible to capture the content of the discussions and display it to the participants during the discussions. This could enhance the discussions by supporting the participants' memory of what has been said and possibly steer the discussions. To investigate the possibility of visualising verbal discussions, this research project set out to answer four questions: can the content of verbal discussions be visualised? If so, how can the content of verbal discussions be visualised? Can the content of verbal discussions be visualised per speaker? And lastly, how can the content of verbal discussions be visualised per speaker? These questions have been answered by first transcribing audio recordings of native and non-native American speakers using the Automatic Speech Recognition software Pocketsphinx, making use of American English language and acoustic models. Once the raw transcriptions were obtained, four Natural Language Processing techniques were applied to filter out any words which would not be relevant for the content of the discussions. Lastly, the words from the filtered text were counted and displayed in tag clouds. To then display the content of the discussions for each participant individually, the audio recordings were split to only contain the recordings of one specific participant. The three steps of ASR, NLP filtering, and visualisation of the content in tag clouds were then applied to each participant's recordings, and the obtained tag clouds were displayed on one screen side by side. Even though it is not yet possible to execute these steps in real time, it is possible to perform the entire process chronologically in a semi-real-time fashion.

In conclusion, by executing three steps (transcription of audio to text by means of ASR, filtering the transcriptions using NLP techniques to obtain the content of the discussions, and displaying that content in tag clouds for all participants together or for each participant individually), the research questions have been answered and the objectives have been achieved. The content of verbal discussions can be visualised in semi-real time for all speakers combined or for each speaker individually by means of tag clouds.


6 Discussion and Future Work

The project currently makes use of American English language and acoustic models, while the live discussions were conducted with non-native American speakers. Initially, the intention was to achieve automatic speech recognition for the Dutch language. It is possible to download a Dutch dictionary and acoustic model; however, no Dutch language model is available for download. Without the language model, the acoustic model and dictionary are not capable of yielding a Dutch transcription. While it is possible to build a language model, the focus of the project would have shifted too much towards the speech recognition part, as building such a model requires time and resources (e.g. a complete Wikipedia article database) and would also require an adaptation of the acoustic model and dictionary. Pocketsphinx was already capable of yielding transcriptions using the American English models, and therefore these models were used.

A problem that could have occurred with the above approach is that the accuracies for the non-native American speakers could have been significantly lower than those for the native American speakers, since the acoustic model has not been trained on their accents. However, as the results indicate, combining non-native American speakers with the American English models did not seem to have a large negative effect on the accuracies obtained.

For future work, however, it would most probably benefit the speech recognition results to match the models to the speakers' native tongue; native Dutch speakers will most likely obtain higher accuracies when using Dutch models while speaking Dutch than when using American English models while speaking English. It might also be useful to train the models on specific speakers or at least on specific domains. The more generally applicable a model needs to be, the lower the odds that it will work really well on one specific topic or within one specific domain. To obtain better results, it would therefore be interesting to investigate how, and by how much, speech recognition can be improved by adapting the models to specific speakers, groups of speakers, and domains.

Within this project, the recordings of live discussions were conducted with three participants, who were also the people who did the user evaluation. The answers obtained from their evaluations are useful as a starting point for further investigation, but are not enough to acquire insights into certain details. To acquire more insightful knowledge about the performance of the visualisations or about the best possible layout, testing with a larger group of participants is necessary. One useful insight would be how much of the content of the discussions can be extracted from the tag clouds, rather than merely which words were mentioned most often during the discussion and would therefore represent the subject. Another would be whether a tag cloud is a good visualisation tool to use during discussions to provide the participants with knowledge about the content of those discussions; are tag clouds clear and simple, yet informative enough to provide actual support? Furthermore, one might also like to discover which layout is preferred, how many words a tag cloud should ideally contain, and what the role of colour schemes should be: whether they are only applied to make tag clouds more visually inviting or whether they serve a functional purpose. In conclusion, testing with more people is preferred for the purpose of acquiring more insightful information on what makes a tag cloud a useful discussion visualisation tool.

Even though the process of speech recognition, filtering, and visualisation does not suffer the latency of one to two minutes reported for the meeting support tool of Heimonen et al. (2010), it still does not work in real time. The semi-real-time process could possibly be sped up by applying smarter filtering techniques or faster visualisations. However, as long as the speech recognition software does not improve in processing speed, achieving real-time visualisations will be difficult. It would nevertheless be very useful to conduct more research into the realisation of real-time speech recognition, processing, and visualisation, so that the program can work in near real time.
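By way of illustration, a semi-real-time cycle could poll the growing transcription at a fixed interval and redraw the cloud from the accumulated counts. The sketch below assumes the recogniser appends its output to a text file; the file name, interval, and the simplistic word filter are hypothetical and only indicate the structure of such a loop.

```python
import time
from collections import Counter

# A few stopwords as a stand-in for the full NLP filtering step.
STOP = {"the", "a", "an", "and", "is", "of", "to", "in", "that", "it"}


def simple_filter(text):
    """Count lowercase alphabetic words, ignoring a handful of stopwords."""
    return Counter(w for w in text.lower().split() if w.isalpha() and w not in STOP)


def update_loop(path="transcript.txt", interval=10.0, cycles=6):
    counts, position = Counter(), 0
    for _ in range(cycles):
        with open(path) as f:
            f.seek(position)                   # read only the text added since the last cycle
            counts.update(simple_filter(f.read()))
            position = f.tell()
        print(counts.most_common(10))          # stand-in for redrawing the tag cloud
        time.sleep(interval)


if __name__ == "__main__":
    update_loop()
```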

Furthermore, as the tag clouds currently only visualise word frequencies, it could be useful to conduct further research into semantically richer tag clouds. This could be done by, for instance, applying clustering, or indicating relations between words with specific colours. It might also be interesting to investigate what influence the use of n-grams would have on the representation of the discussions' content. Similarly, it would be interesting to investigate the influence of different visualisations, such as concept maps or graphs, on the extent to which content is represented.
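As a small illustration of the n-gram idea, the sketch below counts bigrams instead of single words so that multi-word terms can appear as a single tag; the token list is invented and the helper is hypothetical, not part of the current implementation.

```python
from collections import Counter


def ngram_counts(tokens, n=2):
    """Count n-grams so that multi-word terms can appear as one tag."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Invented tokens: the bigram "speech recognition" is counted as one tag.
tokens = ["automatic", "speech", "recognition", "needs", "speech", "recognition", "models"]
print(ngram_counts(tokens, n=2).most_common(3))
```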

The tag cloud visualisation is currently a static, non-adaptable tool. In the future, it might be interesting to shift towards an interactive design in which the participants can, for instance, remove words from the tag cloud that they do not want to see displayed, and enlarge words that they feel should be portrayed more prominently. This would make the tool interactive and adaptable and could possibly bring it from the background to the foreground.

Lastly, as it has now been shown that it is possible to visualise the content of discussions, it would be interesting to investigate how well a program like this would perform when targeted at specific settings. Such a tool could be used in educational settings, such as classrooms, seminars, or group projects, and for meetings, brainstorm sessions, or possibly as a support tool for lectures. By doing so, it can be tested how well a tool like this performs in a real context, whether it would need adaptations, and if so, to what extent.


References

Bowden, P. R., Halstead, P., and Rose, T. G. (1997). Dictionaryless English plural noun singularisation using a corpus-based list of irregular forms. Language and Computers, 20:339–352.

Eppler, M. J. (2006). A comparison between concept maps, mind maps, conceptual diagrams, and visual metaphors as complementary tools for knowledge construction and sharing. Information visualization, 5(3):202–210.

Heimonen, T., Ovaska, S., Turunen, M., Hakulinen, J., Rajaniemi, J., and Raiha, K. (2010). Visualization of Multi-sensory Meeting Information to Support Awareness. Information Visualisation (IV), 2010 14th International Conference, pages 194–199.

Hutchins, J. (2005). The first public demonstration of machine translation: the Georgetown-IBM system, 7th January 1954.

Lohmann, S., Ziegler, J., and Tetzlaff, L. (2009). Comparison of Tag Cloud Layouts: Task-Related Performance and Visual Exploration. Proceedings of the 12th IFIP TC 13 International Conference on Human-Computer Interaction: Part I, pages 392–404.

Lovins, J. B. (1968). Development of a stemming algorithm. MIT Information Processing Group, Electronic Systems Laboratory.

McCormick, J. (2013). Visualizing interaction: Pilot investigation of a discourse analytics tool for online discussion. Bulletin of the IEEE Technical Committee on Learning Technology, 15(2):10.

Pascual-Cid, V. and Kaltenbrunner, A. (2009). Exploring Asynchronous Online Discussions Through Hierarchical Visualisation. Information Visualisation, 2009 13th International Conference, pages 191–196.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130–137.

Porter, M. F. (2001). Snowball: A language for stemming algorithms.

Sheridan, D. and Witherden, S. (2007). Visualising and Inferring LMS Discussions. ICT: Providing choices for learners and learning, proceedings ASCILITE.

Villalon, J. and Calvo, R. A. (2011). Concept maps as cognitive visualizations of writing assignments. Journal of Educational Technology & Society, 14(3):16–27.

Yee, K.-P. and Hearst, M. (2005). Content-centered Discussion Mapping. Online Deliberation.

Zubiaga, A., Garcia-Plaza, A. P., Fresno, V., and Martinez, R. (2009). Content-Based Clustering for Tag Cloud Visualization. International Conference on Advances in Social Network Analysis and Mining, pages 316–319.
