Opportunities and Challenges for Adding Speech to Dialogues with a Council of Coaches


Opportunities and Challenges for Adding Speech to Dialogues with a Council of Coaches

Laura Bosdriesz

MSc Interaction Technology

Faculty of Electrical Engineering, Mathematics and Computer Science

EXAMINATION COMMITTEE

Dr. Ir. D. Reidsma (University of Twente, the Netherlands)
D.P. Davison MSc. (University of Twente, the Netherlands)
Dr. Ir. H. op den Akker

18-05-2020


ABSTRACT

With the ageing of the population, more (chronic) diseases arise, putting pressure on the healthcare system. This requires a shift from treatment towards prevention of age-related diseases, by enabling and stimulating the ageing generation to take care of their own health and condition. The Council of Coaches project is developing a team of virtual coaches that can help older adults to achieve their health goals. Each coach offers insights and advice based on their expertise. The Council of Coaches enables multi-party interaction between the coaches and the user, supporting not only user-coach but also coach-coach conversations.

Currently, users of the Council of Coaches interact with the coaches by selecting one of several predefined multiple-choice options. Although this is a robust method to capture user input, it is not ideal for the elderly for several reasons, such as loss of physical function. Spoken dialogues can offer a better user experience, but also come with many complexities. In this research topics report, the state of the art in this area is investigated by answering multiple research (sub-)questions. An extensive literature study has been conducted to gain knowledge about conversational interfaces, speech recognizers, virtual coaching systems and the Council of Coaches.

This research topics report serves as preparation for my thesis, which attempts to answer the main question: To what extent can spoken interaction offer a valuable addition to the multi-party virtual Council of Coaches application?

All gained background knowledge will help with the adaptation of the Council of Coaches system to support spoken interactions.


CONTENTS

1 Introduction
  1.1 Council of Coaches
  1.2 Why Speech?
    1.2.1 Problem Statement
  1.3 Approach
  1.4 Document Structure

2 Conversational Interfaces
  2.1 An Introduction to Conversational Interfaces
  2.2 Conversation Mechanisms
  2.3 The Technologies in Conversational Interfaces
    2.3.1 Automatic Speech Recognition Systems
    2.3.2 Spoken Language Understanding
    2.3.3 Dialogue Management
    2.3.4 Response Generation
    2.3.5 Text-to-Speech Synthesis
  2.4 Limitations of Conversational Interfaces
    2.4.1 Conversation Mechanisms
    2.4.2 Naturalness of Speech
    2.4.3 Speech Input Variations
    2.4.4 Expectations
    2.4.5 Long-term Engagement
    2.4.6 Privacy Issues

3 Speech Recognizers
  3.1 Closed Source Speech Recognizers
  3.2 Open Source Speech Recognizers

4 Related Work
  4.1 Kristina
  4.2 Meditation Coach
  4.3 Exercise Advisor
  4.4 In-home social support agent

5 The Council of Coaches Platform
  5.1 The Coaching Domain
  5.2 The Coaches
  5.3 The Council of Coaches Interface
  5.4 The Dialogue Style
  5.5 The Dialogue Structure and Coaching Content
  5.6 The WOOL Dialogue Platform

6 Conclusion

References


LIST OF FIGURES

1.1 The Council of Coaches living room User Interface. Figure reproduced from [1]
1.2 The Council of Coaches scripted responses to interact with the coaches. Figure captured from https://www.council-of-coaches.eu/beta/
2.1 The components of a spoken language conversational interface. Figure constructed based on the steps described in [2]
4.1 KRISTINA prototypes [3], captured from http://kristina-project.eu/en/
4.2 The Meditation Agent. Figure reproduced from [4]
4.3 FitTrack interfaces
4.4 Wizard-Agent setup for the in-home social support agent for the elderly. Figure reproduced from [5]
5.1 Council of Coaches Main Menu screen. Figure reproduced from [1]
5.2 First screen of the account creation process, guided by Coda. Figure reproduced from [1]
5.3 The coach selection with recommendations. Figure reproduced from [1]
5.4 The Physical Activity Book widget. Figure reproduced from [1]
5.5 The four role-playing games that influenced COUCH
5.6 The hierarchy of coaching topics. Adapted from [6]. The blue nodes are topics; the grey nodes have subtopics themselves.


LIST OF TABLES

5.1 The seven coaches from the Council
5.2 Descriptions of the coaching topics (the blue nodes in Figure 5.6)


ABBREVIATIONS

ASR Automatic Speech Recognition

COUCH Council of Coaches

DM Dialogue Manager

NLU Natural Language Understanding

RRD Roessingh Research and Development

SDS Spoken Dialogue System

SLU Spoken Language Understanding

TTS Text-to-Speech Synthesis

VPA Virtual Personal Assistant

VUI Voice User Interface


Chapter 1

INTRODUCTION

Population ageing is a phenomenon that has been evident for several decades in Europe. The population of people older than 65 years is expected to increase from 101 million in 2018 to 149 million by 2050. This increase is even larger in the population aged 75-84 years (60.5%) than in the population aged 65-74 years (17.6%) [7]. Since ageing increases the risk of age-related diseases and functional decline, many of these additional years will be lived with chronic diseases [8, 9]. Some of the most prevalent age-related chronic conditions are diabetes, cardiovascular diseases and chronic obstructive pulmonary disorder (COPD) [10]. Additionally, the elderly suffer from functional deterioration, revealed in decreased mobility and in vision and hearing loss [11]. This population ageing and its related health issues are likely to have a considerable impact on healthcare, requiring a shift from treatment towards prevention of age-related diseases: enabling the ageing generation to stay independent longer and stimulating them to take care of their own health and condition. Research has shown that innovative solutions in the area of mobile health (m-health) and electronic health (e-health) can be useful in personalizing the care provided [12-14]. Key features of these health areas may include nutrition management, exercise planning, appointment scheduling and medicine tracking [12].

With the advancements in digital healthcare, health coaching can be provided by virtual coaches. Virtual coaching in the broader sense is a coaching conversation that takes place digitally, without direct involvement of a human coach. In this work, the focus is on virtual coaches that take the form of a virtual computer character, running on web-based platforms or smartphone applications. Some examples of virtual health coaching systems can be found in Chapter 4. Personalized coaching uses strategies tailored to the user's personal characteristics, such as perceived barriers, personal goals and health status. Coaching interventions aimed at sending reminders, tracking goals, or providing feedback are designed for one individual [15]. Given that adequate coaching for the elderly is important to reduce the pressure on the healthcare system, and that providing it by means of human coaches is neither feasible nor scalable to the required level, e-health and m-health technologies using virtual coaches provide a good infrastructure for personalizing and tailoring the intervention.

Examples from the literature show that personalized and virtual coaching in healthcare can be done by providing more health-related information to the elderly and by using virtual personal coaching, counseling and lifestyle advice to persuade and motivate them to change their health behaviors [16]. This has been investigated for some time already, especially to support patients with chronic conditions [17, 18]. For example, Klaassen et al. [19] developed a serious gaming and digital coaching platform supporting diabetes patients and their healthcare professionals. Although such single-coach systems have already shown a positive effect, a better performance in health coaching is expected to be achieved through a multi-agent virtual coaching system [16, 20].


For this reason, the "Council of Coaches" (COUCH) project revolves around the concept of multi-party virtual coaching.

1.1 Council of Coaches

The Council of Coaches is a European Horizon 2020 project aiming to develop a tool that provides multi-party virtual coaching for the elderly to improve their physical, cognitive, mental and social health [21]. The council consists of a number of coaches, each specialized in their own domain. They interact with each other, and also listen to the user, ask questions, inform, jointly set personal goals and inspire the users to take control over their health and well-being. One of the objectives of the project was to develop the coaches as interesting characters. This character design is mainly reflected in providing every coach with their own background story and related personality. Any combination of specialized council members collaboratively covers a wide spectrum of lifestyle interventions, with the main focus on age-related impairments, chronic pain, and Diabetes Type 2. The project includes seven coaches and a robot assistant, who leads the interaction between the user and the system. In Figure 1.1, all seven coaches and the robot assistant are present in the Council of Coaches living room. From left to right, these are: the peer support character, the physical activity coach, the social coach, the diabetes coach, the cognitive coach, the chronic pain coach, the robot assistant and the nutrition coach. An extensive description of these characters and their development is provided in Chapter 5.

Figure 1.1: The Council of Coaches living room User Interface. Figure reproduced from [1]

Users are provided with an interface using buttons with scripted responses to interact with the coaches (see Figure 1.2). This setup limits user input, which has the strength of giving the coaches more clarity on what they are responding to [22]. The coaches themselves can naturally keep an interaction going by supporting or contradicting each other, to increase the user's engagement, active participation, reflection and critical thinking about their own health. Besides interacting with the coaches, users can set up the system and change settings by talking to the robot assistant. The system is able to give feedback to users, such as notifications, representations of their behavior, progress towards their goals and the consent required for using and storing specific data. The application also offers the possibility to connect a Fitbit, which can track the user's activity level and their progress in achieving the health goal. More details about the Council of Coaches and its specifications are provided in Chapter 5.


Figure 1.2: The Council of Coaches scripted responses to interact with the coaches. Figure captured from https://www.council-of-coaches.eu/beta/

1.2 Why Speech?

As described in the previous section, COUCH is a text-based application in which users choose from multiple written options in order to interact with the coaches. Other input options for an e-health application could be speech or free text. Free text differs from the restricted approach of COUCH in the sense that users have the opportunity to type anything they want during an interaction. The benefit of the approach COUCH takes is that coaches are certain about what they are responding to, which is more difficult when having to recognize speech or free text. There are some specific features that distinguish the language of conversation from the language of written text, causing some complex issues (discussed in more detail in Chapter 2). Nevertheless, speech also brings many additional advantages, especially for the elderly. Potential benefits can be found in the level of engagement, the maintenance of long-term relationships, the alleviation of loneliness, and the possibility to overcome physical barriers of the ageing process.

Speech can contribute to one of the major challenges in e-coaching, which is to keep the user engaged for a longer period of time [23]. Turunen et al. [23] argue that without long-lasting engagement, the health coach cannot have any further impact on behavior change. They also found that building a long-term relationship between the user and the system can benefit the level of engagement. Finally, Turunen et al. [23] obtained positive results for building a social and emotional human-computer relationship with a physically present interface agent, using spoken, conversational dialogues. Other results from a literature review in the field of virtual health coaching showed that speech-based virtual characters can improve user satisfaction and engagement with a computer system [24]. Both studies suggest a potential benefit of implementing a spoken conversational interface. Speech might improve engagement because it can make the human-computer interaction more natural and in this way improve the effects of coaching. However, it can also decrease trust, because it can create expectations of the system which it may not be able to fulfill.

Another area where speech can contribute is in the field of social companionship and peer support. Loneliness is a common problem in today's older population, and since it is closely associated with depression [25], it is important for the elderly not to feel lonely. The COUCH project contains a peer support character, who also takes advice from the coaches and is there to share his experiences with the user from the viewpoint of an equal friend. Additionally, there is a social coach who can help the user with tips and advice on leading a socially active life. These characters are already important in preventing loneliness among the elderly, but by implementing speech the characters may become even more effective as social companions. They can feel like a virtual friend when a more natural conversation can take place in a home setting, possibly decreasing the feeling of loneliness among the elderly. Next to holding conversations, virtual friends could read books or other long-form documents to the users, although the automatically generated voices might still sound somewhat robotic [26].

One last big advantage of implementing speech in an application for the elderly is that it can overcome barriers to accessing information. For many users, especially the elderly, a declining ability to read and type decreases the usability and ease of use of an application. Conversational interfaces can bridge this gap by allowing them to talk to the system [26]. In a more extreme version, conversational interfaces can avoid manual input methods like the keyboard and mouse entirely, making them a comfortable and efficient input method for people with physical disabilities and loss of function (such as the elderly) [27]. Additionally, younger adults and the elderly differ in the way they interact with technology, whereby the elderly face more difficulties interacting with computers. The literature shows that speech could be one of the most natural and effective modalities to overcome the elderly's problems related to their attitudes towards and experience with technology in general [28].

1.2.1 Problem Statement

Considering these promising advantages, especially for the elderly, this research aims to assess the benefits of implementing speech in the current e-health application of the Council of Coaches. It does not aim to develop a conversational interface that perfectly imitates natural human-to-human conversation, but instead to investigate whether the state of the art is developed enough to create a reliable and usable conversation system for a daily-life setting. Prior to actually building and experimenting, we need to assess the potential based on the literature, which is the topic of this report. It will attempt to find evidence that this is indeed an area of promise and that users experience additional advantages in using a speech-based COUCH implementation. This project is an addition to the work described in Chapter 5. This goal is formalized in the following research question:

To what extent can spoken interaction offer a valuable addition to the multi-party virtual Council of Coaches application?

This research question is very broad and cannot be answered by one single measure. In order to determine whether or not speech is of value to the Council of Coaches project, it is necessary to chart the current state of the art in conversational interfaces, together with its advantages and pitfalls. In addition, techniques and methods to overcome these pitfalls should be charted. To this end, the research question is divided into sub-questions:


SQ1: What are current limitations of the experienced human-computer interaction in the Council of Coaches application?

SQ2: How can these limitations be overcome by augmenting the COUCH interface with speech?

SQ3: What problems can arise along with the implementation of speech in a project like the Council of Coaches?

SQ4: What systems have been used in the past, and what were their advantages and disadvantages?

SQ5: What are the pitfalls in speech recognition systems used for the elderly, and are there any systems specifically designed for them?

1.3 Approach

In order to find answers to the questions posed in the previous section, this project starts with an extensive literature study in the fields of conversational interfaces, speech recognizers and virtual health coaching systems. Additionally, it investigates the current state of the art of the Council of Coaches project. Expert interviews with researchers of Roessingh Research and Development (RRD) will be conducted to gain more knowledge for answering the research questions. These interviews contain questions regarding the user experiences with the project, including usability issues that might be overcome by implementing speech, but also questions about the problems that come along with speech. The literature study and expert interviews are part of this research topics report and provide answers to SQ1, SQ4 and SQ5, and to some extent SQ2 and SQ3. Hereafter, a conversational interface will be implemented in the system framework of COUCH. This system will be put to the test in a user experiment in order to answer the main question, and the remainder of SQ2 and SQ3. This is out of scope for this research topics report, but will be addressed in the main graduation project, the continuation of this work.

1.4 Document Structure

Chapter 2: Conversational Interfaces

In this chapter, first an introduction to conversational interfaces is given, including all its related concepts. This is followed by an introduction to conversation mechanisms, and the chapter ends with a small review of the technologies involved in conversational interfaces and the accompanying challenges of implementing natural speech.

Chapter 3: Speech Recognizers

In this chapter, the state of the art of speech recognizers is explored. This covers closed source, mostly commercial, speech recognition systems and open source speech recognition systems. It describes the performance, availability, costs and supported languages of different systems in order to select a suitable system for the current project.

Chapter 4: Related Work in Virtual Coaching Systems

In this chapter, several virtual health coaching systems in practice are explored, including a description of the advantages and limitations experienced in these related works. This information can help to anticipate pitfalls before the implementation of speech.

Chapter 5: The Council of Coaches Platform

In this chapter, the Council of Coaches platform is explored. It includes a description of the coaching domains, the coaches and their coaching content and actions, the interface of COUCH, the dialogue style and structure, the user experiences, and the WOOL dialogue platform which has been developed for the COUCH project. This chapter also covers the expert interview.

Chapter 6: Conclusions

The last chapter provides a summary of the main findings of this research topics report and (partly) answers the sub-questions. It highlights the most important experiences with the current COUCH project, the chosen speech recognizer, and the main pitfalls and advantages of speech experienced in previous works.


Chapter 2

CONVERSATIONAL INTERFACES

In this chapter, an introduction to conversational interfaces and conversation mechanisms is provided first. The remainder of the chapter gives an overview of the technologies involved in conversational interfaces and the difficulties and technical challenges of implementing natural speech. Chapter 2 helps to answer sub-question 3 by examining the limitations of conversational interfaces.

2.1 An Introduction to Conversational Interfaces

Conversational interfaces enable people to interact with smart devices using spoken language in a natural way, just like engaging in a conversation with a person. A conversational interface is a user interface that uses different language components (see Section 2.3) to understand and generate human language, and in this way can mimic human conversations [2].

The concept of conversational interfaces is not very new. According to McTear, Callejas and Griol [2], it started around the 1960s with text-based dialogue systems for question answering. Somewhat later, around the 1980s, the concept of spoken dialogue systems became important within speech and language research. They mention that spoken dialogue systems (SDS) and voice user interfaces (VUI) are different terms for somewhat similar concepts (i.e. they use the same spoken language technologies for the development of interactive speech applications), although they differ in their purpose of deployment: SDSs have been developed for academic and industrial research, while VUIs have been developed in a commercial setting [2]. These systems are intelligent agents that use spoken interactions with a user to help them finish their tasks efficiently [29]. The commercial interfaces (VUIs) often do not take a dialogue structure, but instead use a spoken command-and-response interface. This means that an interaction with these systems starts with a request from the user, which is processed by the system; the system generates an answer and sends this answer back to the user. For ease of reading, I will use the term "spoken dialogue systems" in the remainder of this report to comprise both SDSs and VUIs.

More and more companies have incorporated SDSs in their devices to establish various kinds of Virtual Personal Assistants (VPAs) [30]. These VPAs are software agents that run on smartphones and other devices containing a built-in speaker [26]. Virtual Personal Assistants are also known by various names such as personal assistants, intelligent assistants, digital personal assistants, mobile assistants or voice assistants [2]. Examples of VPAs include Apple's Siri, Google's Assistant, Microsoft's Cortana, Amazon's Alexa, Samsung's S Voice, Facebook's M, Baidu's Duer, and Nuance Dragon. Some of these VPAs will be elaborated upon in Chapter 3.


VPAs are part of a larger group of digital agents which includes spoken dialogue systems, text-based agents and embodied conversational agents (ECAs) [2]. Embodied conversational agents are computer-generated characters with an embodiment, used to communicate with a user and providing a more human-like and more engaging interaction. ECAs can be used for user support in internet-based e-health applications [31], such as the COUCH project. The main benefit of ECAs is that they allow human-computer interaction in the most natural possible setting, namely with gestures, body expressions and speech to enable face-to-face communication with users [24]. ECAs are often used as virtual coaches in health-related systems. A few examples from the literature are described in Chapter 4.

2.2 Conversation Mechanisms

The main objective of a conversational interface is to support conversations between humans and machines. Understanding how human conversations are constructed is an important aspect in the development of a conversational interface. In general, participants take turns according to general conventions (turn­taking), they collaborate for mutual understanding (grounding), and they take measures to resolve misunderstandings (conversational repair) [2].

Turn-taking is a difficult concept to describe, but informally it may be described as "stretches of speech by one speaker bounded by that speaker's silence – that is, bounded either by a pause in the dialogue or speech by someone else" [32].

Grounding is the process of reaching mutual understanding between participants and keeping the conversation on track, for example by providing feedback or adding information [33]. In designing conversational interfaces it is important to understand how understanding can be achieved, but also how misunderstandings may arise and how to recover from communication problems.

Conversational repair is the process of repairing failures in conversations through various types of repair strategies, initiated by either the speaker or the interlocutor [2].

Some design issues for conversational interfaces come from the complexity of implementing these conversation mechanisms in the dialogue manager, explained in Section 2.3.3.

2.3 The Technologies in Conversational Interfaces

The major components of a dialogue system are automatic speech recognition (ASR), spoken language understanding (SLU), dialogue management (DM), and natural language generation (NLG) [34, 2]. The steps involved in a conversational interface are as follows:

1. The ASR component processes the words spoken by the user in order to recognize them.
2. The SLU component retrieves the user's intent from those words.
3. The DM tries to formulate a response or, if the information in the utterance is ambiguous or unclear, the DM may choose to query the user for clarification or confirmation.
4. The NLG component constructs the formulated response, if desired.
5. Text-to-speech synthesis is used to produce the spoken response.

An overview of a complete spoken language conversational interface can be found in Figure 2.1.


Figure 2.1: The components of a spoken language conversational interface. Figure constructed based on the steps described in [2]
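To make this flow concrete, the sketch below chains the five steps into one interaction turn. It is a minimal illustration only: every component body is a placeholder assumption, and none of the function names reflect the actual COUCH or WOOL implementation.

    # Minimal sketch of one turn through a spoken dialogue pipeline.
    # All components are illustrative stubs; a real system would plug in an
    # actual ASR engine, SLU model, dialogue manager, NLG module and TTS engine.

    def asr(audio: bytes) -> str:
        """Step 1: transform spoken input into a text transcription."""
        return "why do i need to exercise more"  # placeholder transcription

    def slu(transcript: str) -> dict:
        """Step 2: retrieve the user's intent from the transcribed words."""
        return {"intent": "ask_benefits", "topic": "physical_activity"}

    def dialogue_manager(interpretation: dict) -> str:
        """Step 3: decide on a response, or query the user for clarification."""
        if interpretation["intent"] == "unknown":
            return "clarify"
        return "explain_benefits"

    def nlg(action: str) -> str:
        """Step 4: construct the surface text of the chosen response."""
        responses = {
            "explain_benefits": "Regular exercise improves your fitness and mood.",
            "clarify": "Sorry, could you rephrase that?",
        }
        return responses[action]

    def tts(text: str) -> bytes:
        """Step 5: synthesize the response text into spoken output."""
        return text.encode("utf-8")  # stand-in for real audio synthesis

    def one_turn(audio: bytes) -> bytes:
        return tts(nlg(dialogue_manager(slu(asr(audio)))))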

2.3.1 Automatic Speech Recognition Systems

Automatic Speech Recognition systems are responsible for processing the speech input, i.e. they have to recognize what the user has said by transforming the speech into text [2]. An ASR uses three models: the acoustic model, the language model and the lexicon. The lexicon describes how words are pronounced phonetically. The acoustic model splits the audio input into small frames, and its responsibility is to predict which phoneme is being spoken in each frame of audio. The language model captures the probability of word sequences, and its job is to predict which words will follow the current words. Together, these models decode the audio input into a best-guess transcription of the words being spoken. Since there is a lot of variation in spoken input, it is difficult to accurately process every input. This input variation, together with other factors causing difficulties in ASR, is discussed in Section 2.4.
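The interplay of these three models is commonly summarized by the standard noisy-channel formulation of speech recognition; this textbook equation is added here for clarity and is not quoted from [2]. The recognizer searches for the word sequence W that best explains the observed audio X, where by Bayes' rule the likelihood P(X | W) comes from the acoustic model and the prior P(W) from the language model, with the lexicon constraining which phoneme sequences can form which words:

    \hat{W} = \operatorname*{arg\,max}_{W} P(W \mid X)
            = \operatorname*{arg\,max}_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}} \; \underbrace{P(W)}_{\text{language model}}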

There exist three types of ASR systems: (a) speaker dependent, (b) speaker independent, and (c) speaker adaptable. These systems differ in the amount of user training required prior to use. Speaker-dependent ASR requires training with the speaker's voice prior to use, while speaker-independent ASR does not: the latter systems are pre-trained during system development with data from many speakers. In general, speaker-dependent ASR systems achieve better accuracy than speaker-independent systems, although the latter still achieve relatively good accuracy when a user's speech characteristics fall within the range of the system's collected speech database. For example, a speaker-independent system trained with data from Western male voices in the age range of 20-40 should work for a random 30-year-old Western man, but not for an 85-year-old Asian woman. Speaker-adaptable ASR is similar to speaker-independent ASR, with the difference that it improves its accuracy by gradually adapting to the user's speech [35]. Since the primary goal of ASR research has been to create systems that can recognize spoken input from any speaker with a high degree of accuracy, and speaker-dependent systems cost a lot of time and effort, most commercial systems are speaker independent. However, as further explained in Section 2.4, elderly speech often does not fall within the range of collected speech used for the development of commercial systems. This may cause these ASRs not to work optimally for the target group of the current research. In that case, a group-dependent rather than a speaker-dependent approach could yield an ASR suitable for recognizing elderly speech. However, a lot of data from older speakers would need to be collected in order to train such an ASR, which is out of scope for this research.


2.3.2 Spoken Language Understanding

Spoken language understanding (SLU) involves the interpretation of the semantic meaning conveyed by the spoken input as transcribed by the ASR. Traditional SLU systems contain an ASR and a natural language understanding (NLU) component. The acoustic model of the ASR component splits the audio input into small frames, and the NLU component, in turn, assigns a semantic label to each of these segments. This means that the fragments of the sentences are labeled with, for example, noun, verb, determiner, preposition, adverb or adjective. The semantic meanings of the different fragments are used to understand the complete input sentence, which can then trigger the subsequent behavior in a human-computer conversation. Since SLU uses an ASR component, its performance largely depends on the ASR's ability to correctly process speech with a low word error rate (WER) [36]. The WER is the percentage of words that are missed or mistranslated by the ASR. Completely parsing the grammar of a sentence only works when the ASR is close to perfect, with a very low WER. Nevertheless, the NLU can be made more robust against a badly performing ASR by shaping it appropriately. In general, most practical conversational interfaces try to achieve this robustness by using sets of attribute-value pair representations to capture the information relevant to the application from the speech input [2]. The attributes are pre-defined concepts related to the field of the application, while the values are the specifications of those attributes. For example, when looking for a healthcare institution with the specification "a dentist close to my house", the institution 'type' and 'location' are the attributes, while 'dentist' and 'near' are the values. This approach is robust as long as the right keywords can be retrieved.
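A minimal sketch of this attribute-value approach is given below, mirroring the dentist example. The attribute names and keyword lists are hypothetical; a practical SLU component would use trained models and much larger vocabularies rather than exact keyword matching.

    # Keyword-based attribute-value extraction (illustrative only).
    # Attribute names and keyword lists are hypothetical.

    ATTRIBUTE_KEYWORDS = {
        "type": {"dentist": "dentist", "doctor": "general_practitioner"},
        "location": {"close": "near", "near": "near", "nearby": "near"},
    }

    def extract_attribute_values(utterance: str) -> dict:
        """Map recognized keywords onto pre-defined attribute-value pairs."""
        values = {}
        tokens = utterance.lower().split()
        for attribute, keywords in ATTRIBUTE_KEYWORDS.items():
            for token in tokens:
                if token in keywords:
                    values[attribute] = keywords[token]
        return values

    print(extract_attribute_values("I am looking for a dentist close to my house"))
    # -> {'type': 'dentist', 'location': 'near'}

As the text notes, this only remains robust as long as the right keywords survive the ASR step.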

2.3.3 Dialogue Management

Dialogue management (DM) concerns the fundamental task of deciding what action or response a system should take, based on the semantic meaning of the user input. This user input and output can be either a textual or a vocal response [37], with the former being the case for COUCH. It is an important part of the conversational interface design, given that this component contains all the dialogue structures and content. In addition, the DM is largely responsible for user satisfaction, because its actions directly affect the user. DMs can be used in an extensive variety of fields, such as business, education, government, healthcare and entertainment. The area of interest for COUCH is the dialogue frameworks developed by the interactive storytelling community. Based on these frameworks, COUCH developed its own dialogue framework, WOOL. This platform is described in more detail in Section 5.6. Each dialogue management tool depends on the specific characteristics of the dialogue system type it has been created for. These characteristics include the task, dialogue structure, domain, turns, length, initiative and interface. Additionally, dialogue management tools differ in the way they can be authored. With some tools the dialogue strategy is in the hands of the dialogue author, while in others it is in the hands of the programmer, because programming expertise is required to adjust some of the general built-in dialogue behaviors. By using a multiple-choice structure, dialogues can be decomposed into smaller dialogue units that can be changed and handled separately.

The approach to designing a dialogue manager is partly dictated by the purpose of the application, but three broad classes can be distinguished: task-oriented systems, conversational agents, and interactive question answering systems [38]. COUCH falls into the class of conversational agents. As already mentioned in Section 2.2, two frequent design issues within DMs for conversational agents are the interaction and confirmation strategies [2].

Interaction strategies determine who takes the initiative in the dialogue: the system or the user? There exist three types of interaction strategies: user-directed (the user leads the conversation), system-directed (the system leads the conversation by requesting user input) and mixed-initiative (both the user and the system can take the initiative), all having their own advantages and disadvantages [37]. The advantage of a user-directed strategy is that users can say whatever they want to the system, creating the feeling of a natural conversation. On the other hand, the system might be prone to errors because it cannot handle and understand all conversation topics. This problem can be overcome by using a system-directed approach: by constraining the user's input, fewer errors are made, because the user has to behave according to the system's expectations. At the same time, this creates a less natural experience. A middle way between the user- and system-directed strategies is the mixed-initiative strategy, where the system can guide the user but the user can additionally start new topics and ask questions. However, the user can then still say anything they want, requiring sophisticated ASR and SLU components. The COUCH project takes a mixed-initiative strategy. The user takes the initiative first by choosing one of the coaches to talk to. Then the coach starts the conversation, which the user can guide by choosing one of the provided options. This approach with text balloons keeps the input restricted: users cannot ask whatever they want, but they can make choices in the dialogue.

Confirmation strategies deal with uncertainties in spoken speech understanding. Two types of confirmation strategies exist: explicit confirmation (the system takes an additional conversation turn to explicitly ask for confirmation) and implicit confirmation (the system integrates part of the previous input in the next question, implicitly asking for confirmation). The disadvantage of the former is that the dialogue tends to get lengthy and the interaction less efficient. The latter is more robust to this problem, but can cause more interpretation errors when the user does not catch the implicit confirmation request. In the COUCH system there is no need for a confirmation strategy, since it is programmed to handle the restricted input; no speech or free text is involved.
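The difference between the two confirmation strategies can be made concrete with a small sketch. It is illustrative only (as noted above, COUCH itself needs neither), and the confidence threshold and phrasings are assumptions, not taken from any real system.

    # Illustrative explicit vs. implicit confirmation of an uncertain ASR result.
    # The threshold and wording are assumptions.

    CONFIDENCE_THRESHOLD = 0.7  # below this, spend an extra turn on confirmation

    def explicit_confirmation(slot: str, value: str) -> str:
        # A dedicated confirmation turn: lengthier dialogue, but safer.
        return f"Did you say that your {slot} is '{value}'?"

    def implicit_confirmation(value: str, next_question: str) -> str:
        # Confirmation folded into the next question: shorter, but riskier.
        return f"Okay, {value}. {next_question}"

    def respond(slot: str, value: str, confidence: float, next_question: str) -> str:
        if confidence < CONFIDENCE_THRESHOLD:
            return explicit_confirmation(slot, value)
        return implicit_confirmation(value, next_question)

    print(respond("goal", "walking 30 minutes a day", 0.55,
                  "How often do you walk at the moment?"))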

2.3.4 Response Generation

Response generation (RG) is the process that follows the DM's response decision. The conversational interface has to determine the content of the response and an appropriate method to express this content. The content can be in the form of words, or it can be accompanied by visual and other types of information. The simplest approach is to use predetermined responses to common questions. RG is commonly used in SDSs to present structured information (e.g. "Who is the king of the Netherlands?"). This involves translating the structured information retrieved from a database into a form suitable for spoken responses. RG is more complex for systems like the Google Assistant and less so for a system like COUCH, which has relatively simple response generation. In the COUCH application, most responses are pre-scripted, using simple lookup tables and template filling. A template is an abstract schema that narrows the coaching domains down to generic entities of interest and the information belonging to these entities [39]. These templates can be dynamically filled with information about the user or the interaction, taking the pre-determined goals and the progression into account. For example, "user synchronizes Fitbit steps, system gives feedback".
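Template filling of the kind described here can be sketched as follows; the template text and field names are hypothetical stand-ins for COUCH's actual pre-scripted content.

    # Illustrative template filling for a pre-scripted coach response.
    # The template string and field names are hypothetical.

    TEMPLATES = {
        "fitbit_feedback": "Well done, {name}! You walked {steps} steps today, "
                           "which is {progress}% of your daily goal.",
    }

    def fill_template(template_id: str, **fields) -> str:
        """Fill a pre-scripted template with user- and interaction-specific data."""
        return TEMPLATES[template_id].format(**fields)

    print(fill_template("fitbit_feedback", name="Anna", steps=6200, progress=78))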

2.3.5 Text­to­Speech Synthesis

Text-to-speech synthesis (TTS) transforms the words generated during response generation into spoken speech. TTS is closely related to ASR, since both systems need to work together accurately for a speech-based conversational interface to function effectively: the speech input that was processed to text needs to be transformed back into spoken system output that responds appropriately to the meaning of the user's input. TTS has improved in quality over the past years and is used in applications where messages cannot be prerecorded but have to be synthesized in the moment [2].


2.4 Limitations of Conversational Interfaces

Although ASR technology is useful in a wide range of applications, it is never likely to be 100% accurate. One big difference between written language and spoken language is that spoken language is much more spontaneous than written text: written text is grammatically correct, while spoken speech often is not. Complexities regarding the user characteristics of the elderly, conversation mechanisms (i.e. the processes of turn-taking, grounding and conversational repair), dialogue structure and speech input variations make the recognition of spoken language a complex process [32]. The limitations of conversational interfaces and their expected effects on the COUCH system are discussed in this section. Additionally, for every limitation a suggestion is given on how to deal with the problem.

2.4.1 Conversation Mechanisms

The first complexity of conversational interfaces lies in the conversation mechanisms. Compared to text-based applications, conversation mechanisms such as turn-taking, grounding and conversational repair are much more complex to implement. In the current COUCH project, users have to choose between several multiple-choice text-input options. This eliminates the need for any conversational repair and it simplifies grounding: the computer system always understands the user, and when the user does not understand the computer, they can ask for repetition or more clarification via one of the prewritten input options. When speech is added to this application, this process becomes more complex, since the ASR system can misunderstand or fail to identify the spoken input.

2.4.2 Naturalness of Speech

The second limitation of conversational interfaces relates to the naturalness of speech. COUCH takes a hierarchical dialogue structure (see Section 5.5 for a detailed description) where the dialogue follows a structured path based on the user input. When adding speech to this dialogue system, the choices in the hierarchy can still be made by the user, but with spoken input. The system should then be able to deal with many different input sentences. For example, when a user asks about the positive effects of physical activity, the question can be phrased like:

• "Can you tell, eeh ..., tell me about the positive effects of physical activity, please?"
• "Why do I need to exercise more?"
• "What are the advantages of exercising?"
• "I am not into, .. I mean, don't like to be physically active, so why should I?"

One way to handle the speech is by retaining the multiple-choice structure. In this way the ASR could "guess" which option was chosen by the user, based on the speech input. In the example above, the system has to understand that the option "positive effects" is chosen with each of the example inputs, and that it has to mention the advantages of being active. However, this approach is complex, since the system then has to understand every user input, including input that is not related to the coaching topics. An easier approach is to provide the user with a restricted list of input sentences, but this decreases the naturalness of the interaction. In both cases more dialogue is necessary and the dialogue structure will change, but to what extent depends on the approach.
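In its simplest form, such option "guessing" could be approximated by word-overlap scoring, as in the sketch below. The scripted option texts are hypothetical; a real system would need a trained intent classifier, since naive overlap fails exactly on paraphrases such as "Why do I need to exercise more?".

    # Naive word-overlap matching of an utterance to scripted options.
    # Option texts are illustrative; a trained intent classifier would
    # replace this in practice.

    import string

    OPTIONS = {
        "positive_effects": "tell me about the positive effects of physical activity",
        "set_goal": "help me set a physical activity goal",
        "not_now": "i do not want to talk about this right now",
    }

    def normalize(text: str) -> set:
        cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
        return set(cleaned.split())

    def match_option(utterance: str) -> str:
        """Pick the scripted option sharing the most words with the utterance."""
        words = normalize(utterance)
        return max(OPTIONS, key=lambda opt: len(words & normalize(OPTIONS[opt])))

    print(match_option("What are the positive effects of physical activity?"))
    # -> "positive_effects"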

2.4.3 Speech Input Variations

The third and another well-known problem of conversational interfaces is handling speech input variation. Variations make it hard for the NLU component to accurately interpret the semantics of natural language, causing the speech recognizer to incorrectly interpret the speech or not recognize the speech at all (i.e. increasing the WER). This variation may be due to several factors such as age, speaking style, accents, emotional state, tiredness, health state, environmental noise and microphone quality [40]. A few of these factors are considered important for the current project and are elaborated on below.

Environmental noise is one of the big challenges in ASR, since it interferes with the recognition of the user's voice. An experimental setup in a completely controlled laboratory environment can obtain very promising results for ASRs, while using the same system in a real-life home setting can substantially increase the WER. This holds especially in home settings with a lot of background noise, for example houses with big families, children running around, and music and television in the background. In general, elderly people's houses are less noisy than such family houses, reducing the risk of ASR performance deterioration due to noise. However, the elderly often suffer from hearing deterioration, which leads to more environmental noise from, for example, a loud radio or television. One simple method for conversational interfaces to deal with noise is to provide the user with feedback about the noisiness of the environment. Simple feedback messages like "Sorry, I cannot hear you, it is too loud" can dramatically improve the user's experience, because they show understanding of the issue [41].

Speech characteristics of the elderly differ from those of younger adults in multiple ways, causing the WER of ASRs to be significantly higher than for younger adult voices [42]. First, the elderly have to deal with natural ageing of the voice. The characteristics of an aged voice have been found to be less easily recognized by standard ASR systems, since these are often researched and designed for the majority of the population, using acoustic models trained with speech of young adult speakers [28, 40]. Second, the literature suggests that a large segment of the elderly population has experienced a past or present voice disorder [43]. People suffering from dysarthric speech or any other voice-related health problem tend to achieve lower ASR performance with commercial applications [35].

2.4.4 Expectations

The fourth limitation of a speech-based application is that it can raise users' expectations of the system [44]. The coaches from the council are not very smart, and for this reason they are designed as cartoon characters communicating via text balloons. When users can talk to the application, they might expect the system to understand everything they say, including topics that are not related to the coaching. Research suggests that especially the elderly often use everyday language and their own words to formulate commands, even when explicit instructions including the required input are given [45]. When the system does not understand all of this, users can experience more negative feelings towards it, in the worst case leading to avoidance of the system. In that case, the coaches are no longer able to maintain a long-term relationship with the users and cannot provide coaching anymore. If the characters are to be employed in a speech-based system, cartoon-like characters are recommended, because cartoon images lower customer expectations of the characters' skills to match the technical abilities of the system [46]. Thus, the problem of high expectations might be overcome by keeping the coaches as they are: somewhat "dumb" cartoon characters.

2.4.5 Long­term Engagement

The fifth limitation relates to the user's interest in the system. Although this limitation is not specific to speech-based systems but also applies to other technologies, it is important to consider that users might get bored when the system outputs exactly the same sentences multiple times. This can lead to interactions that become repetitive over a longer period of time, a problem mentioned in [47], for example. When using text balloons in the cartoonish version of COUCH, this repetitiveness might be less of a problem, because it is more like reading a comic strip. However, in the case of COUCH, too little content had been written, which already caused repetitiveness with the text-based application. This makes it hard to say whether a finished text-based version of COUCH would cause boredom among the users. An elaborate description of the user experiences with COUCH and the health effects it achieved is given in Section 5.7.

2.4.6 Privacy Issues

The last complexity of conversational interfaces pertains to privacy issues. With the rise of new technologies, the difficulty of safeguarding users' privacy increases, especially in the case of conversational interfaces. Ethical and legal questions arising here include what data is collected, who has access to it, how long and where the data is stored, and what such data is used for [48]. The COUCH system is used in the safety of the home, where private conversations regarding physical but also mental health take place. This data has to be stored in the system in order to construct personal coaching strategies.


Chapter 3

SPEECH RECOGNIZERS

With the rising interest in conversational interfaces and other speech-related applications, many ASRs have hit the market. Some of them are for commercial use, while others are made available to anyone interested in speech recognition. The open source systems can be improved by allowing users to train existing systems, obtaining more data in the process. This chapter attempts to give insights into speech recognition systems used in the past and thereby poses an answer to sub-question 4 and contributes to the answer to sub-question 5.

3.1 Closed Source Speech Recognizers

Most commercial speech recognition systems are closed source. The bigger companies like Apple and Google prefer to keep their advancements to themselves in order to stay first on the market, or they offer their software commercially. In this section I describe the existing closed source or paid speech recognition systems and their related in-home agents, which have already been mentioned briefly in Section 2.1.

Apple's Siri¹ was released in 2010 as an app for iOS, and in 2011 it was integrated into the iPhone 4S by Apple, making it the "oldest" commercially available voice assistant. The second available voice assistant was Microsoft's Cortana², released in 2013, followed by Amazon's Alexa³, which was launched in 2014 with its Echo-connected home speaker. Google was the last to announce the Google Assistant⁴, in 2016. This voice assistant came with its own home speaker and an embedding in the Google app, available for smartphones running on Android. Each of these famous virtual assistants comes with its own speech recognition technology, generally not available to the wider audience. Besides its Assistant, Google has released a web-based speech recognition API, in collaboration with Openstream. This JavaScript library covers both speech analysis and speech synthesis and allows developers to create voice-controlled Chrome add-ins and websites [49]. This is, to my knowledge, the only industrial cloud-based speech recognizer with a freely accessible API.

¹ https://support.apple.com/nl-nl/HT204389
² https://www.microsoft.com/en-us/cortana/
³ https://www.amazon.com/Amazon-Echo-And-Alexa-Devices/b?ie=UTF8&node=9818047011
⁴ https://assistant.google.com/

A previous study evaluated the performance of ASR systems for TV interactions in several domestic noise scenarios [50]. These noise scenarios were designed from in-home settings, such as the sound of a coffee machine in the background. A score calculating the similarity between the spoken input sentence and the resulting ASR output was used to evaluate the performance of each of the ASRs. Compared to the Microsoft Bing Speech API (used for Microsoft's Cortana) and the Nuance ASR, the Google ASR proved to be the most robust, with the highest recognition precision in general but also in noisy environments [50]. Other research supports the superiority of Google [51]. Audio recordings from different corpora of read speech were used to calculate the WER of three speech recognition systems. Results showed that Google outperformed Microsoft and the open source toolkit Sphinx-4 with a 9% WER, compared to 37% WER for Sphinx-4 and 18% WER for the Microsoft Speech API [51]. Especially the finding that the Google ASR was most robust to background noise in an in-home setting makes the Google API promising for the current research. Since Google's speech recognition web service is behind the Web Speech Recognition API, this recognizer offers promising results. However, it is not straightforward to use the Web Speech API for purposes other than user input [49]. For this reason, most use cases of the Web Speech API are very simple web-based applications, such as interfaces for making to-do lists [49]. An application like COUCH might be too complex to use this API. Additionally, as mentioned in Section 2.4, warranting privacy and security is an important issue for COUCH, and commercial cloud-based ASRs might not be the best choice in general because of their possibly strict user licenses. Another disadvantage of the Google Web Speech API is that it is underused and therefore lacks clear documentation. When searching for this API, you are often guided to the general Google Speech recognizer, which is paid software.

Another, less famous vendor of both cloud-based and local speech recognizers is Nuance Communications [52]. However, this software is not freely available and turned out to have the worst performance (in terms of WER) of the three compared systems: the Google ASR, Bing Speech and the Nuance ASR [50].

3.2 Open Source Speech Recognizers

There exist many open source speech recognizers, but not all of them are extensively researched in the literature or available in Dutch. Open source speech recognizers that are most present in the literature include the Hidden Markov Model Toolkit (HTK), CMU Sphinx, the Kaldi toolkit, and RWTH ASR. Although not extensively described in the literature, the NLSpraak toolkit is described in this section as well, because it is the only free toolkit explicitly supporting Dutch. Research evaluating speech recognizers on the Dutch language is not very common; for this reason, the results obtained for the HTK, CMU Sphinx and Kaldi toolkits are based on English speech. Only the results obtained for the NLSpraak toolkit are based on Dutch speech [53].

HTK [54] is one of the most popular open speech recognizers (based on hidden Markov models), consisting of a set of libraries and tools written in C. It provides good documentation. However, the time spent to set up, prepare, run and optimize the toolkit is very long, and it requires extensive knowledge compared to Sphinx and Kaldi [55]. Additionally, explicit support for the Dutch language cannot be found. Dutch might be an option when training the speech recognizer with Dutch data, but this is not the focus of this research.

The Sphinx toolkit [56] consists of several versions and packages for different tasks and applications, such as the versions Sphinx-2, Sphinx-3 and Sphinx-4 (the most common version), and the packages Pocketsphinx, Sphinxbase and Sphinxtrain. Sphinx is a hidden Markov model speech recognition system written in Java. It includes different tools for system analysis and also several example implementations of simple but state-of-the-art techniques. In their own words: "Sphinx-4 has permitted us to do some things very easily that have been traditionally difficult" (p. 9) [56]. However, in research using audio recordings from the field of telecommunications and the TIMIT corpus to calculate the WER for the Microsoft API, the Google API and Sphinx-4 [51], Sphinx-4 achieved a WER of 37%. This performance was much worse than that of the commercial Microsoft API (18% WER) and Google API (9% WER). Another problem making this toolkit unsuitable for the purpose of this research is the missing support for the Dutch language.

Kaldi [57] is a toolkit written in C++ that provides a system based on finite-state transducers. It compares itself with toolkits such as HTK and RWTH ASR, claiming more modern, flexible and cleanly structured code than HTK, and more open license terms than either HTK or RWTH ASR [57]. A study comparing open source speech recognition toolkits tested Kaldi, Sphinx and HTK in terms of performance (i.e. WER) and the time spent to set up, prepare, run and optimize the toolkits [55]. Kaldi outperformed the Sphinx and HTK toolkits in each of these respects, and even for non-experts the provided recipes and scripts make all these techniques usable in a short time. The traditional Kaldi toolkit does not support the Dutch language, but the NLSpraak toolkit, which is built upon Kaldi, does.

NLSpraak is a toolkit for Dutch speech recognition, built upon the Kaldi toolkit mentioned above. The NLSpraak toolkit performs quite well in terms of WER compared to other systems: research on recognizing broadcast news in 2016 resulted in a WER of 19.4% for the NLSpraak system, very close to the 17.8% WER obtained by the best system tested in 2008 [53]. It is important to take into account that, in general, recognizing news fragments tends to be easier than recognizing speech from, for example, a Skype conversation. This toolkit is freely available, including the necessary documentation, and it supports Dutch.

RWTH ASR [58] is a toolkit written in C++, developed at RWTH Aachen University. It implements state-of-the-art speech recognition technology with possibilities for speaker adaptation, and includes comprehensive documentation, example setups for training and recognition, and tutorials [58]. However, the system currently only supports Linux and Mac OS platforms, and it is a more complex toolkit compared to some of the others.

According to Gaida et al. [55], some of the other, less popular open source systems and toolkits are the Segmental Conditional Random Field Toolkit for Speech Recognition (SCARF) [59], SRI Decipher [60], Juicer [61], the SHoUT speech recognition toolkit, and Improved Atros (iATROS). These systems are not elaborated upon, because they are not as widespread, well documented or easily available as the previously mentioned systems [55]. Other open source systems are MIT's WAMI toolkit [62], the Bavieca ASR toolkit [63], and Julius [64]. However, not much information and documentation can be found about these systems either, so they are not considered for this project.

Given the main goal of this research, the ASR was chosen based on the following criteria: performance, ease of use, documentation, availability of the system and support for the Dutch language. Based on these criteria, the choice fell on the NLSpraak toolkit. This toolkit wins on ease of use, availability and language support. Little research exists on its performance, but since the toolkit is built upon the Kaldi toolkit, which has been shown to perform quite well, we can assume NLSpraak's performance to be solid for this project.
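To make this trade-off explicit, the choice can also be framed as a simple weighted scoring matrix. Note that both the weights and the 1–5 scores below are illustrative assumptions based on the literature discussed above, not measured values; the sketch only shows how such a comparison could be structured.

```python
# Hypothetical per-criterion scores (1-5); weights reflect this project's priorities.
criteria_weights = {"performance": 0.30, "ease_of_use": 0.20, "documentation": 0.15,
                    "availability": 0.15, "dutch_support": 0.20}

toolkit_scores = {
    "Sphinx-4": {"performance": 2, "ease_of_use": 3, "documentation": 4,
                 "availability": 5, "dutch_support": 1},
    "Kaldi":    {"performance": 5, "ease_of_use": 4, "documentation": 4,
                 "availability": 5, "dutch_support": 1},
    "NLSpraak": {"performance": 4, "ease_of_use": 4, "documentation": 4,
                 "availability": 5, "dutch_support": 5},
    "RWTH ASR": {"performance": 4, "ease_of_use": 2, "documentation": 4,
                 "availability": 3, "dutch_support": 1},
}

def weighted_score(scores: dict) -> float:
    """Sum of criterion scores multiplied by their weights."""
    return sum(criteria_weights[c] * s for c, s in scores.items())

for name, scores in sorted(toolkit_scores.items(),
                           key=lambda kv: -weighted_score(kv[1])):
    print(f"{name}: {weighted_score(scores):.2f}")
```

With any weighting that places substantial value on Dutch support, NLSpraak comes out on top, which matches the choice made here.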


Chapter 4

RELATED WORK

As described in Section 2.1, ECAs are computer-generated characters with an embodiment, used to achieve human-like and more engaging interactions. Several ECAs have already been developed as virtual coaching systems in the healthcare domain to assist users in making appropriate health-related decisions. The purpose of this chapter is to give an overview of these state-of-the-art virtual coaching systems and their observed advantages and disadvantages. This evaluation partly answers sub-question 5 by mapping pitfalls in existing conversational interfaces.

4.1 Kristina

KRISTINA [3] is a knowledge-based conversational agent that provides healthcare advice, assistance and social companionship for the elderly. Additionally, two other use cases of the KRISTINA agent have been designed to address the needs of two other types of users: care-providing personnel, and migrants unfamiliar with certain health and basic care issues and with the sanitary system of the host country [65]. The agent is composed of different modules that ensure multi-modal, dynamic conversations with users. The system includes an ASR component for speech recognition and a TTS component for the spoken surface output. For its non-verbal appearance, KRISTINA is realized as an ECA through a credible virtual character.

The application contains several functionalities. First, the user can choose a scenario from a predefined list, although not every scenario is available in every language. Based on this scenario, users can converse with the virtual character by speech, although a back-up option for typed text is provided in case the conversational agent does not understand the spoken input. The project underwent two iterations; the second iteration added features such as a larger array of topics to talk about and more depth within those topics. Examples of the characters from the two iterations are shown in Figure 4.1.
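KRISTINA's typed back-up option suggests a simple confidence-gated fallback pattern that is also relevant for a speech-based COUCH. The sketch below illustrates the idea; `asr.listen()` and `prompt_text_input()` are hypothetical stand-ins, as the actual KRISTINA components are not publicly documented at this level of detail.

```python
ASR_CONFIDENCE_THRESHOLD = 0.6  # assumed tuning parameter, not KRISTINA's value

def get_user_input(asr, prompt_text_input):
    """Try speech first; fall back to typed text when recognition is unreliable.

    `asr.listen()` is assumed to return (transcript, confidence) or None on
    failure; `prompt_text_input(message)` is assumed to collect typed text.
    """
    result = asr.listen()
    if result is not None:
        transcript, confidence = result
        if confidence >= ASR_CONFIDENCE_THRESHOLD:
            return transcript
    # Recognition failed or was too uncertain: ask the user to type instead.
    return prompt_text_input("Sorry, I did not catch that. Please type your message:")
```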

Figure 4.1: KRISTINA prototypes [3]: (a) prototype of the first iteration; (b) prototype of the second iteration. Captured from http://kristina-project.eu/en/


The final prototype provides a wide range of information as well as more advanced communication [65]. The system is able to generate proactive responses and dialogues for more everyday-like conversations. The agent can ask for a direct or indirect clarification when it detects more than one topic relevant to the user's input. KRISTINA scored high on points such as trustworthiness, friendliness and professionalism. On the other hand, a negative point of KRISTINA was her rigid, depersonalized and uncompassionate character.

The design and behavior of the KRISTINA agent were improved in the last prototype to match the application's content and scenarios and make the agent look more natural [65]. However, the gestures and facial expressions were considered too rigid and for this reason did not evoke empathy. The same holds for the voice, which was considered monotonous. During the user evaluations, the ASR could not be tested sufficiently by all subjects due to technical problems. One last major issue is the system latency, which was perceived as too long for a natural dialogue. These problems need to be tackled in future systems, together with the processing of personal data, personal identification features, and personal information; data protection measures are important for addressing the concerns of the users. Both these strengths and weaknesses should be considered in the development of a speech-based COUCH application.

4.2 Meditation Coach

The Meditation Coach [4, 66] is an ECA developed to guide users through a mindfulness meditation session and help them relax. A picture of the meditation coach is shown in Figure 4.2. The coach is made interactive by recording and processing data from a breathing sensor in the dialogue system. Participants felt the system was more effective than a videotaped meditation instructor in reducing anxiety. This finding was supported by the significantly stronger respiration regulation, as measured by their respiration rate during meditation. The virtual coach was significantly more interactive and adaptive to their breathing, and the users appreciated that it afforded tailored feedback. The virtual coach inhaled and exhaled in the rhythm of the participant's breath and provided feedback about the pace of the breathing (e.g. "continue breathing at a slower pace"). These personal breathing instructions were based on the participant's measured breath duration and breathing rate. These results indicate that a coach embodied as a conversational agent can achieve the coaching goal more effectively than a non-embodied conversational character.
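The pace feedback can be thought of as a simple mapping from the measured respiration rate to a spoken cue. The following sketch illustrates that idea; the target rate and tolerance band are assumed values, not the ones used by the actual Meditation Coach.

```python
def breathing_feedback(breaths_per_minute: float, target_bpm: float = 6.0,
                       tolerance: float = 1.0) -> str:
    """Map a measured respiration rate to a spoken coaching cue.

    The ~6 breaths/min target and the tolerance band are illustrative
    assumptions for this sketch.
    """
    if breaths_per_minute > target_bpm + tolerance:
        return "Continue breathing at a slower pace."
    if breaths_per_minute < target_bpm - tolerance:
        return "You may breathe a little faster."
    return "Good, keep breathing at this pace."

# Example: 20 breaths measured over 2.5 minutes -> 8 breaths/min -> slow-down cue.
print(breathing_feedback(20 / 2.5))
```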

Nevertheless, participants were significantly more satisfied watching a video of a human meditation instructor. A major source of displeasure with the virtual coach was its lack of human-like features, including its synthesized voice. A human voice is often preferred over a synthesized voice, and this is even more important for meditation applications. Limitations of the study include the lack of evaluation in a real context over an extended period of time, and short meditation sessions that might not be effective for more advanced users.


Figure 4.2: The Meditation Agent. Figure reproduced from [4]

4.3 Exercise Advisor

The exercise advisor [47, 67] is part of the FitTrack system, which has been developed to investigate the ability of relational agents to establish and maintain long-term, social-emotional relationships with users, and to determine whether these relationships could be used to increase the efficacy of health behavior change programs. The relational agent plays the role of an exercise advisor that participants talk to about their physical activity. It is designed to be used on home computers on a daily basis. The agent uses synthesized speech and synchronized nonverbal behavior, but the user contributions to the dialogue rely on selecting items from multiple-choice menus, dynamically updated based on the conversational context.
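Conceptually, such a dynamically updated menu is a mapping from the current dialogue state to the set of valid user utterances. The sketch below illustrates this with a few hypothetical states and options; the real FitTrack menu logic is not publicly specified.

```python
# Hypothetical dialogue states and menu options, purely for illustration.
MENU_OPTIONS = {
    "greeting":       ["Hi, how are you?",
                       "I'd like to talk about my exercise."],
    "exercise_check": ["I walked today.",
                       "I didn't get around to exercising.",
                       "Can you suggest an exercise?"],
    "goal_setting":   ["That goal sounds good.",
                       "That is too ambitious for me.",
                       "Can we set a smaller goal?"],
}

def present_menu(dialogue_state: str) -> str:
    """Show the options valid in the current state and return the user's choice."""
    options = MENU_OPTIONS[dialogue_state]
    for i, option in enumerate(options, start=1):
        print(f"{i}. {option}")
    choice = int(input("Select an option: "))
    return options[choice - 1]
```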

Two different versions of the exercise agent exist: a first system designed to work as a general exercise coach (see Figure 4.3a), and a second system adapted for the elderly population (see Figure 4.3b). This second system was designed to be easy to use, with a very consistent and intuitive user interface and an enlarged display area to accommodate visual impairments. It also contained an additional self-monitoring graph and an educational content page, temporarily replacing the ECA.

(a) Initial FitTrack interface with exercise advisor. Figure reproduced from [67]

(b) New FitTrack interface with exercise advisor. Figure reproduced from [47]

Figure 4.3: FitTrack interfaces

Both evaluation studies of the exercise advisor system demonstrated the acceptance and usability of a relational agent by the elderly. Participants found interacting with the agent to be relatively natural, and this had a positive impact on the users' perceived relationship with the agent.


Lastly, the studies were successful in demonstrating the efficacy of the agent in motivating users to walk more. Although both studies [47, 67] showed promising results for achieving health goals with ECAs, some contradictory findings emerged. In the initial research [67], it was stated that deploying conversational interfaces does not imply that natural language understanding must be used. The dynamic menu-based approach used in the FitTrack system provided many of the benefits of a natural language interface, such as naturalness and ease of use. As an additional advantage, the system does not need to rely on error-prone understanding of unconstrained speech input. However, in the follow-up research [47], participants mentioned that they could not express themselves completely using the constrained, multiple-choice interaction. When asked, participants universally said they would have preferred speaking to the agent rather than using a touch screen. Although replacing the touch screen with ASR would have been welcomed by all participants in this latter study, Bickmore et al. [47] mentioned that, as future work, available systems first need to be thoroughly evaluated to ensure that they can provide high enough reliability given the variability of and differences in voice quality in older adults. This is necessary so that users can engage in richer conversations and express themselves more freely while maintaining the ease of use of the multiple-choice input modality. This challenging undertaking will be the focus of the current research.

4.4 In­home social support agent

The in-home social support agent [5] is a remote-controlled companion agent used in the homes of the elderly. The research used a remote Wizard of Oz system (see Figure 4.4), controlled by a research assistant, who responded by choosing preselected utterances and animation commands, or by manually typing utterances that were transmitted to the agent for real-time synthesis and animation. With this approach, the researchers found high levels of acceptance of and satisfaction with the in-home social support agent, with many participants stating that it felt like a social companion. The elderly would like to tell stories and discuss the weather, their family and their future plans with a virtual companion. Participants spent most of their time on storytelling, indicating that this would be valued and utilized by the elderly. Additionally, it was found that important discussion topics for elderly support include activity planning, attitudes towards aging, and social ties, which require nuanced dialogues. These dialogue topics are in line with the topics included in the COUCH project, although the COUCH project does not support any storytelling due to its text-based limitations. When implementing speech in the COUCH application, this storytelling and small talk should be considered in order to build a relationship with the coaches.
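The wizard interface described here boils down to two kinds of control messages: a preselected utterance with an animation command, or free text to be synthesized on arrival. The sketch below illustrates one possible message format; the actual protocol of the in-home agent [5] is not published, so the field names are assumptions.

```python
import json

def preselected_message(utterance_id: str, animation: str) -> str:
    """Wizard picks a canned utterance plus an animation command."""
    return json.dumps({"type": "preselected",
                       "utterance_id": utterance_id,
                       "animation": animation})

def free_text_message(text: str) -> str:
    """Wizard types a free-form utterance, synthesized on arrival at the agent."""
    return json.dumps({"type": "free_text", "text": text})

print(preselected_message("greet_morning", "wave"))
print(free_text_message("That sounds like a lovely story, tell me more!"))
```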

Figure 4.4: Wizard-Agent setup for the in-home social support agent for the elderly. Figure reproduced from [5]


Chapter 5

THE COUNCIL OF COACHES PLATFORM

In this chapter, the Council of Coaches application is elaborated upon. This is necessary for this research because the implementation of speech will change some fundamental parts of the COUCH application. It is important to understand the background of the coaching, since more dialogues will have to be added. Furthermore, adding speech will change the dialogue structure and interaction patterns, and the WOOL dialogue platform is going to be used for the changes in the dialogues. First, an explanation is given of the theory behind and the design of the coaches, their background and their coaching domains. This is followed by a step-by-step walkthrough of the application. Then, the motivation for the dialogue style is explained, followed by the dialogue structure, including the coaching actions and content. In order to gain more insight into current user experiences with the application and its strengths and limitations, an expert interview was conducted with a researcher from RRD. The objective of this interview and the results obtained from it are described in Section 5.7. The last part of the chapter explores the WOOL platform, a dialogue management tool developed during the project to easily manage dialogues. This chapter contains information that contributes to the answers to sub-questions 2 and 3.

More information about the COUCH application can be found in the COUCH deliverables.

5.1 The Coaching Domain

The COUCH project started by identifying all coaching domains that could potentially be of interest for the application. Four different project sources were used, from which all mentioned domains were summarized, resulting in 28 potential domains. Since these domains partly overlapped, the overlap between very similar domains from the four project sources was first removed, leading to the reduced set of 10 domains given below:

1. Physical Domain
2. Cognitive Domain
3. Mental Domain
4. Social Domain
5. Quality of Life Domain
6. Nutrition Domain
7. Smoking Cessation Domain
8. Alcohol Consumption Domain
9. Chronic Pain Domain
10. Diabetes Domain


Since 10 domains were still too many to cover in the project, the set of coaches to be included was narrowed down further. The next section describes the final list of coaches, together with their background and design.

5.2 The Coaches

One of the objectives of COUCH was to develop the coaches as interesting characters, each with their own characteristics and backstory. The definition of these backstories and characteristics was primarily a creative process, driven by the feedback received in various end-user evaluations. This resulted in the set of coaches shown in Table 5.1. More information about the coaches' backstories can be found in deliverable [6], which gives an elaborate description of each coach, including their height, weight, place of birth, likes and dislikes, backstory, role, pointers for dialogue writing and coach selection blurb.

Role                     Name                Nationality   Gender   Age
Physical Activity Coach  Olivia Simons       Dutch         Female   52
Nutrition Coach          François Dubois     French        Male     45
Social Coach             Emma Li             American      Female   28
Cognitive Coach          Helen Jones         British       Female   64
Peer Support             Carlos Silva        Portuguese    Male     67
Chronic Pain Coach       Rasmus Johansen     Danish        Male     33
Diabetes Coach           Katarzyna Kowalska  Polish        Female   45

Table 5.1: The seven coaches from the Council

5.3 The Council of Coaches Interface

To start using the application, the user has to create an account, so that his preferences and information can be stored and dialogues can be personalized. This allows the user to 'enter' the living room from the main menu, which is shown in Figure 5.1. After pressing the 'create account' button, Coda guides the user through all the necessary steps (see Figure 5.2). Coda is the equivalent of an in-app 'menu' and is designed to help with the more technical functions, such as creating an account, logging in or out, and changing settings. The account creation procedure includes demographic questions about gender, name, age, educational level, technology affinity, and health literacy. After these demographic questions, the user is presented with an optional health intake. This intake can automatically recommend the appropriate coaches and includes a diabetes type 2 diagnosis assessment, a chronic pain diagnosis assessment, and self-assessment questions regarding the physical activity, nutrition, cognition and social domains.

After completing the account creation and health intake, the user is provided with a selection screen and a mechanism for selecting his coaches. As shown in Figure 5.3, the user is given a recommendation of coaches based on his health intake. However, the final choice of which coaches to select is up to the user. The only exceptions are Rasmus and Katarzyna (the chronic pain and diabetes coaches, respectively). These coaches can only be selected by the user when the answers to the related diagnosis questions indeed indicate the corresponding condition.
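Conceptually, the intake-based recommendation is a set of rules mapping intake answers to coaches, with the chronic pain and diabetes coaches gated by the diagnosis assessments. The sketch below illustrates this idea; the field names and rules are hypothetical and do not reproduce the actual COUCH intake logic.

```python
def recommend_coaches(intake: dict) -> list:
    """Map (hypothetical) intake answers to a list of recommended coaches."""
    recommended = []
    if intake.get("physical_activity_low"):
        recommended.append("Olivia")       # physical activity coach
    if intake.get("nutrition_concerns"):
        recommended.append("François")     # nutrition coach
    if intake.get("socially_isolated"):
        recommended.append("Emma")         # social coach
    if intake.get("cognitive_complaints"):
        recommended.append("Helen")        # cognitive coach
    # The chronic pain and diabetes coaches are only offered when the
    # diagnosis assessment indicates the corresponding condition.
    if intake.get("chronic_pain_diagnosis"):
        recommended.append("Rasmus")
    if intake.get("diabetes_type2_diagnosis"):
        recommended.append("Katarzyna")
    return recommended

print(recommend_coaches({"physical_activity_low": True,
                         "diabetes_type2_diagnosis": True}))
```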


Figure 5.1: Council of Coaches Main Menu screen. Figure reproduced from [1]

Figure 5.2: First screen of the account creation process, guided by Coda. Figure reproduced from [1]

After the user has selected his preferred coaches, he arrives in the COUCH living room, which has already been shown in Figure 1.1. There the user can click on any coach in order to start a dialogue and receive personal coaching. The user is always guided by Coda and can fall back on him in case of questions. In the living room, two widgets can be found: a radio to add some background music, and a physical activity book widget. This latter widget (shown in Figure 5.4) was introduced to enable the user to interact with the coach and at the same time receive additional information about his activity level.
