A framework for developing rule-based stateful goal-driven dialogue systems


Faculty of Science

Author

Gerben van der Huizen

gerbenvanderhuizen@gmail.com 10460748

University supervisor

Ana Oprescu

Contact person

Frank Smit

Host organisation

OBI4wan

August, 2018

95 pages

Master thesis, MSc Software Engineering

Science Park 904, 1098 XH Amsterdam, The Netherlands


Abstract

The dialogue management component of a dialogue system decides, based on user input, what kind of response will be returned. State-of-the-art approaches to dialogue management are mostly AI-based, in which parameters for making decisions are derived directly from real dialogue data. Although this approach is versatile in handling different types of conversations, its advantages can prove to be a weakness in an industry which prioritises predictability. A rule-based approach, which both gives developers full control and lowers the amount of data required, could be better suited for completing the goal-oriented conversations which are prevalent in, for instance, the customer service domain. This thesis presents a framework for building rule-based goal-oriented dialogue systems with dialogue management functionality completely controlled by predefined rules and constraints. The design of the framework is supported by a set of requirements elicited from stakeholders from the dialogue system development domain. During the research, two of the architectural components proposed by the framework were further designed and integrated in a rule-based dialogue system development platform as a case study. The results of the integration demonstrate how these features can be designed to be compatible with a rule-based approach and still maintain the predictability and data-indifference principles.

Keywords

Dialogue system, Rule-based, Goal-oriented, Customer service, Software architecture, Agent development, Conversational agent


3.1.1 General requirements
3.1.2 Front-end
3.1.3 DS engine
3.1.4 Rest API
3.1.5 Multi-processing
3.1.6 Database
3.1.7 DevOps
3.2 Design motivation
3.2.1 Feature comparison
3.2.2 Feature design
4 Process meta-model
4.1 A common language
4.2 Meta-model design
4.3 Meta-model instantiation
4.4 Summary of findings
5 State-tracker
5.1 Git-based state-tracker design
5.2 Committing to a state-tracker branch
5.3 Branching and merging states
5.3.1 Creating and updating state branches
5.3.2 Merging state branches
6 Case study rule-based dialogue systems: OBI4wan’s RuDSDev
6.1 DS developer front-end
6.1.1 Front-end tasks
6.2 DS engine
6.2.1 NLU and NLG
6.2.2 Rules and inferencer
6.3 DevOps
6.4 Back-end and REST-API
6.5 Processing
6.6 OBI4wan’s RuDSs
6.6.1 Fallback
6.6.2 Completing goal-oriented conversations without state tracking
7 State-tracker evaluation
7.1 Evaluation strategy
7.2 Example customer assistance conversations
7.2.1 Retain state example category
7.2.2 State branching example category
7.2.3 State merging example category
7.3 The test framework
7.3.1 Test elements
7.3.2 Test and test step Hoare triples
7.3.3 Test example description
7.4 Soundness of the experiments
7.4.1 Motivation for using Hoare triples
7.4.2 Program unit behaviour as Hoare triple functions
7.4.3 Hoare triples for reasoning about higher-order stores
7.5 The evaluation RuDS
7.6 State-tracker experiment descriptions
7.6.1 State retaining experiment
7.6.2 State branching experiment
7.6.3 State merging experiment
7.7 Experiment execution and results
8 Discussion
8.1 Value of the proposed software architecture
8.1.1 Requirements elicitation threats to validity
8.2 The value of the State-tracker
8.2.1 Evaluation threats to validity
8.3 Process meta-model preliminary evaluation
9 Related work
9.1 Dialogue systems
9.2 State tracking
9.3 Requirements engineering
10 Conclusion
11 Future work
Acknowledgements
Bibliography
A Stakeholder profiles
B Interview protocol
B.1 Interview introduction
B.2 Interview script for experienced developers
B.3 Interview script for novice developers
C Interview example transcripts
C.1 Interview transcript: DS developer
C.2 Interview transcript: User assistance specialist
D Functional and non-functional requirements
E Dutch transcripts of the example conversations
E.1 Fallback example
E.2 Retain state example
E.3 Branching state example 1
E.4 Branching state example 2


Chapter 1

Introduction

In recent years the market of virtual assistants for commercial purposes has been growing fast due to the progress made in the field of Artificial Intelligence (AI) [44]. An increasing percentage of companies are willing to invest in dialogue systems such as chat-bots that can perform tasks in a customer service environment by means of a conversation [26,42,16]. However, state-of-the-art developments in the field of dialogue systems focus mostly on researching end-to-end solutions (i.e. the black-box approach) for dialogue management [44], while companies prefer to analyse and control the decisions this kind of agent makes during a conversation (i.e. the white-box approach)¹.

In general, dialogue systems or conversational agents have two essential software components which together manage how a conversation progresses [29]. One of these components is called the interpreter, which converts user messages to information a dialogue system can understand. Interpreters use Natural Language Understanding (NLU) techniques to detect important words and classify the intent of the user. The other component can be referred to as the policy-maker or inferencer of the dialogue system [8], which decides what action or response should be executed based on information interpreted from user input. White-box and black-box approaches are two generic categories under which we can classify most of the existing algorithms and techniques applied to dialogue management.
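The division of labour between the two components can be illustrated with a minimal sketch. The keyword-matching interpreter below only stands in for a real NLU model, and the intent names, keywords and responses are hypothetical; none of them are taken from an existing system.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword lists standing in for a trained NLU model.
INTENT_KEYWORDS = {
    "flight_status": {"delayed", "late", "departure"},
    "parking": {"parking", "park"},
}

# Hypothetical rule base: each intent maps to exactly one response.
RULES = {
    "flight_status": "Which flight number are you asking about?",
    "parking": "Parking is available next to the terminal.",
}

FALLBACK = "Sorry, I did not understand that."


@dataclass
class Interpretation:
    intent: Optional[str]
    tokens: list


def interpret(message: str) -> Interpretation:
    """Interpreter: convert raw text into an intent the system understands."""
    tokens = [t.strip("?.,!").lower() for t in message.split()]
    for intent, keywords in INTENT_KEYWORDS.items():
        if keywords.intersection(tokens):
            return Interpretation(intent, tokens)
    return Interpretation(None, tokens)


def infer(interpretation: Interpretation) -> str:
    """Inferencer: choose the response based on the interpreted intent."""
    return RULES.get(interpretation.intent, FALLBACK)
```

Calling `infer(interpret(message))` runs one user message through both components. Because the response is looked up in a fixed table, the same interpreted intent always yields the same response.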

Rasa is an example of a black-box approach dialogue system, which consists of the Rasa-Core and Rasa-NLU components [8]. The Rasa-Core component decides which activity the system will perform next (i.e. the policy-maker or inferencer), and Rasa-NLU interprets the messages from the user (i.e. the interpreter). Rasa-Core’s probabilistic model can be trained in various ways using supervised learning (e.g. with neural networks). The decision making of the system is entirely based on the data it is trained on, and can be altered through interactive learning sessions. A significant amount of data is needed to train AI-based inferencers such as Rasa-Core, and simply generating this data is difficult, because real conversations do not follow a specific grid or structure. As a consequence, it is likely the data will have to be collected from real conversations, which requires a lot of time and effort. Another disadvantage is the loss of control over the dialogue management part of the system, because the AI determines what activities will be performed by the system. This loss of control decreases the predictability of the system both internally for developers and externally for users, because whether or not a dialogue system performs a certain action is based on a confidence score calculated by an AI algorithm. Maintaining control over the decisions of a dialogue system is important in some domains (e.g. customer service), because the extra control helps to avoid damaging the brand of a platform through rogue agents or second-class service [41,29]. A white-box approach to dialogue management, which allows for setting constraints or rules on the decision making of a dialogue system, might be a better alternative in cases where predictability and control are important factors.

An example of a white-box approach is the rule-based dialogue system described in Webb (2000) [58]. The paper argues that the ability to customise rule-sets for the inferencer gives developers more control over the dialogue management of conversational agents. Webb (2000) explains that the rules themselves can be used to tailor a dialogue system to work in different domains and that they do not require conversational data to manage agent decisions or actions. Furthermore, the rule-based approach enables dialogue system developers to design and build the dialogue management of an agent based on their domain knowledge. However, the developed rules need to be robust against unexpected input and manipulation. According to Xu and Seneff (2010), rule-based systems are typically good at completing goal-oriented conversations such as question and answer (Q&A) sessions [60]. When the correct information is interpreted by the NLU component, a rule-based dialogue system can quite easily satisfy user goals by inferring which steps need to be taken after each user response. We argue that it is more beneficial to implement a rule-based approach when tailoring dialogue systems to the environment and context of goal-oriented conversations.

Platforms such as Rasa provide the tools and instructions to develop dialogue systems. The research papers and documentation concerning both Rasa-Core and Rasa-NLU describe all of the software components involved with dialogue system development, dialogue management and NLU for Rasa dialogue systems. Our study of the literature on rule-based approaches indicates that a similar framework for developing rule-based goal-oriented dialogue systems has not yet been designed or documented. Although a rule-based dialogue system development framework would contain software components similar to those of the framework described by Rasa (e.g. dialogue management and NLU components), the rule-based components will be inherently different. For example, a rule-based inferencer component would require a knowledge-base containing rules or constraints instead of an AI-based model trained on conversational data. Furthermore, a rule-based development framework could either omit software components needed for an AI-based approach or include entirely different software components.

1.1 Research question

The main goal of this study was to find out how to design a rule-based approach for developing dialogue systems that can effectively complete goal-oriented conversations. The dialogue systems developed with the approach should have predictable dialogue management, which does not need to be trained with conversational data. Moreover, we aimed to design rule-based solutions for an existing development platform as a case study, to show how certain state-of-the-art software components can be designed to be compatible with our rule-based development framework. We argue that specifically in the domain of customer service and user assistance bots, where conversational agents only have to perform a few specific tasks, a rule-based dialogue system will be able to perform its task to the required standard and provide the required control and predictability. Therefore, the research question addressed in this work is the following:

RQ: How to design a framework for building rule-based goal-oriented dialogue systems?

Before we could answer the main research question, we first needed to establish the requirements and domain for a development platform for building rule-based goal-oriented dialogue systems. Next, based on the established requirements, we could determine what kind of features or components are needed to realise a system for building dialogue systems. Furthermore, state-of-the-art research for developing the established features was likely to suggest AI-based solutions, so the final step was to find out how these features can be adapted to be compatible with a rule-based approach and thereby maintain predictability. Therefore, RQ expands into the following sub-questions:

RQ a: What are the key requirements for a platform for developing rule-based goal-oriented dialogue systems?

RQ b: What kind of features or components could satisfy the established requirements?

RQ c: How can we design these features to be compatible with the rule-based approach, and thus maintain predictability and data-indifference for dialogue management?

1.2 Terminology

Throughout the rest of the report we will use specific acronyms to refer to dialogue system related terms. For the terms dialogue system and conversational agent we will use the initialism DS, and the acronym RuDS is used to refer to a rule-based goal-oriented dialogue system. For a dialogue system development platform we will use the acronym DSDev and RuDSDev refers to the rule-based goal-oriented variant of a development platform.

A conversation between user and DS consists of multiple turns, each a single contribution to the entire dialogue. The dialogue can be seen as a game in which I take a turn, then you take a turn, then me, and so on. A dialogue or conversation turn consists of one of the involved parties responding with a sentence or a few words. An instance where both the user and DS have completed a turn will subsequently be referred to as user dialogue system turn or UDST, and a single dialogue utterance as turn or conversation turn. An entire conversation between user and DS is referred to as a UDS conversation.
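The terms above can be made concrete with a small data-structure sketch. The `Turn` class and the `group_into_udsts` helper are hypothetical names introduced for illustration, and the sketch assumes that a UDST always starts with the user turn; the example utterances are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str  # "user" or "ds"
    utterance: str


def group_into_udsts(turns: list[Turn]) -> list[tuple[Turn, Turn]]:
    """Pair each user turn with the DS turn that directly follows it."""
    udsts = []
    for current, following in zip(turns, turns[1:]):
        if current.speaker == "user" and following.speaker == "ds":
            udsts.append((current, following))
    return udsts


# A UDS conversation is the full ordered list of turns.
conversation = [
    Turn("user", "Is my flight delayed?"),
    Turn("ds", "Which flight number are you asking about?"),
    Turn("user", "KL1234"),
    Turn("ds", "Flight KL1234 is on time."),
]
```

In this example the UDS conversation contains four turns, which `group_into_udsts(conversation)` groups into two UDSTs.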

RuDSs have a limited number of possible utterances they can use to communicate with humans. Moreover, whether or not an utterance can be used often depends on how the rules of a RuDS have been designed. We refer to this structure or design as the conversation flow. The conversation flow of a RuDS determines, through various constraints, actions and utterances, what options are available to the RuDS and the human user to complete a dialogue.
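One way to picture a conversation flow is as a lookup table from the current point in the dialogue to the options that are available there. The sketch below is an illustrative assumption rather than the representation used by any particular platform; the state names, intents and utterances are hypothetical.

```python
# Hypothetical flow: state -> {user intent: (DS utterance, next state)}.
FLOW = {
    "start": {
        "ask_flight_status": ("Which flight number?", "awaiting_flight_number"),
        "ask_parking": ("Parking is next to the terminal.", "start"),
    },
    "awaiting_flight_number": {
        "give_flight_number": ("Let me look that up.", "start"),
    },
}

# Utterance used when the flow has no rule for the intent in this state.
NO_RULE = "Sorry, I cannot help with that right now."


def available_intents(state: str) -> list[str]:
    """Return the user intents the flow accepts in the given state."""
    return sorted(FLOW[state])


def step(state: str, intent: str) -> tuple[str, str]:
    """Apply one rule; an intent without a rule leaves the state unchanged."""
    return FLOW[state].get(intent, (NO_RULE, state))
```

The set of utterances the RuDS can produce is exactly the set written into the table, which is what makes its behaviour predictable for the developer.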


1.3 Approach

The research was conducted in three phases or dimensions: the design or requirements engineering phase, the development phase and the evaluation phase.

In the design phase, requirements (i.e. user needs and concerns) were elicited and analysed for a RuDSDev. Requirement elicitation took place at a company specialised in building RuDSs for customer service assistance. Based on the established requirements we designed a software architecture for a RuDSDev, which allows DS developers to build RuDSs. Furthermore, two of the proposed architectural components were designed in detail, to function within a rule-based approach. These particular components were chosen to be further developed due to their significance in literature and how they were valued and prioritised by the stakeholders.

In the development phase, the architectural components designed in the previous phase were built. Any software components developed within this phase were thoroughly tested before moving on to the evaluation phase.

In the evaluation phase, we performed experiments to determine if the developed architectural components for a RuDSDev allow it to better satisfy the established requirements. The evaluation aims to prove that the architectural components can be used to develop effective RuDSs and improve the rule-building process while maintaining a rule-based white-box approach.

1.4 Contributions

This research presents the following contributions:

1. A list of requirements for a RuDSDev, which were established based on information elicited from DS developers, system developers and DS product owners with less development experience.

2. A framework for a RuDSDev, which enables DS developers to build RuDSs which excel at completing dialogues by backtracking through a rule-base. The architecture serves as a high level guideline for how to structure and design a platform for developing these types of DSs.

3. The design and evaluation of a rule-based solution for tracking the context of a conversation as the dialogue progresses.

4. The design of a high level model-based language for enforcing more standardised and modular design of RuDSs.

1.5 Thesis outline

The content of the thesis is organised as follows: firstly, Chapter 2 describes our requirements engineering efforts during the design phase of the project. Secondly, we describe the RuDSDev framework and its architectural components in Chapters 3 through 4. Thirdly, the methodology for the evaluation and the acquired results are presented in Chapter 7. Finally, in Chapters 8 through 10 we summarise and discuss the general results and possible extensions of this research.


Chapter 2

Requirements elicitation and analysis

In order to determine what kind of features or software components are needed for a RuDSDev, we first needed to establish the requirements for such a system. We used requirements engineering (RE) techniques to research a system used for developing RuDSs as a case study. The stakeholders of the case study system were consulted for the requirements elicitation process of the research. A detailed description of the system can be found in Chapter 6 of this thesis.

As far as we could find during the research, there were no requirement documents for RuDSDevs available to the public, so we had to establish and document these requirements without any baseline. How these requirements were obtained, documented and analysed is described in this chapter of the thesis. In the following sections we describe how our analysis resulted in a high level software architecture for a RuDSDev.

2.1 Requirements sources

Two main requirements sources were used to elicit requirements-related information about RuDSDevs. The first source consisted of system stakeholders, which are "individuals, groups or organisations whose actions can influence or be influenced by the development and use of the system whether directly or indirectly" [43, p. 127]. By eliciting information from these stakeholders (e.g. by interviewing or surveying them), their concerns can be addressed and requirements based on their needs can be defined. The second source used for eliciting requirements-related information was existing reading material on DSDevs, e.g. articles, research papers, system documentation and product descriptions or specifications. Next to DSDevs such as Rasa we also examined platforms for developing spoken DSs, because these types of systems require the same NLU and dialogue management components as written DSs [55]. Furthermore, we studied software architecture literature to establish some standard but key technical requirements for software applications in general. The literature was primarily used to validate some of the claims made by the stakeholders we had consulted, thereby strengthening the validity of the elicited requirements.

2.2 Stakeholder identification

It is important to identify all stakeholder groups to help ensure everyone who may be affected by the software is consulted during the elicitation phase [51]. At an early stage in the design phase, we established that the stakeholders of the system can be categorised into the following groups: system developers, DS developers and members of the organisations which are involved in building or implementing DSs (e.g. product owners). However, it was not possible to interview all identified stakeholder groups, due to limited permission to access representatives of organisations for our case study. As a consequence, to obtain information about the needs of organisations we settled for interviewing specialists who communicate these needs on a daily basis to developers (e.g. consultants and intermediaries).

The stakeholder identification methods from Sharp et al. (1999) were used to establish a baseline of stakeholder groups. The baseline groups consisted of users (i.e. DS developers), developers (i.e. system developers) and organisations (i.e. DS product owners with less development experience). Users who converse with DSs are not seen as stakeholders of a development platform, because they only interact with the DSs themselves and do not use the system directly. The DS developers involved in our research were system developers as well, so it was not possible to gain information from stakeholders who were exclusively part of either the user or developer group. To clearly identify the differences between each of the baseline groups, three stakeholder profiles were created. These stakeholder profiles helped to identify the concerns of each stakeholder group, and solidified the roles of each group with regard to a DSDev. The profiles, listed in Appendix A, were created early on in the project and updated iteratively as the design phase of the project progressed.

2.2.1 DS developers

The "user" stakeholder group consists of the DS developers who use a RuDSDev to develop RuDSs. Their work mostly consists of designing and producing rules for controlling the decision making of a RuDS and training its NLU component. For example, a DS developer could be asked to build a RuDS for the airport domain, in which people ask questions such as: "Do you know if my flight has been delayed?" or "Can I bring my phone on the airplane?". The developer first needs to collect and analyse data examples to get an overview of the type of questions which are frequently asked in a particular domain. Based on the data analysis results the DS developer is able to design a RuDS which can complete a few specific tasks. For the airport domain the developer could design a RuDS which can answer questions about departure and arrival times, inform passengers about opening and closing hours of shops at the airport, or provide information about parking spots around the airport. The NLU component of a DS uses a trained AI-model to tag or classify a conversation with a certain goal or category. Hence, another task of the DS developer is to train the NLU component of a DS with conversational data. Questions such as "Do you know if my flight has been delayed?" have many variations which essentially have the same intent (e.g. "Is the flight to ... delayed?" or "Will my flight be late?"). DS developers have to ensure that the data examples used for training the NLU component represent the questions which a RuDS has to answer.
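The last point, that the training examples must cover every question a RuDS has to answer, lends itself to a simple automated check. The sketch below assumes that examples are stored per intent; the intent labels, the `underrepresented_intents` helper and the minimum of two examples are hypothetical choices made for illustration.

```python
# Hypothetical training examples grouped by intent label.
TRAINING_EXAMPLES = {
    "flight_delay": [
        "Do you know if my flight has been delayed?",
        "Will my flight be late?",
    ],
    "phone_on_plane": [
        "Can I bring my phone on the airplane?",
    ],
}


def underrepresented_intents(examples, required_intents, minimum=2):
    """List required intents that have fewer than `minimum` examples."""
    return [
        intent
        for intent in required_intents
        if len(examples.get(intent, [])) < minimum
    ]
```

For instance, `underrepresented_intents(TRAINING_EXAMPLES, ["flight_delay", "phone_on_plane", "parking"])` flags both the intent with a single example and the intent with no examples at all, which tells the developer where more data needs to be collected before training.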

2.3 Brainstorming sessions with DS developers

Brainstorming sessions with stakeholders are often used to find ideas for solutions or requirements in RE [40, p. 2]. Brainstorming alone is not sufficient to find complete requirements for a system, but in conjunction with other techniques such as interviewing, it can be an effective method for acquiring meaningful information about stakeholder concerns [32, p. 16]. Two brainstorming sessions were held with DS developers to find problems they encountered with a RuDSDev and to acquire some initial ideas for requirements and potential improvements. Furthermore, the brainstorming sessions helped to establish an initial list of concerns regarding the development process for RuDSs. The following topics were discussed during the brainstorming sessions: (1) what kind of problems are you experiencing with the development platform and DSs, (2) what are the main goals of the system, (3) which roles does each stakeholder fulfil, and (4) what are some of the potential functional and non-functional requirements for the system.

2.3.1 Brainstorming sessions findings

From the brainstorming sessions we extracted a list of features and concerns suggested by the stakeholders who participated. This list was continuously updated throughout the design phase and helped to establish some of the documented requirements listed in Appendix D. The key findings from the brainstorming sessions are listed below:

• Lack of standardisation when developing the rules for RuDSs (e.g. no enforced Domain Specific Language) is a concern when introducing the system to new developers. If there is no generic development process, it becomes difficult to enable developers to clearly define rule modules and establish a separation of concerns within rule-sets. Rules with the same type of functionality can be structured differently in each new project, so there are no clear instructions on what is considered to be a clean method for writing rules.

• Designing and developing an entire knowledge base is difficult to understand for novice developers who have a specific conversation flow in mind. If the process of designing and building rules is too complex it can be difficult to create a simple conversation flow for novice developers.

• RuDSs which have to answer questions with many possible answers have an exponentially growing conversation flow. As a result, RuDSs are not effective at handling long conversations with complex questions, because this requires a large number of rules to be written.


• If a RuDS only measures the goal of a user at the start of a conversation, it will not be able to detect if the user goal changes at a later stage. A RuDS will sometimes need to deduce information about the user’s goal as the dialogue progresses, which is not possible with single-turn intent detection. Conversations with humans are dynamic and often include sudden switches in intent or goal.

• There needs to be some feature which allows DS developers to visualise and monitor the decision making of deployed DSs. Visualisations would allow DS developers to better explain the behaviour of their RuDSs to people with less technical knowledge about DS development.

• The development platform should allow developers to test DSs before they are deployed. The important components of a RuDS which should be tested are the dialogue management and the NLU intent classifier.

• Some algorithms for NLU and dialogue management require a lot of data before they can function properly. In general, conversational data cannot simply be generated, because real conversations do not follow a specific grid or structure. Obtaining enough conversational data seems to be a recurring problem when building DSs.

• Some standard technical requirements, which are important from a software engineering perspective, were suggested by the DS developers.

– The development platform itself requires a solid DevOps structure with tools for continuous integration and testing.

– An automatic deployment system for new versions of the DS code-base is required to deploy software updates automatically to all released builds.

– Users of a development platform need some kind of authorisation for accessing the DSs they have built. Unauthorised users should not be able to access data for which they have no permission.

– If some of the core parts of the system are written in the Python programming language (e.g. NLU and dialogue management), it will be difficult to manage many computation-heavy requests at the same time. Although Python provides many libraries for machine learning and NLU, the language is notorious for its concurrency limitations, because its Global Interpreter Lock (GIL)¹ allows only one thread to execute Python bytecode at a time. Hence, CPU-bound work does not run in parallel across threads within a single Python process, and tasks are regularly delayed when a resource-consuming training process or other long-running back-end task is executed in a web-based framework². It is likely this issue could degrade the performance of a DSDev for all of its users when performing computation-heavy tasks.
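The Python concern raised in the last item can be illustrated with a short sketch. A common way to keep CPU-bound work from blocking other requests is to run it in separate processes, each of which has its own interpreter and its own GIL. The example uses the standard-library `concurrent.futures.ProcessPoolExecutor`; the `cpu_heavy` function is a hypothetical stand-in for a training job, and nothing here is taken from the platform studied in this thesis.

```python
import math
from concurrent.futures import ProcessPoolExecutor


def cpu_heavy(n: int) -> int:
    """Hypothetical stand-in for a long-running task such as model training."""
    return sum(math.isqrt(i) for i in range(n))


def run_heavy_tasks(sizes: list[int], workers: int = 2) -> list[int]:
    """Run the tasks in worker processes instead of threads.

    Each worker process has its own interpreter and GIL, so the tasks can
    execute on different CPU cores at the same time.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cpu_heavy, sizes))
```

When used in a script, `run_heavy_tasks` should be called from under an `if __name__ == "__main__":` guard, because some platforms start worker processes by re-importing the main module. The trade-off is that arguments and results are pickled and copied between processes, so this pattern suits a few large tasks better than many small ones.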

2.4 Interviews with stakeholders

Requirements elicitation emphasises the use of various techniques to gather information about user needs and domain requirements. One of these techniques is the interview, i.e. a tool for gathering knowledge throughout the RE process [3]. Next to serving as a method for eliciting knowledge and experience from domain experts and other end-users, interviews are also used to verify facts and clarify ambiguous system requirements. The informal nature of social interactions makes the effectiveness of interviews greatly dependent on the quality of the interview questions and the skills of the interviewer. RE analysts prefer the use of interviews because they enable a software engineer to gather in-depth and comprehensive information from the interview participant [3]. The goal of the interview process was to find out what kind of software components are needed for a RuDSDev, in order to enable both IT experts and novices to build the DSs they want to build. Furthermore, the interviews were used to confirm already established requirements for a RuDSDev.

2.4.1 Interview questions

The main script of the interview consisted of structured questions, that is to say a set of predefined questions for acquiring quantitative data from interviewees. Generally, the responses to these questions are short in nature, so there is no exploration possible on individual responses. Structured questions can be both open-ended and closed-ended and are usually delivered in the same format to all the interviewees. To allow for elicitation of in-depth technical requirements, we used a different interview script for experienced developers than for stakeholders with little or no experience with software development. Both interview scripts focus on extracting information about features needed for the development process of RuDSs, but for novice developers we included less technical questions to account for their inexperience with software development. From our established stakeholder groups, DS developers and system developers were used to represent experienced developers, and the inexperienced developers such as product owners and intermediaries were used to represent novice developers. Both interview scripts are described in Appendix B.

¹ https://wiki.python.org/moin/GlobalInterpreterLock

Next to structured questions we also asked unstructured probing questions, to encourage the interviewee to provide us with in-depth information. The probing questions were used to clarify specific details, and also to uncover new information hidden in the knowledge-base of the interviewee. Examples of probing questions are "Would you give me an example?" and "Can you elaborate on this idea?". An effective probing question helps to get a person to talk about their personal perspectives and experiences, and promotes critical thinking. The goal is to challenge assumptions and open up discussion, but the interviewee must always have the feeling that the interviewer is trying to help and is not questioning the participant to win an argument [28]. Probing allowed us to clarify interesting and relevant issues raised by the interviewees, and also to elicit more complete information from them. The combination of structured and unstructured questions resulted in a semi-structured interview, which is known for its ability to extract both qualitative and quantitative data from interviewees [31].

2.4.2 Interview process

Each interview started with an introduction in which some easy questions were asked to introduce the interviewees and to allow them to become familiar with the setting. As was stated in the previous section, the core of the interview consisted of structured and unstructured questions. The responses to structured questions are short in nature, while unstructured questions allow more room for exploration of responses. All of the interviews were audio-taped and notes were taken during the process.

In total, five different stakeholders were interviewed during the span of the research; the participants had varying levels of IT expertise (i.e. novice to expert). Before the interview process started, each participant had to read, agree with and sign a confidentiality document. Each interview was expected to take no longer than 30 to 40 minutes. The main goal of an interview was to find out what DS developers and potential DS developers need to build the RuDSs they want to build. Information was elicited about what the interviewee needs or what improvements are needed to be able to develop RuDSs with a development platform. Transcripts from an interview with an experienced DS developer and a novice DS developer are included in Appendix C.

2.4.3 Interview key findings

The interviews with DS developers verified some of the findings from the brainstorming sessions. Three requirements were consistently highlighted during the interviews with DS developers: the effectiveness of a rule-based approach, the importance of context tracking, and the ability to monitor and test RuDSs on the development platform. The developers emphasised the importance of predictability when completing goal-oriented dialogue and how a rule-based approach allows them to retain full control over the conversation flow, as described in the following quote from a developer:

"In situations where the bot does not only have to manage the conversations, but certain conditions have to be checked as well, API calls have to be made, and values have to be extracted from a database. In these cases it is relatively simple to design a rule-based bot which can handle all these tasks dynamically. So you have full control over which events happen and when."

Furthermore, the importance of tracking the context of a conversation was mentioned several times throughout the interviews, because it allows DSs to maintain multiple themes throughout dynamic conversations. As for the development platform itself, it is important for the users of the system to be able to monitor the actions and performance of DSs. Monitoring can potentially enable developers to improve a DS iteratively as they analyse its performance through visualisations and statistics.

In the novice developer interviews we mainly wanted to find out what stakeholders with less programming experience need to build RuDSs on a RuDSDev. Based on these interviews we determined that developers with no experience require some kind of guidelines or support from other developers before they feel comfortable with developing a DS by themselves. They are willing to learn how to program for the purpose of building a DS, but using some kind of high level language would also be acceptable to them. Once a RuDS is deployed, the developer should have the means to monitor its decisions, intent classifications and response utterances. We also found that novice developers are not always convinced whether or not the rule-based approach is the correct approach for their domain (i.e. customer service and user assistance bots), as seen in the following quote:

"I think as much of the process as possible should be AI-based and that we should use data to train the bots instead of using rules. I believe that an AI-based bot will always be better than a rule-based bot if it is set up properly. Setting up and training an AI-based bot can be difficult, so I can imagine that sometimes the developers choose to build a rule-based bot."

2.5 Apprenticing session with a DS developer

An apprenticing interview was performed with a single DS developer according to techniques described in Beyer and Holtzblatt (1995) [6]. In apprenticing sessions the learner (interviewer or researcher) tries to learn from the master (DS developer) by observing the execution of a task and forming his or her own understanding of how this task should be performed. The learner can actively participate by asking questions in order to find the hidden reasoning behind certain ways of working. Contextual inquiry methods such as apprenticing are known to minimise bias caused by suggestive questions and expectations, by placing the participant in the context of the questions. The goal of this interview was to gain insight into the work of DS developers by defining the steps involved with the process of building RuDSs. From these development steps we can derive what kind of features a development platform would need to support DS developers in their work.

From the participating DS developer we learned that the design of the conversation flow of a RuDS is usually based on the human-to-human conversational data made available to the developer. The DS developer needs to analyse the conversation flow from human-to-human conversations, because this will help the developer understand what kinds of actions a RuDS will need to perform to help users. If the data contains a lot of conversations with a specific topic as subject, the DS developer will search for a generic way to resolve all of these conversations. For example, if the data contains a significant number of example conversations about finding parking spots, then the developer could design the following conversation flow: (1) confirm whether the user's question concerns parking spots, (2) ask where precisely the user wants to find a parking spot, and (3) inform the user whether and where parking spots are available at that exact location. The development process witnessed in the apprenticing session was performed based on the data received from a particular company which enlisted the DS developer to build a DS. In the next section we provide a generic description of the development process observed during the session.

2.5.1 Development steps for building RuDSs

The DS developer described every step of the DS development process and some of the issues he encountered. What follows is a generic description of the work which a DS developer performs during the design and development phase of a RuDS.

1. The first step involves acquiring the conversational data, which is usually supplied by the client who wants to build a DS. The acquired data consists of entire conversations between two human actors, in which one of the participating human actors has a question or request related to a particular domain, and the other human actor has the knowledge or resources for answering the question from the other actor.

2. To get the data in a suitable format for analysis the developer uses a data formatting script. The features contained in this new format allow the developer to quickly go through the first sentences of the conversations, from which the DS developer can already tell the subject of the conversation and thus the dialogue intent of the user. The features of this formatted data are the following: (1) the first utterances from both human actors, (2) the entire conversation formatted with tagged actor responses, (3) the identifier for the conversation, and (4) the date and time on which the conversation took place.

3. The third step involves annotating the formatted conversational data. The annotation process consists of going through the data and placing each conversation in a certain intent category. These intent categories are created by the DS developer and stem from the domain for which the RuDS will be built (e.g. web-shops could have a "Product not delivered" category). The DS developer made an assumption based on experience that tagging about 150 to 200 conversation examples would provide a sufficient first impression of what kind of conversations the new RuDS will have to complete. Sometimes a single tagging phase is not enough for a DS developer, because in a later analysis new intent categories are established. These new categories will likely be more fitting for the domain or will be able to create a more manageable conversation flow. Other DS developers will often evaluate whether or not they agree with the analysis of the data, which can lead to functionality being added or removed in later stages of the development process.

4. The next step requires the developer to analyse the annotated conversations to determine what kind of actions and responses should be performed by the RuDS for each intent category. If the developers are unsure which actions the DS should take in a certain situation, they can usually consult the product owners or domain experts and ask them how they want to handle these cases. This happens mostly when outliers or special cases are found within the data. Furthermore, the developer needs to explore the services provided by the client for which the RuDS will be built, and determine whether it will be possible to integrate these services within the functionality of a RuDS. For example, some RuDSs have to work with external APIs, which requires some extra development effort during the rule building phase.

5. The developer creates a business case for the early design of the RuDS and presents it to the client or product owner. If the case is approved by the client the developers will start building the designed RuDS.

6. In the development phase the DS developer starts with designing the conversation flow for the established intent categories. Each intent category receives a decision tree model which shows all possible actions and paths a RuDS can take to complete a conversation. Based on the designed conversation flow the DS developer can define the behaviour of the RuDS by developing rules which the dialogue management component has to follow. Sometimes a new RuDS will have to perform new actions, such as accessing information through an API. The system developers can add functionality to the DS engine to allow the RuDS to perform new actions.

7. Another step in development requires the DS developer to categorise intent sentence examples based on the annotated conversations. An intent sentence example is often represented by the first sentence of an annotated example conversation. For example, the first sentence of an annotated conversation could be "How do I get a refund for this product?". This particular sentence helps the RuDS identify the refund intent. By training the NLU component on recognising these intent sentences the RuDS can identify the intent of the user when a similar sentence is detected. According to the DS developer, a first deployed version of a DS with about three distinguishable tasks would likely need about 500 to 1000 intent sentence examples to classify sentences with an acceptable accuracy (around 70%), depending on the quality of the examples. With just a few data examples some goals are difficult to distinguish.

8. The next step involves testing whether or not the dialogue management and intent classification perform as expected. The developer often switches back and forth between testing and development due to issues or bugs encountered during testing. In general, the developer has to write tests for both the rules and the intent classification of the newly developed DS. The rules are tested by evaluating if the inferencer can infer the correct response or action once a rule has been called. These kinds of tests are mainly executed by loading dummy data into the knowledge-base of a RuDS. Intent classification accuracy is tested by evaluating if the correct intents are classified by the NLU component. These tests also help to find the optimal detection threshold for the NLU algorithm.

9. After the essential functionality of the RuDS has been tested, the DS developer starts preparing the DS to be deployed on a certain platform. This phase is used to add some quality features to the RuDS such as proper greeting responses and making sure the system can handle responses in both English and Dutch if needed.

10. Once the development and testing phases have been completed, the RuDS is deployed on the platform of the client.

11. Most of the developed RuDSs require additional development time after deployment, due to changing requirements of companies or functionality not working as intended. In some cases the developed RuDS is first deployed on a work-flow platform, on which it can be tested with real conversational data without being released on the main platform (i.e. not accessible to real users). This extra development time also allows DS developers to add more intent sentence examples for training the NLU component.
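As an illustration of step 2, the formatted record could be sketched in Python as follows. This is a minimal sketch and not the developer's actual script: the class `FormattedConversation`, the function `format_conversation`, the field names and the input layout (a list of actor and utterance pairs) are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass
class FormattedConversation:
    """Hypothetical record holding the four features listed in step 2."""
    conversation_id: str               # (3) identifier for the conversation
    started_at: datetime               # (4) date and time of the conversation
    first_utterances: Dict[str, str]   # (1) first utterance of each actor
    tagged_transcript: str             # (2) whole conversation with actor tags


def format_conversation(
    conversation_id: str,
    started_at: datetime,
    turns: List[Tuple[str, str]],
) -> FormattedConversation:
    """Build the formatted record from raw (actor, utterance) pairs."""
    first: Dict[str, str] = {}
    for actor, text in turns:
        # setdefault keeps only the first utterance seen for each actor
        first.setdefault(actor, text)
    transcript = "\n".join(f"[{actor}] {text}" for actor, text in turns)
    return FormattedConversation(conversation_id, started_at, first, transcript)


# Invented example conversation, for illustration only
example = format_conversation(
    "conv-001",
    datetime(2018, 3, 1, 14, 30),
    [
        ("user", "How do I get a refund for this product?"),
        ("agent", "I can help you with that. What is your order number?"),
        ("user", "It is 12345."),
    ],
)
```

With such a record the annotator only has to read `first_utterances` to assign an intent category in step 3, and can fall back on `tagged_transcript` when the first sentences are ambiguous.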


2.6 Requirements documentation

For documenting requirements which were elicited during the project, we used a documentation style inspired by Volere's Requirements Specification Template (VRST) [49]. VRST was used as a guideline for writing down most of the individual requirements. The template was slightly altered by making it more focused on describing how each requirement is relevant to the domain of goal-oriented conversations (i.e. Q&A types of conversations) and thus to our research. The list of elicited requirements, with VRST documentation, is described in Appendix D. The full list of items included in our version of the documentation template is the following: (1) an identifier used to refer to the requirement (e.g. R-01), (2) the requirement type and sub-types, (3) a sentence or statement which encompasses the intent of the requirement, (4) sources of the requirement (e.g. stakeholders or literature), (5) an explanation of how the requirement is relevant to a system for developing RuDSs for completing goal-driven conversations (e.g. Q&As), and (6) the average priority rating assigned to the requirement by stakeholders.

After we had fully documented all of the established requirements, we performed a short survey among the interviewed stakeholders. In the survey the stakeholders could indicate how they would rate the priority of each documented requirement. The scale of this rating was from 1 to 4, with 1 indicating very low priority and 4 indicating that the requirement is highly prioritised by stakeholders. Our goal was to confirm whether we had elicited and documented the requirements which stakeholders found to be of high priority to implement in a RuDSDev. A low average priority rating would indicate that stakeholders think a requirement does not address their priority concerns or needs. If a survey participant thought a requirement was beyond their expertise, they could indicate this by marking the Not Applicable (NA) option. If a requirement was marked as NA by participants, this was indicated in the VRST documentation of the requirement. The results of the short survey showed that only requirement R-13 received an average priority rating below 2.5 (i.e. a low priority rating). However, the deployability requirement R-13 was not discarded due to the importance of implementing and maintaining a robust continuous integration structure for any software application [35]. We thus verified that the stakeholders from the interviews agree with the majority of the documented versions of the established requirements.
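The averaging described above, with NA answers left out of the mean, could be computed as in the following Python sketch. The ratings in `survey` are invented purely for illustration and are not the survey results; only the 1 to 4 scale, the NA option and the 2.5 threshold come from the text. The function name `average_priority` is likewise an assumption.

```python
from statistics import mean
from typing import Dict, List, Optional

LOW_PRIORITY_THRESHOLD = 2.5  # averages below this count as low priority


def average_priority(ratings: List[Optional[int]]) -> Optional[float]:
    """Mean of the 1-4 ratings; NA answers (None) are ignored."""
    numeric = [r for r in ratings if r is not None]
    return mean(numeric) if numeric else None


# Hypothetical answers of five participants; None stands for NA.
survey: Dict[str, List[Optional[int]]] = {
    "R-01": [4, 4, 3, 4, None],
    "R-13": [2, 3, 2, None, 2],
}

averages = {req: average_priority(r) for req, r in survey.items()}
low_priority = [
    req
    for req, avg in averages.items()
    if avg is not None and avg < LOW_PRIORITY_THRESHOLD
]
```

Excluding NA answers rather than counting them as a low rating keeps a requirement from being penalised only because some participants lacked the expertise to judge it.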

Requirement types and sub-types were assigned based on how they are defined in Bass et al. (2003) [4]. Functional requirements define how the system must behave at run-time. Non-functional requirements define qualifications for the behaviour of the system. Non-functional requirements can have a sub-type / refinement (e.g. usability, scalability, reliability, availability, interoperability, security, etc.).

Requirements
R-01 Functional (further specification: state tracking)
R-02 Non-functional, effectiveness (further specification: certainty of delivery)
R-03 Non-functional, performance (further specification: quality of response)
R-04 Non-functional, usability (further specification: ease of use / standardisation)
R-05 Non-functional, modifiability (further specification: multi-domain)
R-06 Non-functional, modifiability (further specification: multi-purpose)
R-07 Functional (further specification: predictability / freedom from risk)
R-08 Non-functional, interoperability (further specification: web-based application)
R-09 Non-functional, testability (further specification: unit testing)
R-10 Non-functional, security (further specification: integrity)
R-11 Non-functional, availability (further specification: up-time monitoring)
R-12 Non-functional, performance (further specification: multi-process, concurrency)
R-13 Non-functional, deployability (further specification: new version deployment)
R-14 Functional (further specification: DS responses)
R-15 Functional (further specification: intent classification)

Table 2.1: A summarised and less detailed list of all elicited requirements. A more detailed list of requirements can be found in Appendix D.


2.7 Requirement conflict analysis

Any software project can fail when key requirements are not correctly managed and conflicts between these requirements are ignored [10]. A requirement conflict usually arises when two requirements either contradict each other, cause inconsistencies, or create unwanted inter-dependencies [33]. We recognised the importance of performing requirement conflict analysis before attempting to find a solution for satisfying the requirements in question. Therefore, a conflict analysis matrix was created as described in Robertson (2012), to identify any conflicts between the elicited and confirmed requirements [50]. Each of the elicited requirements was compared with every other requirement in a conflict matrix of the kind illustrated in Figure 2.1. Identified conflicts between requirements were indicated in the matrix with a cross sign. During our analysis of the conflict matrix we assumed that two requirements are in conflict with each other if implementing the solution to one requirement prevents the implementation of the other.

Figure 2.1: An illustration of a matrix for identifying conflicting requirements. In the image, there is a requirements conflict between 3 and 7. Consequently, if we implement a solution for requirement 3, it will have a negative effect on the ability to implement a solution for requirement 7 and vice versa. The image is originally from Robertson (2012) [50].

For some of the requirements additional expertise was necessary to identify if potential conflicts were possible during implementation due to the technical nature of these requirements (e.g. security and deployability requirements). Therefore, developers were consulted when their technical expertise was required for the analysis. The following potential requirement conflicts were identified using the matrix method (as displayed in Figure 2.2):

(a) R-01 Functional (state tracking) potentially conflicts with R-04 Non-functional, usability

(b) R-04 Non-functional, usability potentially conflicts with R-06 Non-functional, modifiability

(c) R-10 Non-functional, security potentially conflicts with R-12 Non-functional, performance
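The matrix method can be sketched in Python as a symmetric lookup over all requirement pairs. This is only an illustration of the data structure: the three conflicting pairs are those listed above, while the names `CONFLICTS`, `in_conflict`, `conflicts_for` and `render_matrix` are assumptions introduced for this example.

```python
from itertools import combinations
from typing import List

REQUIREMENTS = [f"R-{i:02d}" for i in range(1, 16)]  # R-01 .. R-15

# Potential conflicts (a), (b) and (c), stored as unordered pairs so that
# the matrix is symmetric by construction.
CONFLICTS = {
    frozenset({"R-01", "R-04"}),
    frozenset({"R-04", "R-06"}),
    frozenset({"R-10", "R-12"}),
}


def in_conflict(a: str, b: str) -> bool:
    """True if the two requirements were marked with a cross in the matrix."""
    return frozenset({a, b}) in CONFLICTS


def conflicts_for(req: str) -> List[str]:
    """All requirements that potentially conflict with the given one."""
    return [other for other in REQUIREMENTS if in_conflict(req, other)]


def render_matrix() -> str:
    """Plain-text matrix: 'x' marks a potential conflict, '.' none."""
    header = "     " + " ".join(r[2:] for r in REQUIREMENTS)
    rows = [header]
    for a in REQUIREMENTS:
        cells = " ".join(" x" if in_conflict(a, b) else " ." for b in REQUIREMENTS)
        rows.append(f"{a} {cells}")
    return "\n".join(rows)


# Every unordered pair is compared exactly once.
all_pairs = list(combinations(REQUIREMENTS, 2))
found = [pair for pair in all_pairs if in_conflict(*pair)]
```

With 15 requirements there are 105 unordered pairs to compare, which is small enough to review by hand.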

Conflicts (a) and (b) arise due to the risk of a reduction in flexibility, potentially caused by reducing the complexity of and standardising the development process (R-04). RuDSs are flexible in how the conversation flow can be written or solved. Therefore, it is up to the developer and their domain knowledge to set the constraints for what tasks a RuDS is able to perform. Standardising development could lead to oversimplification of the rules and a reduction of flexibility. The flexibility of the rules allows RuDSs to potentially manage other types of processes than message responses (R-06), and it allows DS developers to come up with different solutions for the conversation flow. Flexibility in the design of the conversation flow is one of the strengths of a RuDS (R-01). One of the universal principles of software design states that as the usability of a system increases, its flexibility decreases [30]. A solution for satisfying R-04 would likely require the code structure of an entire system to be altered to enforce the standardised approach, in which case the developers should carefully consider conflicts (a) and (b).


Figure 2.2: The conflict matrix created by applying conflict analysis on the elicited requirements.

As the system grows and scales its capacity to manage thousands of developers, keeping the data of each of these users secure will become more difficult (c). Satisfying security requirements has been shown to have a negative effect on computation efficiency [15, p. 54], due to the difficulty of adding security functionality while balancing the efficiency of a system [14]. This kind of conflict makes it difficult to scale the number of processing units needed to manage computational tasks as the system grows. A DSDev requires developers to regularly perform computation-heavy training procedures for the NLU component, which could be difficult to optimise as system developers continue to add security functionality to the system.

These potential conflicts might not become issues depending on what kind of solution is designed to support the requirements. For example, a solution which supports standardised development (R-04) could very well support development of RuDSs which can handle a variety of different process types (R-06). These conflicts were considered to be warnings, and played a significant role during the design of a solution that would satisfy the key requirements.

2.8 Summary of findings

In the requirements engineering phase we established a set of 15 functional and non-functional requirements by performing various requirements elicitation techniques. The entire set of requirements was analysed to identify any conflicts between requirements which could potentially hurt the system during or after development. In the next chapter we present a software architecture based on the established requirements and propose several architectural components.


Chapter 3

Solution design

In the requirements engineering phase of the project we used various requirements elicitation techniques to gather information about user needs and domain requirements. In this chapter we propose a software architecture for a RuDSDev based on the requirements established in that phase. Furthermore, the proposed software architecture provided us with a high level representation of potential features which a RuDSDev would benefit from. In Section 3.2 we describe our motivation for further developing two features from the proposed architecture.

3.1 Software architecture proposal: RuDSDev

The proposed software architecture for the RuDSDev consists of several components, each communicating with one another to ensure the smooth running of the platform. The architectural components of the framework have been connected to the requirements which they aim to satisfy. Requirements R-02, R-05 and R-06 have been categorised as general requirements and are not represented in the architecture, because fulfilling these requirements involves multiple architectural components working together. A visual representation of the architecture is shown in Figure 3.1, which allows us to highlight the various services of the system and review the flow of information across components. The proposed architecture is not an ideal representation of a RuDSDev, but rather describes what is minimally needed to satisfy the stakeholders of such a system.

3.1.1 General requirements

The fulfilment of requirement R-02 is difficult to represent within the architecture, because it is fully dependent on how RuDSs are built by developers. There are always user queries which a RuDS will not be able to process, due to these being outside of their task scope. The development platform should at least enable development of RuDSs which can remove a significant amount of work-load from their human counterparts. However, it is entirely dependent on the type of RuDS, the domain and the domain knowledge of the DS developer whether or not this requirement can be satisfied, because the development platform only supplies the tools to do so.

To satisfy requirement R-05 the developed RuDSs are required not to be bound to a single domain. DS developers should be able to add example conversational data and utterance templates of their domain of choice. Furthermore, how the rules themselves are written should also be based on the domain knowledge of the developer. Whether a RuDS is an expert on web-shops or airports should be completely dependent on how it is designed and built by DS developers.

The rule-based inferencer is goal-oriented, which indicates that the rules determine what type of action a RuDS will perform based on certain constraints. These actions are not limited to conversing with users, but can also involve tagging user messages or simply informing users about a certain topic. The ability to enable RuDSs to perform other tasks next to responding with messages allows us to fulfil R-06 with our proposed architecture.

3.1.2 Front-end

For standardising development of RuDSs (R-04) we suggest enforcing a meta-model language for RuDS design at the front-end of the system. A meta-model forces DS developers to search for and introduce generic solutions to DS-based problems, instead of building rules with a different structure for each new RuDS. We argue that it is easier to gain consensus about abstractions like "class" and "relationship" with other developers than about domain specific operational data. By enforcing a high level language we enable DS developers to design generic modules which can be re-used when building new RuDSs. Furthermore, the generic modules allow developers to generate the frames of certain rules or rule-sets, which can make it easier and quicker for novice DS developers to start the development process.

Figure 3.1: A visual representation of our proposed architecture for a RuDSDev.

DS developers should be able to test if a RuDS is responding correctly to certain user input (R-09), and whether or not it can interpret the correct goal from user sentences. An interactive method for testing both of these conditions could be implemented by enabling DS developers to chat or communicate with the RuDS they have developed at the front-end of the platform. However, this would require DS developers to retry all test-cases once a change is made to a RuDS. Hence, this is only a viable method for quickly checking if certain input is correctly interpreted by the RuDS. Another option is to set up a test framework in which DS developers can build test-cases to evaluate specific types of conversations completed by a RuDS. Each time the developer changes the rule-base, they can run the test framework for a specific RuDS and thus check whether or not all the test-cases still pass after the change was made. New test-cases can always be added to the test framework when the developer desires to add new functionality to a RuDS. This test framework is very similar to a software engineering framework for unit-testing or integration testing.
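A test framework of this kind could be sketched in Python as a set of scripted conversations that are replayed against a RuDS after every change to the rule-base. The sketch below is an illustration under stated assumptions: `toy_ruds` is a hypothetical, stateless stand-in for a real RuDS, and `run_test_cases` and the contents of `CASES` are invented for this example.

```python
from typing import Callable, Dict, List, Tuple

# A scripted conversation: (user input, expected reply) pairs in order.
ScriptedConversation = List[Tuple[str, str]]


def toy_ruds(user_input: str) -> str:
    """Hypothetical stand-in for a developed RuDS (keyword rules only)."""
    text = user_input.lower()
    if "refund" in text:
        return "What is your order number?"
    if "parking" in text:
        return "Where do you want to park?"
    return "Sorry, I did not understand that."


def run_test_cases(
    bot: Callable[[str], str],
    cases: Dict[str, ScriptedConversation],
) -> List[str]:
    """Replay every test-case and return the names of the failing ones."""
    failures = []
    for name, turns in cases.items():
        for user_input, expected in turns:
            if bot(user_input) != expected:
                failures.append(name)
                break  # one wrong reply fails the whole conversation
    return failures


CASES: Dict[str, ScriptedConversation] = {
    "refund_intent": [
        ("How do I get a refund for this product?", "What is your order number?"),
    ],
    "parking_intent": [
        ("Is there parking near the station?", "Where do you want to park?"),
    ],
    "fallback": [
        ("Tell me a joke", "Sorry, I did not understand that."),
    ],
}

failing = run_test_cases(toy_ruds, CASES)  # empty list when all cases pass
```

Because the test-cases are plain data, adding functionality to a RuDS only requires adding entries to `CASES`, and the same runner can be reused for every RuDS on the platform.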

As specified by R-10, the system should be developed with security in mind and should respect the privacy and anonymity of its users. Users should only be able to access the areas of the system to which they have been granted access. Developed and deployed RuDSs are only accessible upon granting user access or ownership. Authentication in the form of passwords, one-time passwords (e.g. for resetting a password) and digital certificates will be used to ensure that users of the platform are who they claim to be. Furthermore, authorisation tokens should be put in place for users so the system knows if the users have permission to modify either data or services.

According to R-11, developers want to be able to monitor RuDSs to check conditions or to validate assumptions made during the design phase. Providing a tool for monitoring DS conditions at the front-end of the system could help developers detect and correct faulty behaviour of a RuDS. The system should thus provide some way to gain information about how a RuDS is performing, with either visualisations or plain information (e.g. statistics). For RuDSs this information could include, among others: the choices which a RuDS made during conversations, the number of completed conversations (successful and unsuccessful), and the performance of the NLU component.
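The statistics mentioned above could be derived from a conversation log as in the following Python sketch. The log format, the field names and the function `summarise` are assumptions made for illustration, and the entries in `LOG` are invented.

```python
from collections import Counter
from typing import Dict, List


def summarise(log: List[Dict]) -> Dict:
    """Aggregate simple monitoring statistics from logged conversations."""
    total = len(log)
    completed = sum(1 for entry in log if entry["completed"])
    correct = sum(
        1 for entry in log if entry["predicted_intent"] == entry["true_intent"]
    )
    return {
        "conversations": total,
        "completed": completed,
        "unsuccessful": total - completed,
        # share of conversations whose intent was classified correctly
        "intent_accuracy": correct / total if total else None,
        "intent_counts": Counter(entry["predicted_intent"] for entry in log),
    }


# Hypothetical log entries; true_intent would come from manual annotation.
LOG = [
    {"completed": True, "predicted_intent": "refund", "true_intent": "refund"},
    {"completed": True, "predicted_intent": "parking", "true_intent": "parking"},
    {"completed": False, "predicted_intent": "refund", "true_intent": "delivery"},
    {"completed": True, "predicted_intent": "refund", "true_intent": "refund"},
]

report = summarise(LOG)
```

A front-end could render such a report as a table or chart per RuDS.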


3.1.3 DS engine

The inferencer determines what the RuDS should do next based on the state of the conversation and the knowledge-base of the RuDS. The inferencer should be rule-based to allow DS developers to fully control and predict what the RuDS will do given a certain conversation state (R-07 / R-14). Depending on how the rules are written, the inferencer can choose to ask users to repeat their input or to ask users for more explicit data. The inferencer also controls under what conditions the RuDS will delegate the problem to a customer service employee. Actions inferred by the component can result in a message being sent to the user and / or a change in the current state of the conversation.
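A rule-based inferencer of this kind could be sketched in Python as an ordered list of condition and action pairs. This is a minimal sketch, not the engine used in the case study: the `Rule` class, the rule names, the state keys and the first-match-wins ordering are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]  # conversation state as seen by the inferencer


@dataclass
class Rule:
    name: str
    condition: Callable[[State], bool]
    action: str  # name of the action the RuDS should perform


# Hypothetical rule-base; rules are evaluated top to bottom.
RULES: List[Rule] = [
    Rule("handover",
         lambda s: s.get("failed_attempts", 0) >= 2,
         "delegate_to_employee"),
    Rule("unclear_input",
         lambda s: s.get("intent") is None,
         "ask_to_repeat"),
    Rule("missing_order_number",
         lambda s: s.get("intent") == "refund" and "order_number" not in s,
         "ask_order_number"),
    Rule("refund_ready",
         lambda s: s.get("intent") == "refund" and "order_number" in s,
         "start_refund"),
]


def infer(state: State, rules: List[Rule] = RULES) -> str:
    """Return the action of the first rule whose condition holds."""
    for rule in rules:
        if rule.condition(state):
            return rule.action
    return "ask_to_repeat"  # fallback when no rule matches
```

Since the same state always yields the same action, a developer can predict the behaviour of the RuDS by reading the rules alone, which is the property R-07 asks for.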

The state-tracker or context-tracker is a component of the DS engine specifically tasked with keeping track of all information extracted from users during conversation sessions (R-01). This component takes information interpreted from the user and stores it in some type of data-structure which represents the current context of a dialogue. Furthermore, the state-tracker also needs to store the interaction history, so RuDSs can return to previous states or topics of a conversation. The latest state of the conversation is communicated to the rule-based inferencer, which can make decisions based on what kind of data is stored within the state-tracker component. The state-tracker mainly receives information about conversation context from user input, but can also change the current goal based on information inferred by the inferencer component.
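The behaviour described above, a current context plus an interaction history that allows returning to earlier states, could be sketched in Python as follows. The class `StateTracker`, its methods and the shape of the state dictionary are assumptions made for this illustration.

```python
from copy import deepcopy
from typing import Dict, List, Optional


class StateTracker:
    """Hypothetical tracker: current dialogue context plus its history."""

    def __init__(self) -> None:
        self.state: Dict = {"goal": None, "entities": {}}
        self.history: List[Dict] = []  # snapshots of earlier states

    def update(self, goal: Optional[str] = None,
               entities: Optional[Dict[str, str]] = None) -> Dict:
        """Store a snapshot of the old state, then apply new information."""
        self.history.append(deepcopy(self.state))
        if goal is not None:
            self.state["goal"] = goal
        if entities:
            self.state["entities"].update(entities)
        return self.state

    def revert(self) -> Dict:
        """Return to the previous state, e.g. to resume an earlier topic."""
        if self.history:
            self.state = self.history.pop()
        return self.state


tracker = StateTracker()
tracker.update(goal="refund")                       # intent from the NLU
tracker.update(entities={"order_number": "12345"})  # entity from the user
tracker.update(goal="parking")                      # user switches topic
tracker.revert()                                    # back to the refund goal
```

Snapshots are deep copies so that later updates to the entities cannot silently alter the stored history.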

The system needs a robust NLU component which enables RuDSs to detect user goals and the important entities needed to complete goals. However, to be able to complete multiple goals within the same conversation a state-tracker component alone is not enough. The system needs the ability to detect goals from user utterances throughout the entire conversation, not just at the first user response. A RuDS should understand dynamically when it needs to stop completing a certain goal in favour of a new one (R-15). Detecting intents over multiple UDSTs is achievable with state-of-the-art NLU algorithms, especially for Q&A conversations or other types of goal-oriented dialogue [34]. Once the user input has been interpreted by the NLU component, the information is sent to the state-tracker to update the current state of the conversation.

To ensure quality of response the DS engine includes a non-AI based Natural Language Generation (NLG) component (R-03). This component uses templates, which allows the DS developer to fully control what kind of utterances can be returned by RuDSs. An example of a template response for greeting the user is the following: "Hi, I am a chat-bot. What is your question?". The inferencer component controls when and what template is used based on the state of the conversation and what rules have been fulfilled. These constraints allow DS developers to fully control when and what utterances are used by the RuDS, and thus prevent it from replying with responses potentially unacceptable to users or below a certain level of politeness.
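Template-based generation could be sketched with Python's standard `string.Template` class. Only the greeting text is taken from the example above; the other templates, the template identifiers and the function `generate` are assumptions made for illustration.

```python
from string import Template

# Hypothetical template store; every utterance the RuDS can produce is
# written out by the DS developer in advance.
TEMPLATES = {
    "greet": Template("Hi, I am a chat-bot. What is your question?"),
    "ask_order_number": Template("What is the order number for your $product?"),
    "handover": Template("I will connect you with a colleague, $name."),
}


def generate(template_id: str, **slots: str) -> str:
    """Fill the chosen template; raises KeyError if a slot is missing."""
    return TEMPLATES[template_id].substitute(**slots)


greeting = generate("greet")
question = generate("ask_order_number", product="laptop")
```

Using `substitute` rather than `safe_substitute` is a deliberate choice in this sketch: a missing slot raises an error during testing instead of sending a half-filled utterance to a user.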

3.1.4 REST API

Interoperability between software components is the backbone of any system, and a RuDSDev is no different. Due to the web-based nature of researched DSDevs such as Rasa, we propose a REST API for exchanging meaningful information throughout the RuDSDev via interfaces (R-08). REST interfaces are simple and generic, so any HTTP client can talk to any HTTP server by using the REST operations (POST, GET, PUT, DELETE) without further configuration. Furthermore, REST's simplicity makes it easier to implement than other similar interoperability interfaces [4, p. 165]. The REST API acts as a mediator for all architectural components involved during system run-time.

3.1.5 Multi-processing

For some long-running system tasks, such as NLU training sessions, one does not want to spin up a number of threads or sub-processes on the same machine which runs the rest of the application code (R-12). This would lower the performance of running tasks for other users of the RuDSDev. There needs to be a robust solution for handling a large number of concurrent requests in the form of multi-processing. The multi-processing component will be in charge of queuing jobs and processing them in the background with workers. This component becomes especially important as the platform grows and a significant number of DS developers use the platform concurrently. As displayed in the visual representation of the proposed architecture, the workers can be sent out to store and extract information from the database.
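The queue-and-worker pattern can be sketched with Python's standard library. Note the simplification: the sketch uses worker threads inside one process only to stay self-contained, whereas the architecture described above would run the workers on separate machines or processes behind a job queue. The function `fake_nlu_training` is a placeholder for a real training task, and all names and job payloads are assumptions.

```python
import queue
import threading
from typing import Dict, List


def fake_nlu_training(bot_id: str, n_examples: int) -> str:
    """Placeholder for a long-running NLU training job."""
    return f"{bot_id}: trained on {n_examples} examples"


def start_workers(jobs: queue.Queue, results: Dict[str, str],
                  n_workers: int = 2) -> List[threading.Thread]:
    """Start background workers that process jobs until they see None."""

    def worker() -> None:
        while True:
            job = jobs.get()
            if job is None:          # sentinel: no more work for this worker
                jobs.task_done()
                break
            job_id, func, args = job
            results[job_id] = func(*args)
            jobs.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    return threads


jobs: queue.Queue = queue.Queue()
results: Dict[str, str] = {}
workers = start_workers(jobs, results)

# The application only enqueues jobs and returns to the user immediately.
jobs.put(("job-1", fake_nlu_training, ("webshop-bot", 500)))
jobs.put(("job-2", fake_nlu_training, ("airport-bot", 1000)))

for _ in workers:
    jobs.put(None)   # one sentinel per worker
jobs.join()          # block until every queued item has been processed
```

The key design point is the decoupling: request handling only puts a job on the queue, so a slow training session never blocks the requests of other developers.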

3.1.6 Database

The database is a crucial component of the system, because it enables DS and user data to be stored, retrieved and updated when needed. Furthermore, without the database component we are unable to satisfy most of the documented requirements, because the features for supporting these requirements require some form of data storage or retrieval. Although the database was never explicitly mentioned as a requirement by the interviewed stakeholders, it is standard for an application to have some type of database.

For a possible implementation of such a database we looked at non-relational and relational database scheme types, which both allow us to model our data with few restrictions. We suggest MongoDB¹ as a viable option for storing user and DS data, because it provides a flexible non-relational structure. MongoDB supports features across more diverse data types than a relational database, and plans to support ACID transactions, a technique for ensuring that data remains consistent across the entire database as it is moved around.

3.1.7 DevOps

Testing the system from early on can effectively reduce costs later in the system development cycles, so a solid testing framework is needed (R-09). The proposed architecture includes a DevOps component which presents a basic continuous unit and integration testing framework. Continuous testing allows developers to execute automated tests as part of the software delivery pipeline to obtain immediate feedback on their work. To satisfy the deployability requirement there is also a need to include a continuous integration pipeline within the DevOps framework (R-13). Without an automatic deployment system in place, system developers will have to update every deployed version of the system manually whenever the source code changes. Managing a continuous testing and integration framework allows users of the system to benefit from the regular updates and bug fixes made to the platform by system developers.

3.2 Design motivation

In this section we motivate our choice for further designing and developing two of the proposed features from the RuDSDev software architecture. We support our design with a feature comparison of several existing DSDevs and with literature research.

3.2.1 Feature comparison

Figure 3.2 displays a DS feature comparison diagram which compares features from existing DSDevs, and maps the elicited requirements from our research to these features. The goal was to find out what kind of solutions were used by other platforms to satisfy the established dialogue management and RuDS development requirements. Although some of the platforms included in the overview were open source, it was still difficult to obtain a complete set of features for each platform. For example, some of the platforms did not provide a detailed description of how dialogue management was implemented within the documentation for their DSs. In order to keep the overview as complete as possible, we extracted some of the information from articles about DSs, which were often written by developers associated with the platform in question².

Starting from the top of Figure 3.2, we found that the ability to develop domain-specific DSs is supported by all platforms. However, none of the platforms support multi-purpose DSs, which was determined to be a requirement for a DSDev (R-06). In the field of customer service and user assistance there seems to be a future demand for other types of agents next to DSs (e.g. recommender agents or information tag agents) [42], but none of the evaluated platforms have really caught up to this trend yet.

Most of the platforms included in the feature comparison support a certain type of state or context tracking, which is used to help DSs deal with switching topics and keeping track of context throughout a conversation. For example, Rasa tracks all the events which happened throughout a conversation and allows the DS to return to previously encountered conversation states [8]. PyDial stores a set of multiple states (i.e. hypotheses) and calculates a probability distribution over these states to determine which is the current state of the conversation [55]. We argue that state tracking is a mandatory feature if DSs are to keep track of and satisfy multiple user goals throughout an interaction (R-01).

Features such as intent classification and dialogue management functionality seem mandatory for all types of DSs. Several of the evaluated platforms provide open-source access to code, which allows developers to extend existing dialogue management implementations fairly easily. Based on the elicited requirements it seems important for stakeholders to have control over the actions of DSs (R-07), but they only need this control on a higher level (e.g. through rules or policies), not on a source code level. There is no need for a DS developer to be able to add functionality through an open-source code-base.

Figure 3.2: Comparison of features from existing DSDevs. The requirements relevant to a specific feature are highlighted.

¹ https://www.mongodb.com/

To support the usability requirements of their system, some platforms offer pre-built modules, which can be used to quickly build standard DSs from only a few user stories. The need for standardised modules such as pre-built DS frames is also reflected in the elicited requirements (R-04).

3.2.2 Feature design

As described by Xu and Seneff (2010), DSs which use rule-based goal-oriented reasoning do not support any functionality for switching goals while in the process of completing a different goal. For single-turn RuDSs the classification of a user goal only takes place at the start of a dialogue, which is problematic because a RuDS will sometimes need to process new goals as the dialogue progresses. As documented in R-01, RuDSs need to be effective at staying on topic, because it allows them to provide pertinent answers throughout the entire conversation. The feature comparison showed that existing state-of-the-art systems have adopted some form of state or context tracking. State tracking enables DSs to better manage the conversation flow by re-evaluating earlier discussed topics. Most of the state tracking solutions use machine-learning models or other end-to-end approaches, which require a significant amount of training data. However, there is no guarantee that enough conversational data will be available for every case or client (e.g. smaller companies). As described in Williams et al. (2016), rule-based state-trackers require no data and can still be effective at maintaining the context of a conversation. Moreover, rule-based context tracking allows DS developers to maintain the predictability of the system, which was highlighted in R-07. Hence, the first modification we suggest is a rule-based state-tracker component, which uses state branching techniques based on Git³. The state-tracker is in charge of updating and maintaining an interaction state (i.e. the conversation context) based on interpreted user input. The state-tracker always keeps track of the current context of a conversation and can create or merge branches based on topic changes that occur throughout a conversation. The state-tracker component is described in Chapter 5.
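The branching idea can be illustrated with a small sketch. It is not the design from Chapter 5: the class names `Branch` and `StateTracker`, the slot dictionary, and the merge policy (values already present in the parent win on conflict) are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Branch:
    topic: str
    slots: Dict[str, str] = field(default_factory=dict)
    parent: Optional[str] = None  # topic that was active when this branch was created


class StateTracker:
    """Hypothetical Git-like tracker: one branch per conversation topic."""

    def __init__(self, root_topic: str = "main"):
        self.root = root_topic
        self.branches: Dict[str, Branch] = {root_topic: Branch(root_topic)}
        self.current = root_topic

    def update(self, slot: str, value: str) -> None:
        """Store interpreted user input in the current context."""
        self.branches[self.current].slots[slot] = value

    def switch(self, topic: str) -> None:
        """Topic change: check out an existing branch or create a new one."""
        if topic not in self.branches:
            self.branches[topic] = Branch(topic, parent=self.current)
        self.current = topic

    def merge(self) -> None:
        """Goal finished: fold the current branch into its parent and return to it."""
        branch = self.branches[self.current]
        if branch.parent is None:
            return  # the root branch has nothing to merge into
        # Fall back to the root if the parent branch was merged away earlier.
        parent = self.branches.get(branch.parent, self.branches[self.root])
        for slot, value in branch.slots.items():
            parent.slots.setdefault(slot, value)  # assumed policy: parent wins
        del self.branches[self.current]
        self.current = parent.topic
```

For example, a user who asks about opening hours while tracking an order causes a `switch("opening_hours")`; once that goal is completed, `merge()` returns the conversation to the order-tracking context with the newly gathered slots preserved.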

The second feature or modification we propose aims to address the lack of standardisation during the design and development phases of RuDSs established in R-04. According to stakeholders, standardisation of the development process is needed, because for novice DS developers it can be difficult to understand how rules should be implemented for new RuDSs. Therefore, a meta-model was designed based on process-models from the business processing field, which can be used for designing rules and enforcing a more standardised RuDS development process. During feature comparison, we found that existing designs for other DSDevs propose different methods to standardise the development of bots, such as pre-built intents or policies. The meta-model allows us to create standardised rule modules, which can be used to generate or develop the frames of certain standard rule formats. After determining what building-blocks (i.e. rules) are needed for creating a rule-set, the developer only needs to fill in specific details (e.g. type of response) and connect the rules through pre- and postconditions. The process meta-model is described in Chapter 4.
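How rules might be connected through pre- and postconditions can be sketched as follows. This is an assumption-laden illustration, not the meta-model of Chapter 4: the `Rule` fields, the `make_ask_rule` module and the simple forward-chaining loop in `run` are hypothetical, and a real RuDS would wait for user input between questions rather than firing all eligible rules at once.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Rule:
    name: str
    preconditions: FrozenSet[str]   # facts that must hold before the rule fires
    postconditions: FrozenSet[str]  # facts that hold after the rule has fired
    response: str


def make_ask_rule(slot: str, after: str, question: str) -> Rule:
    """Standardised 'ask for a slot' module; the developer fills in the details."""
    return Rule(
        name=f"ask_{slot}",
        preconditions=frozenset({after}),
        postconditions=frozenset({f"asked_{slot}"}),
        response=question,
    )


def run(rules: List[Rule], facts: Set[str]) -> List[str]:
    """Fire each rule once when its preconditions hold; return responses in order."""
    facts = set(facts)  # copy so the caller's set is not modified
    fired: Set[str] = set()
    responses: List[str] = []
    progress = True
    while progress:
        progress = False
        for rule in rules:
            if rule.name not in fired and rule.preconditions <= facts:
                facts |= rule.postconditions
                fired.add(rule.name)
                responses.append(rule.response)
                progress = True
    return responses
```

The order in which rules are listed does not determine the conversation flow here; the flow follows from how one rule's postconditions satisfy another rule's preconditions, which is the connection mechanism the paragraph describes.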

In the conflict analysis from Chapter 2 we established that an attempt to satisfy both R-01 and R-04 could cause some issues, due to the theory that improving performance has a negative effect on usability and vice versa. However, the meta-model affects the rule-set of a RuDS and thus has no direct connection with the state-tracker; the state-tracker only stores the results of what is inferred with the rules. Therefore, we argue that implementing one of these modifications will not negatively impact the effect of the other.
