
A POMDP Approach to select Tutoring Actions in an Interactive Learning Environment

Jorn W.T. Peters

10334793

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam

Supervisors:
Dr. B. Bredeweg, Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 904, 1098 XH Amsterdam
Dr. F. A. Oliehoek, Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 904, 1098 XH Amsterdam

June 26th, 2015


Abstract

In an educational setting, tutors need to infer the knowledge level of a student in order to select the next optimal teaching action using a teaching policy. Partially observable Markov decision processes (POMDPs) provide an elegant framework to represent the tutoring problem and to find teaching policies to be used by artificial tutoring agents. This approach has been used to construct tutoring agents for Intelligent Tutoring Systems (ITSs) that operate on a limited domain. This study takes a step towards leveraging POMDPs to construct coaching agents for an Interactive Learning Environment (ILE). In contrast to ITSs, the teaching domain for ILEs is significantly larger, as the learner operates in an open-ended learning environment. Consequently, the tutoring/coaching domain is more complex, resulting in larger state, action and observation spaces for the POMDP and, as a consequence, in a harder POMDP to solve. Specifically, this study proposes a mapping from a coaching problem in an ILE called DynaLearn to a POMDP. This POMDP is used to find a teaching policy, which is evaluated using a simulation of the envisioned users. Moreover, modelling assumptions are evaluated as to whether these assumptions result in less effective tutors. The results demonstrate the potential of using POMDPs to construct tutoring agents in an ILE.


Contents

1 Introduction
2 Background
2.1 Theoretical Background
2.1.1 Interactive Learning Environment and Intelligent Tutoring Systems
2.1.2 Formal Decision Making Processes
2.2 POMDP Solvers
2.2.1 Applications of Decision Processes in ITSs
2.2.2 Evaluation of Effectiveness of Automated Tutors
2.3 Problem Context
2.3.1 DynaLearn
2.3.2 DynaLearn Models and Simulation
3 Project Architecture
3.1 Simulation Architecture
3.1.1 POMDP Solver: Symbolic Perseus
3.1.2 Model Definition
3.1.3 Simulation
3.2 Real-world Architecture
3.2.1 Coaching Agent
3.2.2 DynaLearn Environment
4 The DynaLearn POMDP Model
4.1 States
4.2 Default Transitions
4.3 Actions
4.3.1 Teach Action
4.3.2 Teach_Subtle Action
4.3.3 Teach_Meta Action
4.3.4 Do_Nothing Action
4.4 Reward Model
5 Evaluation
5.1 Reduced Teaching Domain for Evaluation
5.2 Tutor Effectiveness when using Homogeneous Learner Models
5.3 Teaching Effectiveness and Tutor Restrictiveness for Different Student Types
6 Results
6.1 Tutor Effectiveness when Using Homogeneous Learner Models
6.2 Teaching Effectiveness and Tutor Restrictiveness for Different Student Types
7 Discussion
7.1 Model Design
8 Open Challenges
8.1 Scaling Model Complexity
8.2 Alternative Model Construction
8.4 Knowledge Level Estimation
9 Conclusion
9.1 Future Work


1 Introduction

When tutoring a student, a teacher must analyse and identify the understanding or knowledge of a student and simultaneously use a teaching policy for deciding the optimal pedagogical teaching action to take next. Although there has been substantial interest in the development of automated tutors for Intelligent Tutoring Systems (ITS) or other constrained environments/domains (see, e.g., Rafferty et al. (2011); Brunskill and Russell (2010); Folsom-Kovarik et al. (2010)), much less research effort has gone into the development of automated tutors for more complex environments such as Interactive Learning Environments (ILE). In contrast to an ITS, an ILE consists of a learner-steered open environment and uses coaching instead of strict teaching. Consequently, this results in a learning environment that imposes far fewer constraints on the learner's actions; however, this also means that the set of possible scenarios that an automated tutor should consider grows significantly.

DynaLearn (Bredeweg et al., 2013) is an ILE in which learners acquire knowledge by constructing and simulating models of the behaviour of systems. This study leverages a probabilistic, decision-theoretic approach in order to compute problem-specific teaching policies for DynaLearn. More precisely, a probabilistic representation of the learner's (latent) knowledge is utilized as state representation in a decision process known as a partially observable Markov decision process (POMDP) (Sondik, 1971). Given a learning goal and several other models that define the decision process, POMDPs provide a framework to compute an optimal teaching policy. Although POMDPs provide an attractive framework, considerable barriers arise when POMDPs are employed to compute teaching policies using realistic learner models. This study examines the applicability of POMDPs in ILEs.

An empirical evaluation of teaching policies using different learner models is presented. Moreover, the teaching policy computed using the DynaLearn POMDP is evaluated using a simulation that simulates three archetypal learners, namely (1) a weak and insecure learner that makes mistakes with high probability and needs hints from the tutor in order to finish the assignment, (2) a weak and secure learner that makes mistakes with high probability and needs steering by the tutor to finish the assignment, and (3) a strong learner that is unlikely to make mistakes and is often capable of identifying and correcting mistakes himself. Using this simulation an overview of teaching effectiveness and general behaviour of the DynaLearn POMDP is presented.

The rest of this thesis is organised as follows: Section 2 presents the theoretical background and context of the project. Section 3 discusses the architecture that supports the project. Section 4 introduces the DynaLearn POMDP model, Section 5 discusses the evaluation methods and results are presented in Section 6. A discussion of the proposed DynaLearn POMDP and evaluation results is presented in Section 7. Section 8 reviews open challenges and conclusions are drawn in Section 9.

2 Background

2.1 Theoretical Background

2.1.1 Interactive Learning Environment and Intelligent Tutoring Systems

Although both ILEs and ITSs are computer-aided teaching systems, they are different in a number of respects. Aleven et al. (2003) refer to an ILE as a computer-based system that consists of a task environment and provides support to the learner. This support may take the form of context-specific hints, feedback, means for reflection, or the availability of relevant (hyperlinked) information via a dedicated space or interface. DynaLearn (Bredeweg et al., 2013) is an ILE in which learners construct and simulate models of the behaviour of systems to acquire conceptual knowledge, i.e., the implicit or explicit rules that form a system.

ITSs refer to computer systems that rely on techniques from artificial intelligence and cognitive science in order to reason about the learner's knowledge. Furthermore, using domain models, the learner is provided with individualized feedback while solving a problem in a steered context. Examples of ITSs include AutoTutor (Graesser et al., 2005), which simulates the learner-teacher relationship by entering into a natural-language conversation with the learner, and Why2 (VanLehn et al., 2002), which asks the learner to write paragraph-long explanations of simple systems that are converted into a representation that enables Why2 to reason about the knowledge of a learner.

ITSs provide the learner with context-specific feedback in order to teach a complex skill by doing. ILEs, on the other hand, present a complex problem to the learner, encouraging the learner to seek help via the various help systems that provide rich background knowledge, in order to help the learner acquire complex conceptual knowledge (Aleven et al., 2003). Due to the restrictive nature of ITSs, the possible learner actions are limited. Because of this, the scenarios that need to be considered by an ITS are also limited, which several studies, discussed in Section 2.2.1, have leveraged to develop tutoring agents. In contrast, ILEs use open-ended, self-steered contexts that allow a learner to explore a problem using various approaches. Consequently, the set of possible scenarios, and with that the complexity of developing an intelligent coaching agent, increases significantly compared to an ITS.

2.1.2 Formal Decision Making Processes

Finding the optimal teaching policy is a planning problem, that is, given a correct model of the learner, the environment dynamics and a learning goal (reward structure), find the optimal plan. However, as the effects of actions in the educational domain are inherently stochastic, a plan in the form of a sequence of actions does not suffice. Instead, a mapping from situations to actions that describes the behaviour of the agent is needed (Kaelbling et al., 1998).

Figure 1: MDP Agent: the agent has full knowledge about the environment and interacts with/alters the environment through actions. The outcome of actions is stochastic, i.e., multiple outcomes are possible.

One method to find such a situation-action mapping is the Markov decision process (MDP). MDPs are the basis of the POMDPs that are the main interest of this study. An MDP agent takes as input the state of the environment and outputs an action based on the current state. An MDP models a process in which it is assumed that, while there may be uncertainty in the outcome of an action, there never is uncertainty about the current state of the environment. Moreover, the next state and the reward that is gained depend only on the current state and action; this is known as the Markov property. Formally, an MDP can be described as a tuple ⟨S, A, T, R⟩ (Kaelbling et al., 1998), where

• S is a (finite) set of all the states of the environment;
• A is a set of all possible actions;

• T : S × A → Π(S) is the state-transition function or transition model that maps a state-action pair to a probability distribution over the environment states (note that throughout this report T(s, a, s′) = P(s′ | s, a) is used to refer to the probability of ending up in state s′ after taking action a in state s); and

• R : S × A → ℝ is the reward function that specifies the immediate reward that is gained after taking action a in state s.

Given a well-defined MDP, a policy π : S → A is computed. The policy π specifies an action to be taken for each possible state, i.e., the policy defines the behaviour of the agent. The optimal policy π* maximizes the reward that is gained over the lifetime of the decision process: either the finite-horizon criterion in (1), where rt is the reward at time step t, or the infinite-horizon discounted criterion in (2).

E[ ∑_{t=0}^{k−1} rt ]    (1)

E[ ∑_{t=0}^{∞} γ^t rt ]    (2)

Here the immediate reward is discounted using 0 < γ < 1, such that priority is given to rewards in the near future. There are many methods for finding π*, such as the value-iteration algorithm (Kaelbling et al., 1998); however, a review of these algorithms is beyond the scope of this section.

One of the main assumptions of MDPs is that the environment is fully observable; however, the environment may not be fully observable in every domain. For example, in the education domain the knowledge or understanding of a student is not directly observable, yet given some evidence one can reason about the student's knowledge. This is known as partial observability. A POMDP is an MDP that assumes that the environment is partially observable. Formally, a POMDP is a tuple ⟨S, A, T, R, Ω, O⟩ (Kaelbling et al., 1998), where

• S, A, T and R form an MDP;

• Ω is a set of observations that the agent can make in the environment; and

• O : S × A → Π(Ω) is the observation function that maps a resulting state and action to a probability distribution over the possible observations (note that in this report O(s′, a, o) = P(o | s′, a) is used to refer to the probability of observing observation o after taking action a and reaching state s′).

Similar to an MDP, the optimal policy π* for a POMDP is found by optimizing the expected discounted reward.

As shown in Figure 2, a POMDP agent can be decomposed into two components. The policy, which is computed using a POMDP solving algorithm, is responsible for selecting an action given the current belief state. Furthermore, the state estimator (marked SE in Figure 2) updates the belief state given the previously selected action, observations

Figure 2: POMDP Agent: instead of fully observing the environment, a belief about the environment is maintained. The state estimator (SE) is responsible for updating the environment belief (b), whereas the policy (π) selects actions based on the current belief.

and current belief state. Thus, the agent (potentially) modifies the world via actions and subsequently makes observations that influence the agent's belief of the current state of the world. The belief state is represented as a probability distribution over the set of possible states. This probability distribution provides a sufficient statistic for the history and initial belief of the agent, that is, additional data on the history of the agent will not provide any further information (Smallwood and Sondik, 1973). POMDPs provide a framework that enables one to plan under uncertainty in both actions and observations. Moreover, POMDPs treat information-gathering actions on an equal footing with actions that alter the environment. Thus, POMDPs are widely applicable in environments that require a balance between collecting and exploiting information.
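The belief update performed by the state estimator follows directly from the definitions of T and O above: b′(s′) ∝ O(s′, a, o) · ∑_s T(s, a, s′) b(s). The following is a minimal sketch in Python, assuming tabular models and an enumerable state set; the function and argument names are illustrative and not taken from the thesis software.

```python
def update_belief(b, a, o, states, T, O):
    """Bayes update of the belief state.

    b           -- dict mapping state -> probability (current belief)
    a           -- the action that was taken
    o           -- the observation that was received
    states      -- iterable of all states
    T(s, a, s2) -- P(s2 | s, a), the transition model
    O(s2, a, o) -- P(o | s2, a), the observation model
    """
    new_b = {}
    for s2 in states:
        # P(o | s2, a) * sum_s P(s2 | s, a) * b(s)
        new_b[s2] = O(s2, a, o) * sum(T(s, a, s2) * b[s] for s in states)
    norm = sum(new_b.values())  # P(o | b, a), the normalising constant
    if norm == 0:
        raise ValueError("observation has zero probability under the model")
    return {s2: p / norm for s2, p in new_b.items()}
```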

2.2 POMDP Solvers

Typically, when a problem is represented as a POMDP, an off-the-shelf POMDP solver can be used to find optimal policies for the problem. However, classic POMDP solution algorithms are ineffective for POMDPs with very large state spaces. In many domains, each relevant feature of a state can be described by a variable Xi with domain Di, resulting in a state space S = D1 × D2 × · · · × Dn. Moreover, actions can be described by their effect on a subset of the state variables, and observations as a causal effect of a subset of the state variables. This is known as a factored POMDP representation (Poupart, 2005).

Typically the transition and reward functions of a factored POMDP can be described using a dynamic Bayesian network (DBN) (Dean and Kanazawa, 1989). A DBN is a graphical representation of a stochastic process that leverages conditional independence. Each node in a DBN is a state, action or observation variable. Figure 3 shows an example of a DBN that defines a transition and reward function. The nodes in the Xi layer are pre-action state variables and the nodes in the Xi′ layer are post-action variables. Each post-action variable has a conditional probability table (CPT) associated with it. Given the DBN, Bayes' theorem is used to factor the transition function into a product of smaller conditional distributions, e.g., P(X1′, X2′ | X1, X2) = P(X1′ | X1, X2) P(X2′ | X2).

DBN structure: P(X1′ | X1, X2), P(X2′ | X2), P(O′ | X1′)

CPT for X1′:
X1     X2     P(X1′ = D1,1)   P(X1′ = D1,2)
D1,1   D2,1   0.8             0.2
D1,1   D2,2   0.8             0.2
D1,1   D2,3   0.8             0.2
D1,2   D2,1   0.6             0.4
D1,2   D2,2   0.4             0.6
D1,2   D2,3   0.4             0.6

Figure 3: Example use of a DBN. The variables in the Xi slice are pre-action variables and the variables in the Xi′ slice are post-action variables. The directed edges indicate a causal dependence between two variables, e.g., X1′ is dependent on both X1 and X2. Moreover, the observation variable O is dependent on post-action variables. The conditional probability table (CPT) for X1′ is shown; others are omitted.

Figure 4: Example ADD representation of the CPT in Figure 3. Note that the edge labels are left out for brevity. This ADD encodes the same probability distribution as the CPT.

The observation function is factored similarly.
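To illustrate how such a factored transition is evaluated, the sketch below multiplies per-variable CPTs to obtain the joint transition probability. The values for X1′ are taken from the CPT in Figure 3; the CPT for X2′ is not listed in the thesis, so the numbers used for it here are placeholders.

```python
# Per-variable CPTs, stored as dictionaries mapping parent values to a
# distribution over the child variable.
P_X1 = {  # P(X1' | X1, X2), values from Figure 3
    ("D1,1", "D2,1"): {"D1,1": 0.8, "D1,2": 0.2},
    ("D1,1", "D2,2"): {"D1,1": 0.8, "D1,2": 0.2},
    ("D1,1", "D2,3"): {"D1,1": 0.8, "D1,2": 0.2},
    ("D1,2", "D2,1"): {"D1,1": 0.6, "D1,2": 0.4},
    ("D1,2", "D2,2"): {"D1,1": 0.4, "D1,2": 0.6},
    ("D1,2", "D2,3"): {"D1,1": 0.4, "D1,2": 0.6},
}
P_X2 = {  # P(X2' | X2), placeholder values for illustration only
    "D2,1": {"D2,1": 0.7, "D2,2": 0.2, "D2,3": 0.1},
    "D2,2": {"D2,1": 0.1, "D2,2": 0.7, "D2,3": 0.2},
    "D2,3": {"D2,1": 0.1, "D2,2": 0.2, "D2,3": 0.7},
}

def p_joint(x1_next, x2_next, x1, x2):
    """P(X1', X2' | X1, X2) factored as P(X1' | X1, X2) * P(X2' | X2)."""
    return P_X1[(x1, x2)][x1_next] * P_X2[x2][x2_next]

print(p_joint("D1,1", "D2,1", "D1,2", "D2,2"))  # 0.4 * 0.1 = 0.04
```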

Symbolic Perseus Symbolic Perseus (Poupart, 2005) is a POMDP solver that utilizes the factored representation in order to solve POMDPs with large state spaces. As underlying data structure, Symbolic Perseus uses Algebraic Decision Diagrams (ADDs); an example of the transition for X1′ represented as an ADD is shown in Figure 4. Moreover, Symbolic Perseus uses an anytime algorithm, i.e., an algorithm that is expected to return better results when given more run time, but is able to return a result when interrupted at any time.

2.2.1 Applications of Decision Processes in ITSs

Barnes and Stamper (2008) propose a method to construct an MDP based on historical data of student solutions in a logic proof tutor. Given the current state of the learner's proof, the MDP is used to generate a hint that helps the learner find a next step. Using this approach, a tutoring system can generate highly contextual hints without relying on examples authored by teachers. Moreover, this approach allows the knowledge model to learn from new student data.

Several studies have demonstrated the use of POMDPs to construct policies, also known as teaching policies, to select a teaching action given a belief state of the knowledge of a learner. Rafferty et al. (2011) studied the effect of different learner models when using POMDPs to construct a teaching policy. In this study three different learner models are used, namely: (1) a memoryless model; (2) a discrete model with memory and (3) a continuous model. The memoryless model is a simple model in which the learner state takes the form of the single concept that the learner believes is true. This concept is updated if the presented information contradicts this belief. In that case the concept is replaced with a concept that is consistent with the presented information. Similarly, the state of the discrete model with memory consists of the single concept that the learner believes to be true. However, this model also stores the last M actions. Whenever new information is inconsistent with the knowledge state of the learner, the state is replaced with a state that is consistent with the new information and the M stored actions. Thirdly, the continuous model defines the state as a probability distribution over all possible concepts. In the event of new information, the probability of concepts that are in contradiction with the information is set to zero, followed by a re-normalization of the distribution.

Rafferty et al. (2011) report that the use of different learner models results in different teaching policies and that the more complex models perform better. The memoryless model is too simplistic or pessimistic about the abilities of the learner and therefore performs worse. Another approach, by Folsom-Kovarik et al. (2010), leverages the partial observability assumption of POMDPs by including the cognitive state of a learner in the model. The belief about the cognitive state of a learner is used to adjust the expected result of a teaching action. For example, if a learner is weary she will be less likely to learn from reading a paper than when she is lively and energetic.

The problem of finding an optimal policy for a POMDP is known to be PSPACE-complete (Papadimitriou and Tsitsiklis, 1987); therefore, in order to solve problems formulated as POMDPs, either the possible size of the POMDP is limited or, if possible, the domain structure needs to be leveraged. Brunskill and Russell (2010) propose a method that leverages the domain structure to compute an envelope of reachable states that are considered. Specifically, POMDPs that are factored, have positive-only effects and have unique preconditions for each variable are taken into consideration. Brunskill and Russell call POMDPs that possess these properties POFUPP processes. A POFUPP process is a specialization of a POMDP in which the knowledge state is represented as a single vector and that only allows transitions from not known to known, that is, forgetting is not modelled. Moreover, every concept (or variable) has unique preconditions, i.e., the model assumes that it is not possible to learn a concept before the preconditions of this concept have been fulfilled. Leveraging the properties of a POFUPP process, Brunskill and Russell propose an algorithm called RAPID that is able to find teaching policies even when the model consists of a very large state space, whereas other state-of-the-art POMDP solvers such as Symbolic Perseus (Poupart, 2005) were unable to find a policy within a reasonable time span.

2.2.2 Evaluation of Effectiveness of Automated Tutors

Two main streams of evaluation for planning methods in ITSs are distinguishable in the literature, namely: (1) classroom evaluation and (2) evaluation by simulation. The classroom evaluation, as employed by Rafferty et al. (2011) and Stamper et al. (2011), deploys the experimental method in an ITS or other type of software and evaluates whether students using the experimental method perform better than students not using the experimental method. The evaluation by simulation consists of detailed experiments to investigate the effectiveness of the experimental method. Evaluation by simulation is used, among others, by Brunskill and Russell (2010) and Folsom-Kovarik et al. (2010). This method uses a simulation to simulate a learner that uses the experimental method; therefore, the results from this evaluation may not give conclusive evidence on the effectiveness when used by human learners. However, by comparing the results to those of existing methods or by defining specific metrics such as robustness, useful results may be obtained. For example, Irissappane et al. (2014) describe a method that evaluates the policy, as found by the planner, using a simulation in which the parameters are adjusted to not match those of the POMDP model. Using this method, an informed assessment can be made whether the agent using the policy shows similar behaviour in different situations.

2.3 Problem Context

The current study is conducted in the context of DynaLearn. More precisely, DynaLearn is used with a specific model, namely a population model. This section introduces both DynaLearn and the population model.

2.3.1 DynaLearn

As discussed earlier, DynaLearn (Bredeweg et al., 2013) is an ILE that lets learners express conceptual knowledge via a workspace that supports both model construction and simulation. The representations that are used to express conceptual knowledge are scaffolded into a hierarchy of increasing model complexity called learning spaces. Here, learning space 4 (causal differentiation) is taken into consideration. This learning space allows learners to make a distinction between two types of causality in a system, in the form of influences and proportionalities. Both are discussed in Section 2.3.2.

The models created using DynaLearn are qualitative reasoning (QR) models. QR models allow a modeller to express beliefs or state hypotheses focusing on the conceptual characteristics of a system. Moreover, as the models can be simulated, they can show the consequences of what is believed to be true (Bredeweg et al., 2013). An example of a DynaLearn model is shown in Figure 5.

DynaLearn provides several help systems to the learner, which range from basic help that helps the learner understand the DynaLearn software, to model simulations that allow a learner to explore the effects of modelling decisions. Furthermore, recommendations based on expert-made models can be generated that help learners decide on model improvements or next steps. Although DynaLearn does provide these help systems, all systems are on-demand and are only activated if the learner actively chooses to enable them.

2.3.2 DynaLearn Models and Simulation

Key components of QR models in DynaLearn are quantities. Quantities represent the dynamic features of a system and consist of a magnitude that describes the current value and a derivative that encodes the direction of change. The values that a feature can assume are defined for each feature and consist of a discrete, ordered set called a quantity space, such as {Min, Zero, Max} with Min ≺ Zero ≺ Max. Multiple quantities may have the same quantity space, however each may represent different values (Bredeweg et al., 2013).

Entity: Entities represent physical elements or abstract concepts that form the system. Entities can be structurally related to other parts of the system (Liem, 2013).
Quantity: Quantities represent the dynamic, changeable elements of the system.
Influence: Influences are directed causal relationships between quantities and cause change within a model, that is, given the magnitude of the source quantity the derivative of the target quantity increases or decreases. An influence can be either positive or negative.
Proportionality: Proportionalities spread effect throughout the system, that is, the derivative of the target quantity is set to the value of the source quantity. Also known as indirect influences, proportionalities can also be either positive or negative.

Table 1: Description of (relevant) DynaLearn elements

Moreover, a DynaLearn QR model includes entities that represent the physical objects or abstract concepts that, when combined, form the system. Causal relationships between quantities are expressed using influences and proportionalities. Influences are direct causal relationships and indicate that, depending on the magnitude of the source quantity, the derivative of the target quantity increases or decreases. Similarly, proportionalities represent a causal relationship between two quantities where the derivative of the target quantity changes depending on the derivative of the source quantity (Bredeweg et al., 2009). Both influences and proportionalities can be positive or negative, that is, the target quantity's derivative changes in the same direction or in the opposite direction. Although DynaLearn QR models include more model ingredients, these are not taken into consideration for the research presented in this report. The relevant model elements are described in Table 1, and Figure 5 shows the population model that is used throughout this study.

A DynaLearn model can be simulated; the simulation results represent the progression of the system over time. However, duration is not represented in the simulation outcome; instead, time is represented as states in a graph, where each state exhibits different behaviour. State transitions occur due to changing quantity magnitudes. The simulation results of the population model are shown in Figure 6.
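As a toy illustration of the model ingredients described above, the structure of the population model could be written down as plain data as follows; this representation is invented for the example and is not DynaLearn's internal format.

```python
from dataclasses import dataclass

@dataclass
class Quantity:
    name: str
    quantity_space: tuple  # ordered magnitudes, smallest first

entities = ["Population"]
quantities = [Quantity("Size", ("Zero", "Plus")),
              Quantity("Birth", ("Zero", "Plus")),
              Quantity("Death", ("Zero", "Plus"))]
# (source, target, kind, sign): an influence (I) acts on the target's derivative via
# the source's magnitude, a proportionality (P) via the source's derivative.
causal_relations = [("Birth", "Size", "I", "+"),
                    ("Death", "Size", "I", "-"),
                    ("Size", "Birth", "P", "+"),
                    ("Size", "Death", "P", "+")]
```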

3 Project Architecture

The aim of this project is to construct a POMDP-based intelligent coaching agent for DynaLearn. The architecture supporting the coaching agent in a real-world scenario is shown in Figure 8. This section discusses the elements of the architecture. Moreover, as for the present study the coaching agent is evaluated using a simulation, the architecture supporting this simulation is shown in Figure 7. First the simulation architecture is presented; then the differences between the real-world and the simulation architecture are discussed.

Figure 5: Model of a population system. The model defines an entity population and the quantities size, birth and death. All quantities have the same quantity space assigned; the derivatives are unassigned and therefore initially unknown. Birth has a positive influence (I+) on size, i.e., a positive birth rate (which may be steady) causes an increase in size. In turn, a change in size has a positive effect (P+) on birth. These two causal relationships implement a feedback mechanism. The value correspondence (V) between the zero values indicates that if either quantity takes on value zero, the other should also take on value zero. The relationships between size and death are similar to those of birth and size, only death has a negative influence (I-) on size.

(a) DynaLearn state graph    (b) DynaLearn state values

Figure 6: Simulation results of the population model in Figure 5. (a) shows the state graph of the simulation results; states 2, 3 and 4 are terminal states, i.e., once such a state is reached the behaviour of the system will not change. (b) shows the magnitudes and derivatives that the quantities assume in the different states. For example, in state 1 birth's magnitude is plus and the derivative is decreasing. This shows that for the population model the terminal states are: a population that increases and keeps increasing (state 3), a population that is extinct (state 4) and a steady population (state 2).

3.1 Simulation Architecture

The project is decomposed into several sub-problems, as shown in Figure 7. The main components are the model description (labeled model), the POMDP solver and the simulation. The model includes an (abstract) domain and learner description that is used to construct both the POMDP specification that is used by the POMDP solver and the learner model that is used in the simulation. Using the POMDP specification, the POMDP solver constructs a teaching policy (π) that is used in the simulation. Similarly, variations of the model are used to evaluate the teaching policy on different student archetypes and, finally, the simulation stores simulation results in a simulation report.

3.1.1 POMDP Solver: Symbolic Perseus

In this project the off-the-shelf POMDP solver Symbolic Perseus (Poupart, 2005) is used. Symbolic Perseus takes as input a specification of a POMDP that is generated by the model definition and produces a policy that can be used in the simulation or by the coaching agent. The policy is composed of several α-vectors. Each α-vector is related to a specific action, and multiple α-vectors may be related to the same action. Specifically, for each α-vector the dot product between the α-vector a and the current belief state b can be computed, such that a · b ∈ ℝ. The action that is recommended by the policy is the action associated with the α-vector that maximizes this dot product.
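To make the action-selection step concrete, the sketch below returns the action of the α-vector with the largest dot product with the current belief. The list-of-pairs encoding of the policy is an assumption for illustration and does not reflect the file format produced by Symbolic Perseus.

```python
import numpy as np

def select_action(belief, alpha_vectors):
    """belief        -- 1-D numpy array over states (sums to 1)
       alpha_vectors -- list of (alpha, action) pairs, alpha a 1-D array
       Returns the action whose alpha-vector maximises alpha . belief."""
    best_action, best_value = None, -np.inf
    for alpha, action in alpha_vectors:
        value = float(np.dot(alpha, belief))
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Tiny illustration with two states and three alpha-vectors (made-up values).
policy = [
    (np.array([10.0, 0.0]), "Teach(c1)"),
    (np.array([2.0, 2.0]),  "Do_nothing"),
    (np.array([0.0, 9.0]),  "Teach_subtle(c1)"),
]
print(select_action(np.array([0.3, 0.7]), policy))  # -> "Teach_subtle(c1)"
```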

Figure 7: Overview of project architecture.

3.1.2 Model Definition

The model for the tutor is defined in the model definition component (labeled model in Figure 7). The model is defined in such a way that it can be used to generate the input specification for the POMDP model and serve as a model for simulation. Similar to the input specification of Symbolic Perseus, the model definition uses ADDs to specify the transition, observation and reward functions. Although a similar method for specification is used, the model definition is not tightly coupled to Symbolic Perseus and in future work may be used to compare POMDP-based tutors to other model-based intelligent tutor implementations.

3.1.3 Simulation

The simulation component uses an event-based, discrete-time simulation to simulate the interaction between a student and the intelligent tutor while using DynaLearn. As shown in Figure 7, the simulation consists of a student state, a policy π and a simulation manager. The student state is a simulated representation of a student and is responsible for generating the observations that are used to update the belief state. The student state is based on the model definition, but variations are added in order to simulate different archetypal students. The teaching policy π is obtained via the POMDP solver and is responsible for mapping the belief state of the intelligent tutor to actions. To speak in terms of Figure 2: the simulation manager is the state estimator and the student state is the environment. The simulation consists of the following phases:

1. The belief of the simulation manager is initialized to the initial belief as defined in the model;
2. The student state is sampled from the initial belief as defined in the model;
3. Repeat until termination:
   i. Given the current belief state of the simulation manager, an action is selected by the policy π;
   ii. The student state is updated based on the transition as defined in the student model;
   iii. Given the new student state, observations are generated;
   iv. Given the selected action and observation, the belief state of the simulation manager is updated according to the model.

As the simulation may run indefinitely, the simulation is terminated after a predefined number of iterations.
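A compact sketch of this loop is given below; the callable arguments stand in for the model components described above, and their names are illustrative rather than taken from the actual simulation code.

```python
import random

def run_simulation(initial_belief, sample_transition, sample_observation,
                   update_belief, select_action, max_steps=100, seed=0):
    """Discrete-time simulation of one tutor-learner episode (phases 1-3 above)."""
    rng = random.Random(seed)
    belief = dict(initial_belief)                                   # phase 1
    # phase 2: sample the hidden student state from the initial belief
    student_state = rng.choices(list(belief), weights=list(belief.values()))[0]
    history = []
    for _ in range(max_steps):                                      # phase 3 (bounded)
        action = select_action(belief)                              # 3.i
        student_state = sample_transition(student_state, action)    # 3.ii
        observation = sample_observation(student_state, action)     # 3.iii
        belief = update_belief(belief, action, observation)         # 3.iv
        history.append((action, observation))
    return history
```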

3.2 Real-world Architecture

As shown in Figure 8, the real-world architecture is very similar to the simulation architecture. However, instead of a simulation, the real-world architecture includes a coaching agent and takes place in the DynaLearn environment. Note that this architecture has not been implemented and is only included as an illustration of the real-world application of the coaching agent as proposed in this study.

3.2.1 Coaching Agent

The coaching agent (labeled coach in Figure 8) is the central component of the coaching system. First the coaching agent keeps track of the belief state, i.e., it will perform the belief updates after each coaching step. Secondly, the coaching agent is responsible for mapping the abstract actions that are selected by the teaching policy to concrete actions. To do this, the coaching agent maintains a history of selected actions. This way the coaching agent can ensure diversity of actions over time such that the learner is not confronted with the same action over-and-over again. The specification of the mapping from abstract actions to concrete actions is beyond the scope of this study.

3.2.2 DynaLearn Environment

DynaLearn is the environment in which the coach–learner interaction takes place. The current DynaLearn environment needs to be adjusted such that the interaction between the coach and the learner can take place. Moreover, DynaLearn is required to communicate the observation to the coaching agent. Currently it is unclear if the DynaLearn environment is able to communicate the type of observations that are used by the coaching agent.

4 The DynaLearn POMDP Model

In this section, the DynaLearn POMDP is introduced. An overview is given of the elements in the POMDP model, showing how the problem of teaching in DynaLearn is mapped to a POMDP. Although the main goal of the DynaLearn POMDP is to find an effective teaching policy to teach a target model to a learner, the following constraint needs to be taken into account. As DynaLearn is an open-ended learning environment in which the student is encouraged to explore possible solutions to a problem, the DynaLearn POMDP based tutoring agent should not just lead the learner through the model sequentially. Instead, the tutor should allow a student to explore, and only intervene during exploration if student

Figure 8: Overview of the architecture when the POMDP-based coaching agent is used in a real-world scenario. The architecture is similar to that of the simulation scenario; however, instead of a simulation the interaction takes place in DynaLearn.

environment, knowledge about this learning environment is a prerequisite to learning anything else in that environment. Thus, instead of modelling only knowledge of the current problem, the model needs to include a belief about the learner's knowledge of the environment; here this knowledge is called meta-knowledge.

The current model focuses on learners that have no prior knowledge of the model, that is, there is no distinction between a learner new to the model and a more knowledgeable learner, other than the fact that a more knowledgeable learner may show correct behaviour with a higher probability. A possible method to overcome this by obtaining a better belief of the knowledge levels of the learners is discussed in Section 8.4.

4.1 States

The state consists of knowledge levels on both concepts of the model and meta-knowledge, the current state of the student and the current state of the model. Let Q be the discrete set of knowledge levels, C the set of model knowledge concepts and M the set of meta-knowledge. Then, the state can be expressed as a tuple ⟨KC, KM, F, E, A⟩, where KC ∈ Q^|C| is a vector indicating the knowledge levels on model concepts, KM ∈ Q^|M| a vector that indicates the knowledge levels on meta-knowledge, and F ∈ {present, not present}^|C| a vector of fully observable variables that indicate whether there is a fault in the model for each of the model knowledge components. E is a fully observable binary variable that indicates whether the student is currently exploring options and A is a binary variable that expresses whether the student is annoyed by the tutor. This results in a state space of |Q|^(|C|+|M|) · 2^(|C|+2) states. In order to limit the size of the state space, only two knowledge levels are taken into consideration, such that Q = {no knowledge, complete knowledge}, resulting in a state space of 2^(2|C|+|M|+2) states.

In the population model three types of knowledge concepts can be distinguished, namely entities, quantities and causal connections. Together these groups form the set of knowledge concepts C of the population model. The meta-knowledge of the population model consists

Transition of an observable variable: ∀s′: P(s′ | s, a) = 0.5

Observation model P(O′ | VO′):
VO′           present   not present
present       1.0       0.0
not present   0.0       1.0

Figure 9: Transition and observation model of an observable variable. VO is an observable variable and O an observation associated with that variable. During each time step the distribution over values transitions to a uniform distribution over all possible values. Based on the observation, all probability mass is assigned to a single value. Note that not every observable variable uses the values present and not present; however, the same principles hold.

of knowledge about entities, quantities, influences and proportionalities.

Centity = {Population}
Cquantity = {Size, Birth, Death}
Ccausal = {Size P+→ Birth, Size P+→ Death, Birth I+→ Size, Death I-→ Size}
C = Centity ∪ Cquantity ∪ Ccausal
M = {Entity, Quantity, Influence, Proportionality}

This results in a state space of 2^22 = 4,194,304 states.
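The size of this state space follows directly from the variable counts. A small sketch, with the sets written out as Python collections (the string encodings are chosen just for the example):

```python
C_entity   = {"Population"}
C_quantity = {"Size", "Birth", "Death"}
C_causal   = {"Size P+-> Birth", "Size P+-> Death", "Birth I+-> Size", "Death I--> Size"}
C = C_entity | C_quantity | C_causal                         # 8 model knowledge concepts
M = {"Entity", "Quantity", "Influence", "Proportionality"}   # 4 meta-concepts
Q = {"no knowledge", "complete knowledge"}                   # 2 knowledge levels

# |Q|^(|C|+|M|) knowledge assignments, 2^|C| fault flags, 2 values each for E and A
n_states = len(Q) ** (len(C) + len(M)) * 2 ** len(C) * 2 * 2
print(n_states)  # 4194304 == 2**22
```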

4.2 Default Transitions

Each variable has a default transition; however, an action may override the default transition. The default transition for each variable (type) is discussed in this section; if the transition is overridden by an action, this is specified in Section 4.3.

Observable Variables Although a POMDP assumes that the environment is partially observable, there are features of the environment that are fully observable, i.e., there is no uncertainty about the state of the variable. Examples of this are errors in the model or whether the student is currently in a state of exploration. In the present model the variables fi ∈ F that indicate whether an error is made and the variable E are modelled as fully observable variables. For each observable variable an observation is received at each time step. After each time step, for each observable variable the probability mass is assigned to a single value, as is shown in Figure 9.

Knowledge Variables Knowledge is modelled to not increase without explicitly performing a teaching action. Therefore, the default transition for knowledge variables assumes that knowledge does not increase. However, if a mistake associated with a concept is present, the knowledge belief for this concept decreases. The DBN of this transition is shown in Figure 10a.

P(ci′ | ci, fi):
ci   fi   P(ci′ = NK)   P(ci′ = CK)
NK   NP   1.0           0.0
NK   P    1.0           0.0
CK   NP   0.0           1.0
CK   P    0.05          0.95

(a) Transition model of concept knowledge variables. The belief on knowledge of a concept is based on the belief in the previous step and the presence of an error associated with the concept.

P(A′ | A):
A       P(A′ = True)   P(A′ = False)
True    0.85           0.15
False   0.0            1.0

(b) Default transition model of learner annoyance. Learner annoyance is only dependent on the annoyance in the previous step and decreases over time.

Figure 10: Default transition models of concept knowledge and learner annoyance. In this figure NK is short for no knowledge, CK for complete knowledge, P for present and NP for not present.
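For illustration, the default concept-knowledge transition of Figure 10a can be encoded as a small table and sampled as follows; the helper name and dictionary encoding are assumptions made for this sketch only.

```python
import random

# P(c_i' = "CK" | c_i, f_i) from Figure 10a; "NK"/"CK" = no/complete knowledge,
# "P"/"NP" = error present / not present.
P_LEARN = {("NK", "NP"): 0.0, ("NK", "P"): 0.0, ("CK", "NP"): 1.0, ("CK", "P"): 0.95}

def default_knowledge_step(c, f, rng=random):
    """Sample the next knowledge level of a concept under the default transition."""
    return "CK" if rng.random() < P_LEARN[(c, f)] else "NK"

print(default_knowledge_step("CK", "P"))  # stays "CK" with probability 0.95
```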

Meta-Knowledge Variables The rationale of meta-knowledge is that if multiple errors are present in the model for a single model element, such as quantities, then this may be attributed to a lack of knowledge on this element. As shown in Figure 11, the belief on meta-knowledge is based on the belief in the previous step and on the presence of mistakes that are associated with instantiations of the meta-knowledge.

Learner Annoyance Learner annoyance is included to let the model reason about whether the student can be interrupted or not. By default a tutor does not annoy a learner and thus the annoyance decreases over time, as is shown in Figure 10b.

4.3 Actions

The following actions are included in the model: (1) Teach(ci): perform a teaching action for knowledge concept ci ∈ C that will claim the user's attention; (2) Teach_Subtle(ci): perform a teaching action that the user may ignore for knowledge concept ci ∈ C; and (3) Do_nothing: do nothing in the next time step.

P(mi′ | mi, F1i, F2i):
mi   F1i   F2i   P(mi′ = NK)   P(mi′ = CK)
NK   NP    NP    1.0           0.0
NK   NP    P     0.9           0.1
NK   P     NP    0.9           0.1
NK   P     P     1.0           0.0
CK   NP    NP    0.0           1.0
CK   NP    P     0.4           0.6
CK   P     NP    0.4           0.6
CK   P     P     0.4           0.6

Figure 11: Default transition model of meta-knowledge. The belief of meta-knowledge mi ∈ M is based on the belief of mi in the previous step and the presence of errors in instantiations (F1i, . . . , Fni) of the meta-knowledge. For example, errors in quantity spaces increase the belief in a lack of meta-knowledge on quantities. Note that for brevity only two fault variables are taken into account in the CPT; however, there may be more. In this figure NK is short for no knowledge, CK for complete knowledge, P for present and NP for not present.

Each of these actions, except for Do_nothing, can be seen as an abstract action, i.e., the software using the POMDP can choose a specific action based on the abstract action that is selected by the teaching policy. For example, in the population model, Teach(Birth) may result in an explanation that Birth is a rate and thus cannot be negative; however, if a mistake related to Birth is present, this may be just pointed out. For each action the transition, observation and reward model is discussed in the following sections. For each action a cost function is described; the cost function is used as a negative reward, that is, an action with a high cost will receive a lower reward for reaching state s′ than an action with a lower cost for reaching the same state s′.

4.3.1 Teach Action

The Teach(ci) action models an intrusive action that aims to teach a model-related concept ci ∈ C. This action is intrusive in the sense that the user cannot ignore the action. This may be achieved by using a virtual character or by any other means of getting the user's attention.

Transitions The Teach(ci) action overrides two default transitions, namely the transition of the concept ci and that of learner annoyance. The transition for concept ci models the fact that a learner may acquire knowledge on concept ci, however it may also get confused due to information that is in conflict with the current belief of the learner. However, if there is a meta-concept mi ∈ M associated with ci, the transition is influenced by the state of the meta-concept. If mi is known, ci is easier to learn, and if mi is not known, ci is harder to learn. The learner annoyance does increase, and the increase depends on whether the learner is in a state of exploration. The DBN of the transition is shown in Figure 12.

P(ci′ | ci, fi):
ci   fi   P(ci′ = NK)   P(ci′ = CK)
NK   P    0.8           0.2
NK   NP   0.7           0.3
CK   P    0.4           0.6
CK   NP   0.0           1.0

(a) Transition model for the concept knowledge variable ci when performing Teach(ci). The transition is dependent on the knowledge level for ci in the previous time step and the presence of an error associated with ci. This transition is used when there is no meta-knowledge variable associated with ci.

P(A′ | A, E):
A       E       P(A′ = True)   P(A′ = False)
True    True    1.0            0.0
True    False   1.0            0.0
False   True    0.7            0.3
False   False   0.3            0.7

(b) Transition model for student annoyance (A) when Teach(ci) is performed. If Teach(ci) is performed the annoyance never decreases; moreover, if the student is currently in exploration the annoyance increases more than when the student is not exploring options.

P(ci′ | ci, fi, m):
ci   fi   m    P(ci′ = NK)   P(ci′ = CK)
NK   P    NK   0.95          0.05
NK   P    CK   0.7           0.3
NK   NP   NK   0.85          0.15
NK   NP   CK   0.65          0.35
CK   P    NK   0.5           0.5
CK   P    CK   0.35          0.65
CK   NP   NK   0.1           0.9
CK   NP   CK   0.0           1.0

(c) Transition model for ci when Teach(ci) is performed and a related meta-knowledge variable is part of the model. If there is complete knowledge for the meta-variable, the chance of learning ci increases, whereas the absence of meta-knowledge negatively influences the probability of learning ci.

Figure 12: Transition model for Teach(ci). In this figure NK is short for no knowledge, CK for complete knowledge, P for present and NP for not present.

P(O′ | D′, B′, ci′) = P(B′, D′ | ci′)

P(D′ | ci′):
ci′   Short   Medium   Long
NK    0.45    0.25     0.3
CK    0.4     0.4      0.2

P(B′ | ci′):
ci′   Incorrect   Correct
NK    0.8         0.2
CK    0.2         0.8

Figure 13: Observation model for Teach(ci) and Teach_meta(mi). The observation is a compound, or joint, observation. Note that D′ and B′ are not part of the DynaLearn POMDP model and are only included for illustration. In this figure NK is short for no knowledge and CK for complete knowledge.

Observations When a Teach(ci) action is performed, a compound observation is received. A compound observation is a single observation that represents two separate observations. In this case the two observations are behaviour, which can be either correct or incorrect, and the duration that it took the learner to show the behaviour, which can be either long, medium or short. The observations could also be included in the model separately; however, the definition of the compound observation is less complex, as the observations are related and both only depend on the knowledge level of a single concept.

Cost The cost function for Teach(ci) is composed of several penalties that are dependent on the belief state. First, a penalty is given if the Teach(ci) action is performed when ci is in a state of complete knowledge. Secondly, there is a penalty on Teach(ci) if the student is annoyed. Moreover, when the student is exploring options a penalty is given; however, if there are mistakes in the model the cost is significantly lower than when the model is correct. In other cases only a small cost is paid, relative to the time the action takes compared to other actions. Furthermore, a penalty is computed if prerequisite concepts are in a state of no knowledge. In the case of entities there are no prerequisites, for quantities all entities are prerequisite concepts and for causal relations quantities are modelled as prerequisite concepts. This is to ensure that the POMDP only selects actions when the learner is ready to respond to those actions.
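The thesis does not list the numeric penalty values, so the sketch below only shows the composition pattern; every weight and the belief accessors are made-up assumptions for illustration.

```python
def teach_cost(belief, c, prerequisites, exploring, model_has_errors, base_cost=2.0):
    """Illustrative cost of Teach(c) as a sum of belief-dependent penalties.

    belief[x]     -- current probability that x is in a state of complete knowledge
    prerequisites -- concepts that should be known before c is taught
    All numeric weights below are example values, not the thesis parameters."""
    cost = base_cost                                           # time the action takes
    cost += 5.0 * belief[c]                                    # c is likely already known
    cost += 4.0 * belief.get("annoyed", 0.0)                   # student is annoyed
    if exploring:
        cost += 1.0 if model_has_errors else 6.0               # interrupting exploration
    cost += 3.0 * sum(1.0 - belief[p] for p in prerequisites)  # unknown prerequisites
    return cost
```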

4.3.2 Teach_Subtle Action

The Teach_subtle(ci) action is similar to the Teach(ci) action in that it models a teaching action for a specific concept ci ∈ C. However, in contrast to Teach(ci), Teach_subtle(ci) does not take the duration into account. Moreover, Teach_subtle(ci) models a non-intrusive action, i.e., the user is free to ignore the action.

Transitions As the Teach_subtle(ci) action models a teaching action that is easily ignored,

P(ci′ | ci, fi):
ci   fi   P(ci′ = NK)   P(ci′ = CK)
NK   NP   0.95          0.05
NK   P    0.99          0.01
CK   NP   0.0           1.0
CK   P    0.05          0.95

(a) Transition model for ci when the Teach_subtle(ci) action is performed. There is a small chance that the learner will learn a concept using this action; however, as long as there are no errors in the model the knowledge level of ci is retained.

P(O′ | ci′):
ci′   Incorrect   Correct
NK    0.8         0.2
CK    0.2         0.8

(b) Observation model for learner behaviour when the Teach_subtle(ci) action is selected. The learner will show correct behaviour with high probability if he has knowledge about the concept and with low probability if the learner is in a state of no knowledge.

Figure 14: Transition and observation function for Teach_subtle(ci). In this figure NK is short for no knowledge, CK for complete knowledge, NP for not present and P for present.

only the transition of the knowledge concept ci diverts from the default transitions; all other variables are modelled according to the default transitions, as is shown in Figure 14a.

Observations As said, the duration of a response is not taken into account, as a learner may ignore the action. Thus, the observation only consists of a qualification of the behaviour of the user. That is, the user may show either correct or incorrect behaviour. The DBN for the observation for the Teach_subtle(ci) action is shown in Figure 14b.

Cost Much like the cost function of the Teach(ci) action, the cost function for the Teach_subtle(ci) action is also a combination of several penalties. First, a penalty is given for executing the Teach_subtle(ci) action when the student is in a state of complete knowledge for ci; however, this penalty is significantly lower than that of Teach(ci). Moreover, when the learner is in a state of exploration while there is an error in the model, a significant penalty is imposed on selecting the Teach_subtle(ci) action. This is based on the assumption that an easy-to-ignore hint will not help the learner to address the error when in a state of exploration. In all other states the cost function specifies a cost relative to the time that the action will take.

4.3.3 Teach_Meta Action

The Teach_meta(mi) action is similar to the Teach(ci) action in that it models an intrusive action; however, instead of teaching concept knowledge related to the model, an overarching meta-concept mi ∈ M is taught. For example, if the learner does not understand any quantity, this may be the result of an incorrect idea about quantities. In that case the Teach_meta(mi) action is used to correct the knowledge of this meta-concept.

P(mi′ | mi):
mi   P(mi′ = NK)   P(mi′ = CK)
NK   0.7           0.3
CK   1.0           0.0

Figure 15: Transition model for Teach_meta(mi). The meta-knowledge mi′ is dependent on the knowledge level for mi in the previous time step. In this figure NK is short for no knowledge and CK for complete knowledge.

Transitions The Teach_meta(mi) action's transition model overrides the default transition model for meta-knowledge. The Teach_meta(mi) action increases the belief that the learner has knowledge about a meta-concept. The DBN of the transition function is shown in Figure 15. Although the Teach_meta(mi) action is an intrusive action similar to the Teach(ci) action, the model does not include increased annoyance, as the action will directly help the student to correct errors in the model.

Observations The observation model for the Teach_meta(mi) action is the same as that of the Teach(ci) action and is shown in Figure 13.

Cost The cost function for the Teach_meta(mi) action is composed of the following elements. First, a significant penalty is given for executing the Teach_meta(mi) action in a state in which the student has knowledge about the meta-concept. Moreover, the action should not (often) be executed if the student is in a state of exploration; therefore a penalty is given for executing the action when in exploration. In all other cases the cost is computed relative to the duration of the action.

4.3.4 Do_Nothing Action

The Do_nothing action allows the tutor to wait and let the learner figure out what to do next or finish a current task. No specific transition or observation is defined for this action. This means that no knowledge observation is received. However, the observations for the fully observable variables are received. Thus, information on model errors and the exploration state is available to the tutor in the next time step. The cost of the Do_nothing action is set to one.

4.4 Reward Model

Rewards are associated with states in which one or more concept knowledge variables ci ∈ C are in a state of complete knowledge. The reward increases linearly with the number of concepts ci ∈ C that are known by the learner. The reward per time step is computed by taking the reward as described here minus the cost of the selected action.
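A minimal sketch of this per-time-step reward computation; the function signature and the per-concept weight of 1.0 are assumptions, as the text does not specify numeric values.

```python
def step_reward(knowledge, action_cost, reward_per_concept=1.0):
    """knowledge -- dict mapping concept name -> "NK" or "CK" (complete knowledge)."""
    known = sum(1 for level in knowledge.values() if level == "CK")
    return reward_per_concept * known - action_cost

print(step_reward({"Population": "CK", "Size": "CK", "Birth": "NK"}, action_cost=1.0))  # 1.0
```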

Figure 16: Restricted DynaLearn model used for evaluation

5 Evaluation

Section 4 describes the DynaLearn POMDP model; however, as the design of the model is based on assumptions, the teaching policy generated using the POMDP may not be effective for real-world students. Although a classroom experiment that would allow a substantiated statement on the effectiveness of the DynaLearn POMDP is out of scope for the present study, simulations can help to examine the effect on several types of students or applications. This section describes two methods of evaluation using simulation. The results of these methods are presented in Section 6.

5.1 Reduced Teaching Domain for Evaluation

The DynaLearn POMDP model as presented in Section 4 describes a scheme to map a DynaLearn reference model to a POMDP. This can be used to construct a POMDP for the population model. However, as discussed, this results in a POMDP with a very large state space. Consequently, finding a policy for a POMDP of this size within acceptable time using an off-the-shelf POMDP solver is unrealistic. Therefore, for the evaluation, a smaller model, as shown in Figure 16, is used. Although this reduced model does not represent any interesting real-world system that might be modelled in DynaLearn, it does contain all the elements that are taken into consideration by the POMDP model.

The restricted model consists of four knowledge concepts, i.e., one entity, two quantities and a single causal relationship; in addition, only a single meta-concept is taken into account, namely the meta-concept quantities. This model results in a state space of 2^11 = 2048 states, which is several orders of magnitude smaller than when using the complete population model.

5.2 Tutor Effectiveness when using Homogeneous Learner Models

An important assumption in the DynaLearn POMDP as described in Section 4 is that each separate type of knowledge has the same transition model, i.e., each type of concept is equally difficult to learn. However, in the real world there is a great variety in possible difficulty levels. Although it is clear that the model does not cover the complexity of the real world, this does not necessarily mean that the teaching policy is inefficient when used in a domain with different difficulty levels. In other words, the DynaLearn POMDP may be robust to changes in the model.

In order to examine whether the simplified assumption on concept difficulty has an impact on the tutor, an alternative version of the DynaLearn POMDP is created that uses a heterogeneous learning difficulty model, i.e., the transition from no knowledge to complete knowledge for causal relationships is less likely than the same transition for quantities. In turn, learning quantities is more difficult than acquiring knowledge about entities.

Given the DynaLearn POMDP, which includes a homogeneous learning difficulty model, and the heterogeneous POMDP, the effect of the simplifying assumption is examined by simulation using a learner model whose learning difficulty model is similar to that of the heterogeneous POMDP. Results on the teaching effectiveness and general behaviour of both POMDPs are reported in Section 6.1.

5.3 Teaching Effectiveness and Tutor Restrictiveness for Different Student Types

The DynaLearn POMDP model describes a single learner, however in the real world there are several types of learners. Each type of learner may learn in a different pace or style. Moreover, as described in Section 4 one of the design goals is to have a tutor that is restrictive in interfering with the learner. Therefore both the effectiveness and the restrictiveness of the DynaLearn POMDP is examined for different types of student by means of simulation.

The simulation covers a learning scenario in which the learner attempts to learn about the population model. The following three simulated students are used; a hedged parameter sketch is given after this list:

Weak insecure learner The weak insecure student initially has little knowledge about the population model. Moreover, the learner is not likely to explore potential solutions in DynaLearn, i.e., the learner will wait for the tutor to indicate the next step to take. In addition, the student does not easily acquire knowledge; that is, the learner will often need multiple explanations of the same concept in order to fully grasp it. In terms of behaviour that is (indirectly) observed by the tutor, the weak insecure learner makes mistakes with high probability, is not likely to enter a state of exploration and, when it does, only stays there for a short duration. Moreover, the learner only transitions from a state of no knowledge to a state of complete knowledge with low probability.

Weak confident learner The weak confident learner is similar to the weak insecure learner in that it does not easily learn new concepts. In contrast to the weak insecure learner, the weak confident learner does not shy away from exploring potential solutions to the assignment; however, when exploring, the learner makes mistakes with higher probability.

Strong learner The strong learner also starts with little knowledge about the population model. However, it easily learns new concepts and thus, relative to the other simulated learners, transitions from a state of no knowledge to a state of complete knowledge with high probability. Moreover, the strong learner does not shy away from exploring possible solutions to a model; however, exploring does not change the chance of the strong learner making errors.
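The sketch below encodes the three learner profiles as parameter sets; the parameter names and all probability values are assumptions chosen to reflect the qualitative descriptions above, not the exact parameters of the simulation.

    from dataclasses import dataclass

    @dataclass
    class SimulatedLearner:
        p_learn: float    # P(no knowledge -> complete knowledge) after a teaching action
        p_explore: float  # P(entering a state of exploration) per time step
        p_error: float    # P(making a mistake) while working on the model

    # Illustrative parameter values (assumptions for this sketch).
    WEAK_INSECURE = SimulatedLearner(p_learn=0.2, p_explore=0.1, p_error=0.6)
    WEAK_CONFIDENT = SimulatedLearner(p_learn=0.2, p_explore=0.6, p_error=0.7)
    STRONG = SimulatedLearner(p_learn=0.7, p_explore=0.6, p_error=0.3)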

Using these simulated learners, the behaviour of the teaching policy found by solving the DynaLearn POMDP model is evaluated using the simulation setup described in Section 3.1. The teaching effectiveness is measured by the simulated student's model knowledge over multiple runs, and the tutor restrictiveness is measured by the ratio between intrusive and non-intrusive actions. The results of this evaluation are reported in Section 6.2.
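Both measures can be computed directly from a simulation trace, as in the sketch below; the trace format (a list of per-step records) and the helper names are assumptions made for illustration.

    # Each simulated time step is assumed to be recorded as a dict, e.g.
    # {"action": "Teach(c1)", "intrusive": True, "known_concepts": 3}.
    TOTAL_CONCEPTS = 4  # size of the reduced evaluation model

    def teaching_effectiveness(trace):
        """Fraction of model concepts the student knows at the end of a run."""
        return trace[-1]["known_concepts"] / TOTAL_CONCEPTS

    def tutor_restrictiveness(trace):
        """Ratio of intrusive to non-intrusive actions over a run."""
        intrusive = sum(1 for step in trace if step["intrusive"])
        non_intrusive = len(trace) - intrusive
        return intrusive / max(non_intrusive, 1)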

6 Results

This section presents the results of the evaluation methods described in Section 5. A discussion of the results presented in this section is given in Section 7.

6.1 Tutor Effectiveness when Using Homogeneous Learner Models

The results, averaged over 25 simulation iterations, for both a tutor with a homogeneous learning difficulty model and a tutor with a heterogeneous learning difficulty model are displayed in Figure 17. As shown in Figure 17a, the simulated student acquires knowledge in a similar fashion using either type of tutor, i.e., there is no significant difference in the effectiveness of the tutors using either model. However, the teaching policies used by the two tutors differ in the following respects:

Use of intrusive actions As shown in Figure 17b, the homogeneous tutor is far more likely to use an intrusive action than the heterogeneous tutor. The heterogeneous tutor limits the use of intrusive actions to states in which it believes with high probability that the student has no knowledge about a concept. In contrast, the homogeneous tutor also uses intrusive actions in situations where the student is not in a state of exploration and there is limited belief that the student has no knowledge about a concept.

Aggressive start Although both the homogeneous and the heterogeneous tutor show a preference for intrusive actions in the first twenty time steps, this behaviour is far more apparent for the homogeneous tutor. On average the homogeneous tutor uses twenty intrusive actions in the first twenty time steps, whereas the heterogeneous tutor on average uses only ten.

No use of Teach_meta(mi) and Do_nothing The heterogeneous tutor never selects the Teach_meta(mi) and Do_nothing actions, i.e., it finds it more effective to teach the concepts separately and at the same time has a preference for constantly optimizing the knowledge of the student.

Tutor diversity The teaching policy used by the homogeneous tutor is much more diverse than that of the heterogeneous tutor. This is clearly visible in Figure 17b, as the standard deviation for the homogeneous model is significantly bigger than that of the heterogeneous model, i.e., the homogeneous tutor shows different behaviour in different situations whereas the heterogeneous tutor uses a similar approach independent of the situation.


[Figure 17: Teaching effectiveness of a tutor using a heterogeneous learning difficulty model and a tutor using a homogeneous learning difficulty model. In both cases a simulated student with a heterogeneous learning difficulty model is used. Each experiment is executed 25 times; the translucent colors indicate one standard deviation. (a) Teaching effectiveness of both tutor agents: the y-axis indicates the percentage of the model that the simulated student has knowledge of, the x-axis shows event-based time steps, i.e., one time step per action taken by the tutor. (b) Use of intrusive actions by both tutors: the y-axis represents the number of intrusive actions taken by the tutor, the x-axis indicates the simulation time steps.]


6.2 Teaching Effectiveness and Tutor Restrictiveness for Different Student Types

Figure 18 shows the teaching effectiveness of a tutoring agent using the DynaLearn POMDP for different student types, averaged over 25 iterations. The DynaLearn POMDP is more effective in teaching a strong student than a weak student; however, for both types of weak student it is able to teach up to 80 percent of the model. The time needed to teach the model does differ between weak and strong students: a strong student reaches 80 percent model knowledge on average after twenty time steps, whereas a weak student may need up to 60 time steps to reach the same level of knowledge. There is no significant difference in teaching effectiveness between a weak insecure student and a weak confident student.

Figure 19a shows the number of times a Do_nothing action has been selected after n time steps for each student type, and Figure 19b shows the number of times an intrusive action has been selected after n time steps. Although initially the DynaLearn POMDP treats each student similarly, after twenty time steps differences in its behaviour for different student types emerge. This is most clear in the following aspects:

Strong learner receives fewer teaching actions Figure 19a clearly shows that the number of Do_nothing actions that a strong learner receives is significantly higher than the number of Do_nothing actions that a weak learner receives, i.e., the DynaLearn POMDP is more reluctant to perform a teaching action for strong students.

Less intrusive behaviour when in exploration As shown in Figure 19b, there is a clear difference between the number of intrusive actions received by a weak insecure student and a weak confident student. This indicates that the tutor is less likely to use intrusive actions when a student is often in a state of exploration.

Transition from intrusive to non-intrusive behaviour Initially each student receives mainly intrusive learning actions. However, as can be seen in Figure 20, over time the tutor transitions to selecting more non-intrusive actions. This transition takes place at different moments for different student types. For strong students the tutor reaches a break-even point after about 40 time steps; after this point the tutor mainly selects non-intrusive actions. For a weak confident student this point lies around 90 time steps. The results indicate that a similar break-even point exists for a weak insecure student, however it lies beyond the 150 time steps taken into consideration by this evaluation.

7 Discussion

The DynaLearn POMDP proposed in this study is evaluated using a simulation of the envisioned users. As a simulation does not match the complexity of a real-world tutoring scenario, the results obtained with it need to be interpreted cautiously. Nevertheless, the simulation does provide a realistic indication of the applicability of the approach in a real-world tutoring setting.

As presented in Section 6.1, the teaching effectiveness of the homogeneous POMDP is similar to that of the heterogeneous POMDP. This indicates that not all domain complexity needs to be represented in the POMDP model in order to find effective teaching policies, i.e., the DynaLearn POMDP is robust to changes in the teaching domain.

[Figure 18: Teaching effectiveness of the DynaLearn POMDP based tutor for different types of simulated students (insecure weak student, confident weak student, strong student). The y-axis represents the percentage of model concepts that the student has knowledge of, whereas the x-axis indicates the number of event-based simulated time steps.]

However, the homogeneous POMDP is the result of an iterative design process over a longer period of time, whereas the heterogeneous POMDP is an adjusted version of this homogeneous POMDP. Thus, it can currently not be ruled out that other results may be obtained if a similar design process is followed for the heterogeneous POMDP. Moreover, the results show that the teaching policies found using the two POMDPs differ significantly. The homogeneous POMDP is much more likely to select an intrusive action than the heterogeneous POMDP. This can be attributed to the use of the same reward model for both POMDPs: in the heterogeneous POMDP the expected reward for several teaching actions is lower, because the probability of transitioning to a state of complete knowledge is lower. As the difference in the transition model between the two POMDPs is smaller for the Teach_subtle(ci) action than for other actions, this explains the preference for non-intrusive actions.

Results of both evaluation methods show that the homogeneous POMDP has an aggressive start: initially it mostly uses intrusive actions and only transitions to more subtle actions later in the process. This can be attributed to the definition of the reward model. As a positive reward is only awarded when the student is in a state of complete knowledge for one or more knowledge concepts, the logical behaviour of the homogeneous POMDP is to use actions that optimize the reward as fast as possible. This indicates that another reward function may be more appropriate for an ILE setting in which expectant tutor behaviour is preferred over a steering tutor.
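As an illustration of this point, the sketch below shows a reward function in this spirit: a positive reward is obtained only for concepts in the complete-knowledge state, and an optional penalty on intrusive actions would push the policy towards the more expectant behaviour suggested above. The function signature, magnitudes and penalty term are assumptions made for this sketch and do not correspond to the reward model of Section 4.

    def reward(knowledge, action_is_intrusive, concept_reward=1.0, intrusive_penalty=0.0):
        """Sketch: reward each concept in the complete-knowledge state and
        optionally penalise intrusive actions."""
        r = concept_reward * sum(1 for complete in knowledge.values() if complete)
        if action_is_intrusive:
            r -= intrusive_penalty
        return r

    # Example: two of four concepts known, non-intrusive action taken.
    print(reward({"c1": True, "c2": True, "c3": False, "c4": False}, False))  # 2.0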

The results of the simulated student evaluation indicate that the DynaLearn POMDP is capable of tutoring different types of students. Apart from the initial phase, in which the use of intrusive actions is predominant, each student type is tutored in a way appropriate for that student, i.e., a strong student receives little feedback, the confident weak student is mostly hinted in the right direction and the insecure weak student is mostly steered. Moreover, the transition from intrusive to subtle teaching actions during a tutoring session indicates that
