
Testing children’s theory of mind: suitability of the strategic n-coin game.


Academic year: 2021



O. Alexander Savi

University of Amsterdam

Abstract

The n-coin game is evaluated as a possible measure of Theory of Mind (ToM). The game can be used to assess both first- and second-order reasoning, has high ecological validity, and can be played with readily available equipment. Suitability is assessed using structured interviews and unstructured play behavior. Ratings of verbal expressions in an ecological variant of the game showed that the game is in principle suitable for assessing ToM in 7- to 9-year-old children. In addition, a computerized variant of the game showed that play behavior alone is in principle suitable for assessing ToM in adults. Taken together, both findings strongly suggest that play behavior on the n-coin game alone is suitable for testing children’s ToM.

Keywords: Cognitive Development, Game Theory

Imagine buying a birthday present for your sister-in-law. Would she fancy a book? If so, what kind of book would she prefer? Such reasoning about the mental models of others requires Theory of Mind (ToM). ToM refers to the ability to differentiate between the reasoning (and beliefs, desires, strategies, thoughts, intentions, and so on) of oneself and others. Although methods to assess the development of ToM exist, these have been shown to have numerous shortcomings (e.g., Bloom & German, 2000). Therefore, in the current study the n-coin game (Schwartz, 1959) is explored as a new method to assess ToM and possibly overcome the shortcomings of current methods.

The term ToM was first coined by Premack and Woodruff (1978). In ToM, different orders of reasoning can be distinguished. First, one might not be able to take into account the reasoning of others, and thus only be able to use so-called zeroth-order reasoning.

O. Alexander Savi, Department of Psychological Methods, University of Amsterdam.

Many thanks go out to Han van der Maas (University of Amsterdam) and Maartje Raijmakers (University of Amsterdam) for their supervision of this research project. Thanks to Eric-Jan Wagenmakers (University of Amsterdam) for sharing his thoughts on developing a sophisticated model to explain the use of ToM in the n-coin game. Thanks to Iris Hagen, Dinda Maas, Roy van Oorschot, Thomas Oosterink, and Mariëtte Scholten for their contribution to the project.

Correspondence concerning this report should be addressed to O. Alexander Savi, University of Amsterdam, Department of Psychological Methods, Weesperplein 4, 1018 XA Amsterdam, The Netherlands. E-mail: o.a.savi@gmail.com


Being able to actually take the reasoning of others into account is called first-order reasoning. Of course, one might also take into account that others think about how other people reason, and thus be able to use second-order reasoning. This can be extended virtually without limit, and one can generally speak of higher-order reasoning.

ToM is traditionally measured using a false belief test, first introduced by Wimmer and Perner (1983). To pass such a test the participant needs to understand that although she or he holds a correct belief, someone else may hold a false belief, and thus needs to be able to use ToM. Wimmer and Perner (1983) established, using false belief tests, that the ability to use first-order reasoning emerges in 4- to 6-year-old children. From the age of 6, virtually all children are able to use first-order reasoning. Using comparable methods, the same authors established two years later that many 6-year-old children and most 7- to 9-year-old children are able to use second-order reasoning (Perner & Wimmer, 1985). Baron-Cohen, Leslie, and Frith (1985) transformed the false belief test into the well-known Sally-Anne test.

Although the Sally-Anne test has become and still remains the standard test of ToM, the use of strategic games has more recently received attention. Strategic games have some advantages over the traditional tests (Flobbe, Verbrugge, Hendriks, & Krämer, 2008). One important advantage is that, contrary to false belief tests like the Sally-Anne test, games do not rely on language skills. Also, games are applied tests, in which ToM is not directly called upon, but in which using it gives the player an advantage. Games thus approximate the real-life use of ToM more closely than traditional tests do.

Hedden and Zhang (2002) introduced the use of strategic games in studying ToM most formally. They used a category of games called Stackelberg games to investigate the use of ToM. Stackelberg games are played by two players, in a sequential manner, and with a finite horizon. This category of games has subsequently been studied by, e.g., Flobbe et al. (2008) and Raijmakers, Mandell, van Es, and Counihan (2013), though with mixed success. One major problem with these games is their high dependence on working memory (Raijmakers et al., 2013). De Weerd and Verheij (2011) found that although memory capacity cannot fully explain the results of those games, it does obscure a proper assessment of ToM.

To date, no satisfactory results have been obtained using games, yet the benefits of games in studying ToM remain numerous. The current study therefore explores and evaluates the use of a different strategic game as a means to measure ToM. This game, the n-coin game (Schwartz, 1959), is also known as the bar game Spoof and fits the Stackelberg category.¹ Since the game is actually played for fun, it is assumed to have high ecological validity, something that many methods to study ToM lack. Before possible advantages of this game over the other Stackelberg games are discussed, the game is introduced in detail.

n-Coin game

The n-coin game is typically played by two players. Both players receive n tokens each (n = 2 in Experiment 1 and n = 3 in Experiment 2). Each player takes zero to n tokens and covers them in his or her closed hand, without communicating the exact number of tokens. One of the players now guesses the total number of tokens in both hands. The other player subsequently does the same, with the restriction of not being allowed to guess the same sum as the player that guessed first. After both players have guessed, they open their hands to reveal the correct answer. The player that guessed the correct sum is the winner. The game is played repeatedly, and players take turns in being the first to guess. According to Schwartz (1959), the winner takes one coin from the other player. However, since this would mean that rounds of the game can differ with respect to the distribution of coins, in the current study the winner of a round receives one point, and the player that collects the most points after a fixed number of rounds wins the game.

¹ It is important to note that although the previously mentioned games and the current game both have complete information (i.e., the desires and beliefs of the other player become apparent since previous moves are common knowledge), the n-coin game deviates from the collection of Stackelberg games in that it has imperfect information (i.e., the number of coins in a player’s hand is hidden from the other player).
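The rules of a single round, as described above, can be sketched in a few lines of Python. This is only an illustration of the stated rules, not code from the study; the function name `resolve_round` and its argument names are assumptions made for this example.

```python
def resolve_round(n, hand_first, hand_second, guess_first, guess_second):
    """Return 'first', 'second', or 'tie' for one round of the n-coin game.

    Hypothetical helper: the names and the return labels are illustrative.
    """
    # Each player hides between zero and n tokens.
    for hand in (hand_first, hand_second):
        if not 0 <= hand <= n:
            raise ValueError("each hand must hold between 0 and n tokens")
    # The second guesser may not repeat the first guesser's sum.
    if guess_second == guess_first:
        raise ValueError("the second player may not repeat the first guess")

    total = hand_first + hand_second
    if guess_first == total:
        return "first"
    if guess_second == total:
        return "second"
    # Neither guess matches the true sum, so nobody scores a point.
    return "tie"
```

In the scoring variant used in the current study, a caller would add one point to the winner of each round and compare the totals after a fixed number of rounds.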

The n-coin game allows for the identification of both first- and second-order reasoning. The first player in the game has two strategies: either trying to guess the correct sum and having a chance to win the round, or guessing a necessarily wrong sum in order to deceive the second player and provoke a tie. Deception of the second player necessarily requires first-order reasoning, as one needs to understand that the mind of another person can be manipulated. The second player in turn has two sensible strategies as well: either guessing a sum using the information in the first player’s guess and assuming the first player tries to guess the correct sum, or taking deception into account. Taking deception into account necessarily requires second-order reasoning, as one needs to understand that someone else might understand that the mind of another person can be manipulated.

More explicitly, let us assume a three-coin game in which the first player has a single coin in her hand. She knows the second player can have zero to three coins in his hand. One strategy is to try to guess the right sum; she might thus guess one, two, three, or four. However, another strategy is to deceive the other player and guess an impossible sum, such that the other player won’t guess the right sum. She might thus guess zero, five, or six. The second strategy requires at least first-order reasoning, since the first player needs to be able to reason about the beliefs of the second player. Indeed, the first player might be able to use second-order reasoning or higher; however, it is not possible to identify this. Also, if the first player uses the first strategy, no order can be identified.

Now let us assume that the first player guesses that the sum is five coins and that the second player has zero coins in his hand. The second player has three strategies, of which one is nonsensical. One strategy is to use the first player’s guess to obtain information about the number of coins in her hand. If the second player is only able to use zeroth- or first-order reasoning, he is not able to take deception into account and will necessarily use this strategy, thus guessing two or three. However, another strategy is to take into account possible deception by the first player, thus guessing zero or one. This second strategy requires at least second-order reasoning, since the second player needs to be able to reason about how player one reasons about someone else’s mental content. The nonsensical strategy is to choose four, five, or six, since the second player is certain that these are impossible sums, and thus results in a certain loss. See Appendix A for the identification of all strategies in a two-coin game.
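The worked example above can be reproduced with a small enumeration. The sketch below is an illustration under stated assumptions, not the procedure used in the study: the function name `second_player_options` and the labels `trusting`, `anticipating`, and `nonsensical` are invented for this example. It also removes the first player's guess from every set, because the rules forbid repeating it; this is why five does not appear among the nonsensical options here.

```python
def second_player_options(n, first_guess, own_hand):
    """Split the second player's legal guesses into three sets.

    Hypothetical helper; the set labels are illustrative only.
    """
    all_sums = set(range(0, 2 * n + 1))
    # Hands the first player could hold if her guess were sincere.
    implied_hands = [h for h in range(n + 1) if 0 <= first_guess - h <= n]
    # Sums consistent with both the own hand and a sincere first guess.
    trusting = {h + own_hand for h in implied_hands}
    # Sums consistent with the own hand alone.
    possible = {own_hand + h for h in range(n + 1)}
    # The second player may not repeat the first player's guess.
    legal = all_sums - {first_guess}
    return {
        "trusting": sorted(trusting & legal),
        "anticipating": sorted((possible - trusting) & legal),
        "nonsensical": sorted((all_sums - possible) & legal),
    }


# Three-coin game, first player guesses five, second player holds zero coins.
options = second_player_options(n=3, first_guess=5, own_hand=0)
```

For this call, the trusting guesses are two and three, the deception-anticipating guesses are zero and one, and the remaining legal guesses (four and six) are impossible given the second player's own hand.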

Although a few exceptions exist, scores of the n-coin game can be categorized using general rules. A zeroth-order strategy is a strategy where the first player guesses a sum that is possible given her or his own hand, or where the second player guesses a sum that is possible given her or his own hand and given the guess of the first player. A first-order strategy is a strategy where the first player guesses a sum that is not possible given her or his own hand (i.e., deception). A second-order strategy is a strategy where the second player guesses a sum that is possible given her or his own hand, but not given the guess of the first player (i.e., deception anticipation). An exception to these rules is the situation where the first player already chooses the only possible zeroth-order strategy of the second player, e.g., if the first player guesses four and the second player has two coins, the only zeroth-order choice is already taken and the second player is thus forced to choose between the remaining options. Note that this categorization assumes both comprehension of the basic rules of the game and the understanding that a guess might hold information about one’s hand.
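These general rules translate fairly directly into code. The following is a minimal sketch, assuming the rules exactly as worded above; the function names and the string labels (including `forced` for the exception and `nonsensical` for guesses that are impossible given the own hand) are assumptions made for this illustration and are not taken from the scoring scheme of the study.

```python
def classify_first_guess(n, hand, guess):
    """Label the first player's guess as 'zeroth' or 'first' order."""
    # A sincere guess lies between the own hand and the own hand plus n.
    return "zeroth" if hand <= guess <= hand + n else "first"


def classify_second_guess(n, hand, first_guess, guess):
    """Label the second player's guess given the first player's guess."""
    if guess == first_guess:
        raise ValueError("the second player may not repeat the first guess")
    # Sums that are possible given the second player's own hand.
    possible = set(range(hand, hand + n + 1))
    # Sums that are also possible if the first guess was sincere.
    implied = {h + hand for h in range(n + 1) if 0 <= first_guess - h <= n}
    implied.discard(first_guess)  # that sum is already taken
    if guess not in possible:
        return "nonsensical"
    if not implied:
        # Exception: the only zeroth-order option was taken by the first player.
        return "forced"
    return "zeroth" if guess in implied else "second"
```

For example, `classify_second_guess(2, 2, 4, 3)` corresponds to the exception mentioned above (first player guesses four, second player holds two coins) and is labeled `forced` rather than being assigned an order.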

The use of the n-coin game as a means to study ToM has numerous advantages. One important advantage lies in the use of games in general, as mentioned before. A second advantage is specific to the n-coin game. Although many game-theoretic games lack ecological validity, the n-coin game does not. The game is actually played across the world, and even in tournaments. The game is more commonly known as Spoof or Three coin, or regionally as Bamzaaien, Luciferren, or Handje raden (The Netherlands). Third, both first- and second-order reasoning can be identified using the game. These can also be separated, since the player that proposes the first sum can apply either zeroth- or first-order reasoning, while the player that proposes the second sum can apply either zeroth- or second-order reasoning. Fourth, like many other games, it can be played repeatedly. Finally, the game does not rely on any special equipment and can thus be played with readily available materials.

The game also has disadvantages. For one, although the game does not depend on language skills, it does depend on simple mathematical skills such as counting, summation, and subtraction. The required skills are, however, relatively easy, and can be easily controlled for. Also, although the game can be used to identify first- and second-order reasoning, it cannot be readily used to identify higher orders of reasoning. Nevertheless, identifying first- and second-order reasoning is the first objective of this study. If this objective succeeds, subsequent adjustments might allow for the game to be used to identify higher orders of reasoning. Third, modifiability is an issue. Hedden and Zhang (2002) showed that through continued play of games, children are capable of adapting to the use of second-order reasoning. Nonetheless, they also show that this adaptation is relatively slow and incomplete, and Flobbe et al. (2008) do not find a learning effect. This is, however, something to be cautious of. Finally, working memory might play an important role, just as in the other Stackelberg games. This will be assessed in the current study.

Objectives and Design

The key objective of this study is to investigate whether the n-coin game is suitable for the identification of first- and second-order reasoning with children. Two subordinate objectives can be formulated. The first objective is to assess whether the game allows for a proper assessment of zeroth-, first-, and second-order reasoning. If the answer to this question is affirmative, the second objective is to assess whether the age-related results of the game compare to results from traditional ToM tests. In order for these objectives to be met, (1a) the n-coin game must allow for a categorization of children in the zeroth-, first-, or second-order of reasoning, and (1b) ideally with relative certainty. This categorization should (2a) agree with the natural succession of order of reasoning with age. Ideally (2b) age-differences in order of reasoning correspond with those found by using other ToM tests. Since the study is largely exploratory and includes the development of an instrument, it is difficult to formulate strong hypotheses. I therefore tentatively hypothesize that the n-coin game is suitable for the identification of zeroth- to second-order reasoning. Moreover, I hypothesize that ToM as measured by the n-coin game is related to age.

The current study consists of two experiments. In the first experiment, the different orders of reasoning are identified by having 7- to 9-year-old children play the game and recording their reasoning through structured interviews. These interviews should reveal how children reason in certain situations of the game, and thus whether they actually think of, for instance, deceptive strategies. The audio recordings are used to determine the order of reasoning of a child. Their chosen number of coins and proposed sums are also recorded, which should reveal whether they actually apply, for instance, deceptive strategies. In the second experiment, adult participants play the game against a synthetic player. Only the behavioral data are recorded, and a method is proposed to extract the order of reasoning from the behavioral data.

The age range of roughly 7 to 9 (grades 3 to 5) is selected because of its supposed variation in orders of ToM. Although the exact ages of emergence of the different orders of ToM are subject to debate, Wimmer and Perner (1983) show that first-order reasoning starts to emerge around the age of 4 and further develops in the years that follow. Of the 6- to 9-year-old children, 86% show understanding of first-order reasoning. Perner and Wimmer (1985) show in a follow-up study that second-order reasoning starts to emerge around the age of 6. It is important to stress that these results are obtained from false belief tests, and that thus far no strong relation could be found between order of reasoning and performance on a strategic game (Raijmakers et al., 2013). Moreover, Raijmakers et al. (2013) found that only 50% of 9- and 10-year-old children used first-order reasoning in a strategic game. There might thus be a behavioral gap.

With respect to the reasoning of children in the interviews, the findings of Wimmer and Perner (1983) and Perner and Wimmer (1985) should nonetheless hold. With respect to the actual use of first- and second-order strategies, the age range could prove to be slightly narrow. Nevertheless, it is suspected that the selected age range shows enough diversity for two reasons. First, the interview encourages the children to actively reason about possible strategies, thus eliciting the use of these strategies. Second, the experimenter will use the strategies which may serve as an example for the child.

Also, the relation between the verbal and non-verbal responses is an important issue. It is assumed that subjects that are able to explicate their strategies are also able to use these strategies behaviorally. Whether these subjects actually use these strategies during the game remains to be seen, and moreover the reverse is not necessarily true either. Raijmakers et al. (2013) noticed a related issue: performance on traditional ToM tests, which rely on verbal reasoning, is unrelated to performance on strategic games, which rely on behavioral use of ToM. They argue that although the different tests share a common structure of reasoning, games might require (increased) executive functioning. This could ultimately lead to the behavioral gap mentioned above. Since the literature is sparse concerning this issue, I can only reason that the behavioral gap might not be as big as would be expected from the above finding. The verbal responses in the current study apply to reasoning about the game that is actually played. There is thus a close link between reasoning about the game and the actual play behavior. Nonetheless, the importance of executive functioning for the non-verbal responses should not be underestimated, and it is thus important to be aware of a fragile relation between verbal and non-verbal responses.

I expect that through structured interviews the current order of reasoning of a child can be assessed. I also expect that using these results as a baseline, a quantitative method for the assessment of ToM using solely behavioral data from the game can be devised. Finally, I expect that the results of the n-coin game will only compare to the results from traditional tests with respect to the sequential order of development of orders of reasoning, since the age-range used in the current study does not allow for a full comparison.

Experiment 1

Methods

Participants. Dozens of Dutch elementary schools were invited by e-mail to take part in this study. Two of the willing schools were selected. The participating children were selected by the schools’ representatives. These representatives were each asked to select 48 participants on the basis of (a) access to Math Garden², (b) proficiency in the Dutch language, and (c) a more or less equal distribution across grades 3, 4, and 5 of Dutch elementary school (ages 7, 8, and 9, respectively, at the end of the school year). At one school 46 participants were selected, of which 42 were tested. At the other school 44 participants were selected, of which 39 were tested. The parent(s) or carer(s) of each selected participant were sent an information brochure and a passive informed consent form. Of the participants tested, 42 were male and 39 were female, aged 6 to 10 (µ = 7.56, σ = 1.02) (the age of one participant is excluded, since her or his age was not registered). Participants were rewarded with stickers.

Materials. The experiment consisted of repeated playing of the two-coin game. The experimental design underwent minor adjustments after each day of testing (except for the last two days). These adjustments were made to increase the sensitivity of the game (i.e., increase the true positive rate). In total four slightly different experimental designs were used. Although a single experimental design is described below, the designs varied with respect to, for instance, the structure of the different phases of the experiment, the total number of rounds played, and the number, type, and formulation of the questions in the structured interview. In the analyses no distinction is made between the different experimental designs. First, the adjustments were minor and the designs are argued to be comparable. Second, if there was suboptimal specificity, participants were most likely not miscategorized in a lower order of ToM (i.e., resulting in false negatives), but categorized as ‘unknown order’ and not included in the analyses.

The two-coin game was played as described in the introduction. Only two coins for each player were used to reduce the cognitive load needed to do the required calculations. Participants played the game against an experimenter. The experimenters’ play behavior was strictly standardized in order to eliminate a possible influence of differences in the strategies they used. For each round of each game, the number of coins in the experimenters’ hands and their subsequent guesses (if first player), or the order of their guesses (if second player), were predetermined. This standardization eliminated the possible attribution of differences in ToM to differences in the strategies used by the experimenter.

² Math Garden is a third-party web-based computer-adaptive application for practicing math and monitoring progress, and is used by many Dutch elementary schools. One of the games in Math Garden, the “mole rat task”, was used in this study as an additional measure of working memory.

At fixed moments during the experiment the participants were asked, through structured interviews, to elaborate on, for instance, the rules of the game, possible strategies, or specific play behavior of themselves or the experimenter. These interviews not only encouraged the participants to actively reason about the game in strategic terms, but also enabled the experimenters to systematically determine the participants’ orders of reasoning through the analysis of audio recordings of these interviews.

The experiment was divided into different phases. In each phase the game was played one or more times. Each game consisted of either six rounds or only a single final round. The first phase was designed to familiarize participants with the game, and to ascertain that the rules were well understood. During this phase participants were invited to ask questions concerning the rules of the game. The experimenters were instructed to only use zeroth-order strategies. In the final experimental design, experimenters explicated their reasoning during this first phase.

The following phase was designed to make sure participants got a more advanced comprehension of the game. This phase facilitated participants towards the understanding that the guess of the first player might reveal information about the number of coins in her or his hand. The experimenters were instructed to show this in obvious situations (e.g., by guessing four if the experimenter had two coins in her or his hand).

Subsequently, a phase was designed to create conditions for which deception is obviously viable. This phase consisted mostly of games of which only the last round was played, i.e., the participants were told that they had collected one point more than the experimenter during the first five rounds, and now needed to consolidate their win. The participants were challenged to think of ways to prevent giving away information, e.g., by deceiving the experimenter. The experimenters were instructed to use first-order strategies, which served as an example to the participants.

Finally, a phase was designed to create conditions for which anticipating deception is viable. This phase also consisted mostly of games of which only the last round was played, i.e., the participants were told that the experimenter had collected one point more than the participant during the first five rounds, and that they now needed to prevent the experimenter from winning the game. Basically this was a mirrored version of the previous phase. However, since the participants did not have a strong incentive to use a second-order strategy (i.e., there was no opportunity to win the game, only to prevent the experimenter from winning), they were told that they would receive a sticker for playing a tie (except in the first experimental design). The experimenters were instructed to use first-order strategies in order to challenge participants to anticipate the experimenters’ deception. The experimenters took the role of a second-order player only sporadically, since it fails to serve as an example and actually enhances the advantageousness of a zeroth-order strategy.

Assessment of Theory of Mind. ToM was assessed through ratings of the structured interviews. A scoring scheme was developed to rate the children’s expressions (Appendix B, in Dutch). Initially nine categories of responses were created, which were combined for the analyses to allow for a zeroth-, first-, and second-order category, and a residual ‘unknown order’ category for exclusion. The audio recordings were rated by five experimenters. These coders were assigned to participants in a fully-crossed design (Hallgren, 2012). A subset of the participants was rated by all coders, and the remaining participants were rated by different subsets of the coders, see Table 1.

Table 1

Schematic Assignment of Coders to Participants

Participant subset Coder

A B C D E 1 X X X X X 2 X X 3 X X 4 X X 5 X X 6 X X

Note. ‘X’ denotes the assignment of the coder to the subset of participants.

Six participants were used as a training set during the developmental phase of the scoring scheme. Another three participants were used as a final training set. Participants that were rated by more than two coders were assigned the most frequent rating. If two or more ratings were most frequent, participants were randomly assigned one of those most frequent ratings. Participants that were rated by two coders were randomly assigned one of both coders’ ratings.
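The assignment rule described above amounts to a majority vote with a random tie-break; with two coders, picking a most frequent rating at random is the same as picking one of the two ratings at random. The sketch below illustrates this under that reading. The function name `consensus_rating` and the example labels are assumptions made for this illustration, not part of the scoring scheme.

```python
import random
from collections import Counter


def consensus_rating(ratings, rng=random):
    """Return the most frequent rating, breaking ties at random.

    Hypothetical helper: `ratings` holds one label per coder, and `rng`
    is any object with a `choice` method (e.g., a seeded random.Random).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(ratings)
    top = max(counts.values())
    # All ratings that share the highest frequency.
    candidates = sorted(r for r, c in counts.items() if c == top)
    return rng.choice(candidates)


# Five coders, clear majority.
majority = consensus_rating(["first", "first", "second", "zeroth", "first"])
# Two coders who disagree: one of both ratings is drawn at random.
tie_break = consensus_rating(["zeroth", "second"], random.Random(0))
```

Passing a seeded generator makes the random assignment reproducible, which may be useful when the analysis has to be rerun.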

Controls. Working memory and numeracy were assessed as controls. Working memory was assessed using the forward and backward digit-span tasks from the WISC-III-NL (Kort et al., 2002). Verbal working memory in general, and the digit-span task in particular, is used in studies of ToM both using a false belief test (e.g., Carlson, Moses, & Breton, 2002) and using a strategic game (e.g., Raijmakers et al., 2013). The choice for the digit-span task was motivated by (a) the comparability with other ToM studies that use these tasks, (b) the speed of administration, (c) the similarity in administration with the n-coin game (during the n-coin game, subjects are asked to verbally express their strategies or their reasoning about them), and (d) the use of numbers under 10, which also applies to the n-coin game.

Additionally, working memory was assessed using the “mole rat task”, a game in Math Garden. The game shows a matrix of molehills. In each trial a number of mole rats appear and disappear in a sequential order. The participants need to recount the order of appearance either forward or backward. The size of the field, the number of mole rats, and the sequence in which to recount the appearances (i.e., forward or backward) vary from trial to trial. As the game is computer-adaptive, the difficulty of the game is adapted to the skill of the participant. The ratings for this task of participants that played fewer than 30 trials of the game were removed, since I deem these ratings unreliable.

Numeracy was assessed using six simple summation and subtraction problems containing the numbers 0 to 4, e.g., “1 − 0 = . . . ” and “3 + 1 = . . . ”. Participants that failed on two or more of these problems were excluded from analyses. I argue that categorization of the replies of participants that fail on two or more of these problems is unreliable, since having difficulty with these problems may result in unreliable game behavior.

Exit-interview. Finally, a short survey was conducted. This survey included questions about demographics such as age, gender, grade, and mother tongue, and questions about the game such as familiarity with the game or other similar games, attitude towards the game and stickers, and motivation to win the game and stickers.

Procedure. The research was conducted at the schools. Test areas consisted of empty classrooms, staff rooms, a library, and a gym. Due to the limited number of test areas, some of these were shared by experimenters. The interference of sight and noise was reduced by partitioning off participants. Participants were picked up at the classroom and brought to a nearby test area. One of the experimenters introduced them to the research procedure and explained the rules, along with a demonstration of two example rounds of the game by two other experimenters. Each experimenter then took a participant to her or his test area and the procedure was finished in pairs.

At the test area the experimenter and participant were seated face-to-face at a table. Both the experimenter and participant received a small bag and two buttons. The buttons were placed on the bag, such that the participant could convince her- or himself that both players had the same number of buttons. In the middle of the table was a small sheet showing the number of rounds won by the experimenter and participant. The sheet contained two rows of six boxes, which represented the number of rounds for each player. The player that won a round received a marker in the corresponding box.

The experimenters introduced themselves and explained the procedures of the n-coin game. For each game the participant won, she or he received a big sticker. Different stickers were used for female and male participants. The sticker was put into an envelope with the child’s name on it. Subsequently the digit-span task, numeracy task, and exit-interview were orally administered. Finally, each participant received an additional five small stickers in her or his envelope and was told that she or he would receive the envelope from the teacher at the end of the school day.

The research was conducted in five days spread over two weeks. The complete procedure for each participant took 30 to 60 minutes, depending on the participant. Roughly 30 rounds of the game were played within this time-frame.

The mole rat task was administered in the two weeks after the two test weeks. The schools’ representatives were sent instructions on the task and were asked to distribute these instructions to the teachers. Teachers were instructed to only instruct participants with access to Math Garden and permission to link the results of the n-coin game to the results of the mole rat task. Participants were instructed to play the mole rat task two times for 10 minutes each. Since only a small number of participants played the mole rat task during the first week, the schools’ representatives were instructed to remind the teachers of the task at the end of the first week.

Results

In total 81 participants were tested. Of those participants nine were used in the training sets, eight had at least two wrong answers on the numeracy task, four did not run the full experiment (e.g., controls and exit-interviews were not administered), and one participant’s age was not recorded. Since some participants fell in more than one of these categories, the structured interviews of in total 65 participants could be rated. None of the participants were excluded on the basis of the exit-interview. However, one more participant was excluded if the ratings by four of five coders were used, or three more participants were excluded if the ratings by all five coders were used. Since these participants could not be reliably rated, they were excluded from the analyses. Starting from the ratings by four of five coders, 32 were male and 32 were female, aged 6 to 10 (µ = 7.50, σ = 1.01). Gender was equally distributed across schools (χ²(1) = 0, p = 1), grades (χ²(2) = 2.50, p = .286), and age (χ²(4) = 2.79, p = .594).

The parent(s) or carer(s) of two participants refused the connection of the results from the experiment with the results from the (third-party) mole rat task. In total 23 participants played the mole rat task. Of those participants three were used in the training sets, and four played the game fewer than 30 times. Since some participants fell in more than one of these categories, the results of in total 17 participants could be used with respect to this measure. Unfortunately, this number is too small to use as a control, and therefore only the correlation between the two measures of working memory was assessed. Both measures are normally distributed according to the Shapiro-Wilk normality test. The digit-span task was positively related to the mole rat task, r = .53, p = .029.

Prior to the main analyses the inter-rater reliability was assessed. Twenty participants were rated by all coders, and each coder rated an additional 18 participants. Half of those additional participants overlapped with one other coder, and the other half overlapped with another coder (see Table 1). Krippendorff’s α (see Table 2) and different variants of κ (see Table 3) were used to judge inter-rater reliability. Krippendorff’s α was chosen for its high flexibility. It can deal with any number of coders, all levels of measurement, and missing data, thereby avoiding some of the problems of other measures of inter-rater reliability (Hayes & Krippendorff, 2007). Nonetheless, the more frequently used κ is also reported to increase comparability with other studies. Besides the standard Cohen’s κ, the more conservative Byrt’s κ and Siegel’s κ are reported, since those correct for prevalence and bias, respectively.

Both Tables 2 and 3 show the reliability measures for all original categories, as well as for a reduced number of categories. The reliability measures for all original categories were included as a measure of the whole instrument. However, only the pooled categories are of current interest, since the original categories within each pooled category are equivalent with respect to the goals of this experiment. These pooled categories are ‘unknown order’ (original categories 1 to 4), ‘zeroth-order’ (original categories 5 and 6), ‘first-order’ (original categories 7 and 8), and ‘second-order’ (original category 9). Moreover, both tables also show the reliability measures of the ratings by only four of five coders. The ratings of one of the five coders were excluded, since they substantially decreased reliability.
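The pooling of categories and the pairwise agreement statistic can be illustrated with a minimal Python sketch. It is not the analysis code used in this study: the function names (`pool`, `cohen_kappa`, `light_kappa`) and the example ratings are hypothetical, and only the unweighted Cohen's κ and its mean over coder pairs (Light's κ) are shown.

```python
from itertools import combinations
from collections import Counter

# Pooling of the nine original categories into the four categories of interest,
# as described in the text (1-4 unknown, 5-6 zeroth, 7-8 first, 9 second).
POOLED = {1: "unknown", 2: "unknown", 3: "unknown", 4: "unknown",
          5: "zeroth", 6: "zeroth", 7: "first", 8: "first", 9: "second"}

def pool(ratings):
    """Map original ratings (1-9) onto the pooled order-of-reasoning labels."""
    return [POOLED[r] for r in ratings]

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two coders rating the same participants."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / n ** 2
    return (observed - expected) / (1 - expected)

def light_kappa(coders):
    """Arithmetic mean of Cohen's kappa over all coder pairs (Light's kappa)."""
    pairs = list(combinations(coders, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)

# Hypothetical ratings of eight participants by three coders (illustration only).
coder_1 = [9, 9, 5, 7, 6, 9, 2, 8]
coder_2 = [9, 9, 6, 7, 5, 9, 3, 7]
coder_3 = [9, 8, 5, 8, 6, 9, 1, 8]

kappa_original = light_kappa([coder_1, coder_2, coder_3])
kappa_pooled = light_kappa([pool(c) for c in (coder_1, coder_2, coder_3)])
print(round(kappa_original, 2), round(kappa_pooled, 2))
```

In this made-up example, disagreements within a pooled category (e.g., 5 versus 6) no longer count as disagreements after pooling, which is why the pooled κ is higher than the original one.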

Finally, Table 2 shows the lower and upper boundaries of the bootstrapped 95% confidence intervals of Krippendorff’s α. The intervals were obtained using the R package kripp.boot (Gruszczynski, 2013).

If the original categories are retained and all participants included (thus both participants that were rated by two raters and participants that were rated by all five raters), a poor α of .61 is obtained (Krippendorff, 2004). However, taking into account the more meaningful collapsed categories, and excluding one coder’s ratings, a moderate α of .79 is obtained. With respect to the κ’s, which only cover the participants rated by all coders, mostly moderate to high inter-rater reliabilities were found.


Table 2

Inter-Rater Reliabilities: Krippendorff’s Alpha

| | Krippendorff’s α | Lower | Upper |
|---|---|---|---|
| All coders, all participants, original ratings | .61 | .47 | .72 |
| All coders, all participants, aggregated ratingsᵃ | .68 | .55 | .80 |
| Four codersᵇ, all participants, original ratings | .67 | .53 | .79 |
| Four coders, all participants, aggregated ratings | .79 | .66 | .92 |

ᵃThe ratings 1 to 4, 5 and 6, and 7 and 8 were taken together, such that four new ratings arose (i.e., exclusion, zeroth-order, first-order, and second-order).

ᵇThe ratings of one of the five coders were excluded, since these substantially decreased the reliability.

Table 3

Inter-Rater Reliabilities: Kappa (Cohen’s, Byrt’s, and Siegel’s)ᵃ

| | Cohen’s κ | Byrt’s κ | Siegel’s κ |
|---|---|---|---|
| All coders, participants subsetᵇ, original ratings | .86 | .72 | .77 |
| All coders, participants subset, aggregated ratingsᶜ | .88 | .76 | .80 |
| Four codersᵈ, participants subset, original ratings | .84 | .68 | .74 |
| Four coders, participants subset, aggregated ratings | .87 | .73 | .77 |

ᵃFor each variant of kappa, the arithmetic mean of all coder-pair kappas was calculated to allow for more than two coders (i.e., Light’s kappa). Byrt’s kappa corrects for prevalence, and Siegel’s kappa corrects for bias.

ᵇOnly participants that were rated by all coders (i.e., all four or all five) were included.

ᶜThe ratings 1 to 4, 5 and 6, and 7 and 8 were taken together, such that four new ratings arose (i.e., exclusion, zeroth-order, first-order, and second-order).

ᵈThe ratings of one of the five coders were excluded, since these substantially decreased the reliability.

Since a moderate inter-rater reliability could be established, the ratings are used to test the effect of age while controlling for working memory. Table 4 shows the distribution of ToM and age, for both four coders and five coders. The main analyses were run on the children that were given a definite estimate of their ToM, i.e., zeroth-, first-, or second-order. Children that could not reliably be rated, i.e., ‘unknown order’, were excluded from the analyses. The ratings by the four instead of five coders were used, since those ratings had the highest reliability. A multinomial logistic regression with forced entry method was used to test for age and working memory. Table 5 summarizes the results. The analyses are followed by a robustness check on the data of all five coders, and excluding the single 10-year-old participant.

Table 4

Frequencies of Order of Reasoning for Each Age Group with Four and Five Coders

| Order of reasoning | Age 6 | Age 7 | Age 8 | Age 9 | Age 10 | Total order |
|---|---|---|---|---|---|---|
| Four coders | | | | | | |
| Unknown order | 1 | 0 | 0 | 0 | 0 | 1 |
| Zeroth-order | 6 | 7 | 6 | 2 | 0 | 21 |
| First-order | 1 | 3 | 3 | 1 | 0 | 8 |
| Second-order | 3 | 15 | 8 | 8 | 1 | 35 |
| Five coders | | | | | | |
| Unknown order | 2 | 1 | 0 | 0 | 0 | 3 |
| Zeroth-order | 5 | 7 | 7 | 2 | 0 | 21 |
| First-order | 1 | 3 | 2 | 1 | 1 | 8 |
| Second-order | 3 | 14 | 8 | 8 | 0 | 33 |
| Total age | 11 | 25 | 17 | 11 | 1 | 65 |

Table 5 shows that no multicollinearity is detected, since all standard errors of the predictors are below 2. Age and working memory explain a significant amount of variability in the orders of ToM (LRT = 12.83, p = .012). The effect size is small, however (McFadden R² = .10). Unfortunately, neither age nor working memory predicted whether a participant had first-order rather than zeroth-order ToM (b = .47, p = .30, and b = −.28, p = .23, respectively). Nevertheless, working memory did predict whether a participant had second-order rather than zeroth-order ToM, b = .30, p = .048. The odds ratio of 1.35 shows that as working memory increases, participants are more likely to use second-order strategies than zeroth-order strategies. Although age shows a comparable odds ratio (1.40), it did not predict whether participants had second-order or zeroth-order ToM, b = .34, p = .290, due to the high standard error.
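The relation between a regression coefficient, its standard error, and the reported odds ratio can be sketched as follows. This is an illustration in Python, not the analysis code of this study. It assumes Wald-type intervals and tests, which the text does not state, and it starts from the rounded coefficients, so the results can differ slightly from the reported values. The function name `wald_summary` is hypothetical.

```python
import math

def wald_summary(b, se, z_crit=1.96):
    """Odds ratio, Wald 95% interval, and two-sided p value for a logit coefficient."""
    z = b / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return {
        "odds_ratio": math.exp(b),
        "lower": math.exp(b - z_crit * se),
        "upper": math.exp(b + z_crit * se),
        "z": z,
        "p": p,
    }

# Rounded coefficients reported for second-order vs zeroth-order reasoning.
working_memory = wald_summary(b=0.30, se=0.15)
age = wald_summary(b=0.34, se=0.32)

for name, result in (("working memory", working_memory), ("age", age)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

The two odds ratios are close to each other, but the standard error for age is about twice as large, so its interval comfortably includes 1 while the interval for working memory only just excludes it.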

To see whether these results are robust, the same analyses are run on the data of all five coders, the data with gender included, and the data with the only 10-year-old participant excluded. If the 10-year-old participant is excluded, the model still explains a significant amount of variance (LRT = 13.29, p = .010), and the effect of working memory becomes slightly stronger (p = .037). With five coders the model does not fit, neither with the 10-year-old participant included (LRT = 9.19, p = .057) nor with the 10-year-old participant excluded (LRT = 7.65, p = .105). If gender is included in the model, it still explains a significant amount of variance (LRT = 14.34, p = .026), however gender itself does not predict ToM.

Table 5

Multinomial Logistic Regression of Age and Working Memory on Order of Reasoning

| | B (SE) | Lower | Odds Ratio | Upper |
|---|---|---|---|---|
| First-order vs zeroth-order | | | | |
| Intercept | −1.76 (3.60) | | | |
| Age | 0.47 (0.45) | 0.66 | 1.60 | 3.90 |
| Working memory (digit-span) | −0.28 (0.23) | 0.48 | 0.76 | 1.19 |
| Second-order vs zeroth-order | | | | |
| Intercept | −5.17 (2.50)* | | | |
| Age | 0.34 (0.32) | 0.75 | 1.40 | 2.62 |
| Working memory (digit-span) | 0.30 (0.15)* | 1.00 | 1.35 | 1.82 |

Discussion

The main objective of this experiment was to investigate whether the n-coin game is suitable for identifying first- and second-order reasoning in children. By rating children’s expressions in structured interviews, the game indeed allows for a categorization of children into zeroth-, first-, and second-order reasoning. Also, this categorization can be obtained with moderate reliability. Unfortunately, in its current form it does not yet allow for all children to be categorized into one of the orders of reasoning. Also, the categorization does not agree with the natural succession of orders of reasoning with age, i.e., the order of reasoning was not found to increase with age. However, the n-coin game does identify second-order reasoning at the age of six, which agrees with findings on the Sally-Anne test (Perner & Wimmer, 1985) and strategic games (Flobbe et al., 2008).

Contrary to expectations, the number of children capable of higher-order reasoning does not increase with age. I argue that the age range or number of participants used in this study is inadequate for the purpose of comparing age differences. The age range was specifically chosen for a rich variety of orders of reasoning, which is to be expected for 6- to 9-year-old children. With such high variety, only a relatively large sample might be sufficient to detect differences in age. With a relatively small sample size such as in this study, it might have been safer to choose a wider range, for instance by selecting 4-, 7-, and 10-year-old children. With the above in mind, I am cautious about concluding that age does not predict the order of reasoning in the n-coin game. It is however safe to conclude that the age range of 6 to 9 is indeed important in the development of ToM.

Although the number of children capable of higher-order reasoning was not found to increase with age, it was found to increase with working memory. This agrees with findings by Flobbe et al. (2008) and Raijmakers et al. (2013) that working memory to an important degree predicts ToM. Nonetheless, I am again cautious about giving this finding too much weight. First, it only seems to predict an increase from zeroth- to second-order reasoning, and even then the effect size is low. Second, the effect might even disappear if participants get the opportunity to play the game more frequently. After all, the limited time available in the current experiment, and thus the limited opportunity for the children to gain experience, possibly allowed mainly children with a relatively large working memory capacity to fully explore all possible strategies.

Another thing to note is the failure to give each child a definitive rating of order of reasoning. Since the ratings depend purely on verbal expressions, some children with difficulties expressing themselves cannot be reliably rated. For instance, one child did not answer any of the questions, and more frequently children were able to provide a correct answer but unable to give any substantiation. However, with respect to the current study this problem does not pose a serious threat, since it pertains to only a few participants (see Table 4). Moreover, it is a problem of the verbal assessment of ToM, not of the game itself. The ultimate goal, assessing ToM through the behavioral responses in the game, does not suffer from this problem.

A few other issues and observations are also noteworthy. First of all, the sensitivity of the test might increase with age. This applies to a lesser extent to the behavioral strategies, but to a larger extent to the structured interviews. Some of the questions might be more difficult to understand for younger children, and might also be harder to answer. Although much effort was put into increasing the sensitivity of the experiment, it is possible that the number of higher-order children is biased for the lower age groups. The fact that most “unknown orders” are in the lower age groups reflects this (see Table 4). Nonetheless, this has no consequences for the results of the current study, since a higher sensitivity would only weaken an effect of age, and already no effect of age was found.

Then there are a couple of observations made during the experiments. For instance, one child was observed to explain deception very well, both theoretically and in terms of the game, but consistently did not show it behaviorally during the game (the reverse is possible as well). This could pose a serious threat to the use of solely the behavioral strategies to determine the order of ToM, since although higher-order reasoning might be present, it might not always be shown. Although the child could not indicate why he or she did not use the strategy, it might be argued that a sense of unfairness associated with deception could obstruct its use. Also, the incentive might not be sufficient or the goals of the game might not be clear enough. In Experiment 2 extra care was put into the communication of clear goals and a proper incentive. Moreover, since the game is played against a synthetic participant, feelings of unfairness might play less of a role.

Another child was observed to use the first-order deception strategy in the beginning of the experiment but somehow stopped using it. Like the previous issue, this one could also pose a threat to the use of solely the behavioral strategies to determine the order of ToM. It might be that the child just fooled around during the first games and used the strategy by accident, but she or he also might have somehow unlearned to use the strategy. The first option does not pose a threat, since it is expected that a child that fools around also uses nonsensical strategies and therefore will not be assigned the first- or second-order reasoning category with confidence. If the child unlearned the strategy, one possibility is that the design of the current experiment somehow taught the child not to use the deception strategy. This is however unlikely, since different phases of the experiment were actually designed to create the perfect conditions to use deception, and children were actively encouraged to think about the game in such strategic terms. Another possibility is that the child could not find a way to use deception to a strategic advantage. This might be a more serious problem, and is discussed in the discussion section of Experiment 2.

A few changes and additions to the experiment are proposed that might pay off in future studies. Although ultimately the n-coin game should be capable of assessing ToM through solely observing the strategic choices during the game, a measure of verbal skill should be included if ratings from structured interviews are used again. Although during the experiment it was observed that most children within the age range were capable of expressing their thoughts about strategies in the game, it was also observed that some had difficulties expressing everything they wanted to share with the experimenter. Also, inhibition could be added as a control. Inhibition has been shown to be an important aspect of ToM development (e.g., Carlson, Moses, & Breton, 2002; Carlson, Moses, & Claxton, 2004). Although its importance is based on false belief tests, it is likely that its role is just as important in strategic games. First, the dominant response in the n-coin game is likely a zeroth-order strategy. To use a first- or second-order strategy, a subject needs to inhibit this response and choose a strategy that from a zeroth-order point of view results in a sure loss. Second, delayed gratification, which is also related to inhibition, might play a role in enabling a deceptive strategy, since a deceptive strategy results in a sure short-term loss, but might result in a long-term win. Finally, loss aversion and risk aversion might be influential.

Working memory and numeracy, the controls that were included in the current design, could be administered prior to the n-coin game instead of afterwards. Working memory, and to a lesser extent numeracy, might have been influenced by the game, since both were drawn upon during the game. Although the effect is not considered large, it is possible that if numeracy is assessed beforehand, fewer children would be excluded for low numeracy scores. Although it is unclear what the effect for working memory would be, it will certainly serve as a better control. Finally, a comparison with other ToM tests could be added. Although different tests of ToM have been shown to barely associate (Raijmakers et al., 2013), it would be informative to compare results from the n-coin game with ToM ratings obtained by, for instance, a false belief test or a different strategic game.

Summarizing the above discussion, this experiment has successfully shown the suitability of the n-coin game in testing children’s ToM. In most cases, the children’s order of reasoning can be reliably rated from their expressions on the strategic play of the game. Although no differences with respect to age were found, I expect that a wider range of ages or a larger number of participants will reveal these differences. Moreover, although working memory was shown to predict (second-order) ToM, the effect size is low and the effect might even disappear if participants get the opportunity to play the game more frequently.

Since the ultimate goal is to determine the order of reasoning solely from the behavioral strategies in the game (which would enable all advantages of a strategic game), a second experiment was done in order to develop a method for analyzing the behavioral data. To obtain a large amount of data, a synthetic opponent was created, and adult participants were asked to beat this opponent. Adult participants were used for two reasons. First, the implementation of the synthetic opponent was not suitable for young children, not only because it was a simple text-based version of the game, but also because it required prolonged attention. Second, since adults are assumed to be able to use second-order strategies, it could be tested whether this computerized method indeed allowed the determination of second-order ToM (i.e., for this method to be suitable it is necessary that it correctly identifies all participants as being able to use second-order ToM). The next section discusses the experiment and its results.

Experiment 2

Methods

Participants. Five male and two female participants, aged 23 to 29 (µ = 26.29, σ = 2.43), were recruited to play the n-coin game against a synthetic player. No selection criterion was used. All participants were highly educated (six had finished a university degree and one had finished an applied university degree). Participants were rewarded with sweets.

Materials. The three-coin game as described in the introduction was implemented in the free and open-source statistical programming software R 3.0.0 (R Core Team, 2013). The R scripts can be found in Appendices C, D, and E and on the author’s website (www.alexandersavi.nl), and are made available under GPL ≥ 3. In this text-based R version of the game the synthetic opponent was called Sally. Sally was programmed to use zeroth-order strategies if, during a game, she had collected as many points as the participant or fewer. These strategies are indicated by a ‘0’ in Tables A1 and A2 in Appendix A. If Sally had more points than the participant during a game, she was programmed to use a first-order strategy (deception) with probability P = 1. These strategies are indicated by a ‘1’ in Table A2 in Appendix A. Sally was programmed this way for two important reasons. First, since Sally never used a second-order strategy (anticipating deception), it was always beneficial for the participant to use the first-order strategy of deception. Second, since Sally used first-order strategies at predictable moments with probability P = 1, it was also beneficial for the participant to use the second-order strategy of anticipating deception at predictable moments. Implementing Sally this way created the conditions for the participants to actually use first- and second-order strategies, since using those adequately gave them a strategic advantage.
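Sally's decision rule, as described above, can be sketched in a few lines. This Python sketch is an illustration only and is not the R implementation from the appendices. The function name `sally_strategy_order` is hypothetical, and the actual guesses that belong to each order (Tables A1 and A2) are not reproduced here.

```python
def sally_strategy_order(sally_points, participant_points):
    """Return the order of strategy Sally uses at this point in a game.

    Zeroth-order (0) when Sally has as many or fewer points than the
    participant, first-order (1, deception with probability 1) when she is ahead.
    """
    return 1 if sally_points > participant_points else 0

# Sally never plays a second-order strategy, whatever the score.
orders = {(s, p): sally_strategy_order(s, p) for s in range(4) for p in range(4)}
print(sorted(set(orders.values())))
```

Because the rule depends only on the current score, a participant who tracks the points can predict when Sally will deceive, which is what makes anticipating deception worthwhile.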

Pilots of this experiment showed that participants had difficulty engaging in strategic thinking by solely playing the game. Effort was therefore put into very clear instructions, and multiple nudges were implemented to help participants think strategically about the game. First, an incentive to seriously play the game and try to beat Sally was introduced: each time the participant won a game, she or he was awarded a sweet of her or his choice (i.e., chocolate or liquorice candy). Second, Sally’s first-order reasoning served as an example for the participant of how to strategically use deception. Third, clear instructions with clear goals were included, and could be consulted at any time during the experiment. Fourth, a number of control questions were administered before the actual start of the games against Sally. Besides questions about comprehension of the rules of the game, a question was added to see whether the participant understood in what situation a tie could be beneficial. Finally, the game included in-game encouragements and subtle hints, such as “You have more points than Sally, try to keep it that way!”, which also subtly shows that both a tie and winning that round would be beneficial.

The experiment consisted of 20 games of eight rounds each. At the start of each game the player who could make the first guess was assigned randomly. In each following round the players took turns in making the first guess. If Sally was the first to guess, the participant was not allowed to guess the same sum. If this nonetheless happened, a warning message was shown and the participant was asked to make a new guess. At the beginning of each round the participant was reminded of the number of points gained so far. At the end of each game the participant was reminded of the number of times she or he had won against Sally.
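The turn order described above can be sketched as follows. This is a Python illustration and not the R implementation used in the experiment. The names `first_guessers` and `guess_is_allowed` are hypothetical, and the fixed seed only serves to make the example reproducible.

```python
import random

N_GAMES, N_ROUNDS = 20, 8

def first_guessers(rng, n_rounds=N_ROUNDS):
    """Who guesses first in each round of one game: random start, then alternating."""
    players = ["participant", "Sally"]
    start = rng.randrange(2)
    return [players[(start + r) % 2] for r in range(n_rounds)]

def guess_is_allowed(guess, earlier_guess=None):
    """The second guesser may not repeat the sum already guessed in this round."""
    return earlier_guess is None or guess != earlier_guess

rng = random.Random(1)  # fixed seed only to make the illustration reproducible
schedule = [first_guessers(rng) for _ in range(N_GAMES)]
participant_first = sum(game.count("participant") for game in schedule)
print(participant_first)
```

Whatever the random start, alternation over eight rounds means each player guesses first in four rounds per game, so under this schedule the participant guesses first in 80 of the 160 rounds and second in the other 80.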

Assessment of Theory of Mind. The strategies the participants used were extracted from the observed data using Tables A1 and A2. Since I assume that first-order strategies can only be used if one has first-order ToM or higher and second-order strategies can only be used if one has second-order ToM or higher, the observed relative frequencies of those strategies give a first indication of the order of reasoning of a participant. In order to make sure that the frequent use of first- and second-order strategies cannot be attributed to ignorance of the rules or inattentive play, the use of nonsensical strategies is also observed. I assume that when nonsensical strategies are observed, the frequent use of first- and/or second-order strategies does not indicate ToM.


Maximum likelihood estimation (MLE) was used to assess whether the observed relative frequencies of using first-order, second-order, or nonsensical strategies differed from zero significantly. Given the number of coins in the participant’s hand and, if applicable, the guess of the opponent, the probabilities for using either a first-order, second-order, or nonsensical strategy are calculated given a theoretical model (this model is defined later). The discrepancy between the prediction of the model and the actual strategy of a participant is quantified by the negative log-likelihood (−LL). The −LL is calculated by first recording, for each round $t$, the theoretical probability $P_{s_t,t}$ that corresponds to the strategy $s_t$ the participant actually used. Subsequently the logs of the recorded probabilities of each round are taken, summed together, and the result is multiplied by −1 (Equation 1).

$$-LL = -1 \cdot \sum_{t} \log P_{s_t,t} \tag{1}$$

The smallest −LL corresponds to the least discrepancy. Discrepancies are minimized using the R package bbmle (Bolker & Team, 2012). By comparing likelihood estimates with and without the estimation of one of the parameters, the significance of that parameter is assessed (i.e., if the prediction of the participants’ strategies does not significantly improve if for instance the probability of using a second-order strategy is estimated rather than set to zero, I argue that the participant cannot be reliably determined to be able to use second-order strategies).
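Equation 1 and its minimization can be illustrated with a minimal Python sketch for a single strategy with one probability parameter. It is not the R code used in this study: the crude grid search stands in for the optimizer in bbmle, and the names (`neg_log_likelihood`, `rounds`, `p_hat`) and the counts (29 uses in 80 rounds) are hypothetical.

```python
import math

def neg_log_likelihood(p, strategy_used):
    """-LL = -sum_t log P_{s_t,t} for a one-parameter model.

    p is the probability of using the strategy in a round; strategy_used is a
    list of booleans, one per round, marking whether the strategy was observed.
    """
    total = 0.0
    for used in strategy_used:
        prob = p if used else 1.0 - p
        if prob <= 0.0:
            return math.inf  # an observed event with probability zero
        total += math.log(prob)
    return -total

# Hypothetical participant: strategy observed in 29 of 80 rounds.
rounds = [True] * 29 + [False] * 51

# Crude grid search over p; the study used the R package bbmle instead.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = min(grid, key=lambda p: neg_log_likelihood(p, rounds))

nll_free = neg_log_likelihood(p_hat, rounds)   # parameter estimated
nll_fixed = neg_log_likelihood(0.0, rounds)    # parameter fixed to zero
print(p_hat, nll_free, nll_fixed)
```

The grid minimum lies next to the relative frequency 29/80. Note that with the parameter fixed at exactly zero, the −LL becomes infinite as soon as the strategy is observed once.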

A few things should be noted about the above method. First, the theoretical model used to obtain the probabilities for each of the strategies is the most trivial conceivable. The parameters of this model are assumed to be equal to the probabilities that it produces for each strategy. The estimated probabilities thus correspond to the relative frequencies of the different observed strategies. The probability of using a first- or second-order strategy is however naturally different from the probability of a participant being able to use first- or second-order ToM. A more sophisticated model is discussed in the discussion section. However, since the current method does not intend to explain how the strategies of participants come about, and since first- and second-order strategies can theoretically only be used if one has first- or second-order ToM, the current trivial model is deemed appropriate. Second, the estimated probability (or corresponding relative frequency) of using a first-order strategy is conditional on the participant being the first one to guess the sum of coins. Likewise, the estimated probability of using a second-order strategy or nonsensical strategy is conditional on the participant being the second one to guess the sum of coins. Finally, as explained previously, in the current study I assume that participants that do not use nonsensical strategies, but do use first-order or second-order strategies, can be categorized accordingly.

Exit-interview. Finally a short survey was conducted. This survey included questions about demographics such as birth date, gender, and educational level, and questions about the game such as familiarity with the game or other similar games, possible applicable strategies in the game, and observations about the play behavior of Sally.

Procedure. After recruitment the participants were seated at a table with a laptop, the instructions and control questions, the exit-interview, and different types of sweets. The participants were asked to read the instructions carefully and subsequently answer the control questions. The answers to the control questions were immediately checked by the experimenter, and possible misunderstandings were discussed with the participant, until it was certain that the participant understood all rules and had no further questions. The participants were then asked to play the game, and were told that they could consult the instructions at any time during the game. The participants were instructed to try to win as many games against Sally as possible. The game was played from the console of RStudio (Racine, 2012). The experiment consisted of 20 games of eight rounds each. In total the experiment took about 60 minutes per participant.

After the participants finished the experiment they were asked to fill in the exit-interview. They were then informed about the purpose of the study and were rewarded with a sweet for each game they had won.

Results

Maximum Likelihood Estimation. Table 6 shows both the relative frequencies of the different strategies as observed in the participants’ data, and the maximum likelihood estimates of the probabilities of those strategies. The reported standard errors result from a quadratic approximation of the curvature at the maximum likelihood estimate (Bolker & Team, 2012). The z values and p values that indicate whether each strategy significantly increases the accuracy of the prediction (i.e., whether the estimates differ from zero) are based on the standard errors and the assumption that the likelihood functions are quadratic (Bolker & Team, 2012). Notice that, as a proof of method, the relative frequencies and likelihood estimates correspond.
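For the trivial one-parameter model, the quadratic approximation has a closed form, which the following Python sketch illustrates. It is not the bbmle output of this study. The function name `wald_test` is hypothetical, and the counts are assumptions: 80 relevant rounds per participant (20 games of eight rounds with alternating first guess), with 66 and 1 observed uses chosen to correspond to relative frequencies of .825 and .013.

```python
import math

def wald_test(k, n):
    """MLE, quadratic-approximation SE, z value, and two-sided p for a proportion."""
    p_hat = k / n
    # The curvature of -LL at the maximum is n / (p(1-p)); the SE is its
    # inverse square root.
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    z = p_hat / se if se > 0 else float("nan")
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return p_hat, se, z, p_value

# Assumed counts: 66 first-order choices in 80 rounds with the first guess.
p_hat, se, z, p_value = wald_test(66, 80)
print(round(p_hat, 3), round(se, 3), round(z, 3))

# Assumed counts: a strategy observed once in 80 rounds.
rare = wald_test(1, 80)
print(round(rare[0], 3), round(rare[1], 3), round(rare[2], 3), round(rare[3], 3))
```

Under these assumed counts the sketch gives values close to the corresponding rows of Table 6, which suggests (but does not establish) that the reported tests are Wald tests of this kind.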

As expected, all participants significantly use second-order strategies. Also, none of the participants use nonsensical strategies, thus the use of second-order strategies need not be attributed to ignorance of the rules or inattentive play. As I assume that all participants have second-order ToM, this method correctly identifies the order of reasoning of each participant. At the same time, one participant fails to significantly use first-order strategies. On the one hand, this does not pose an immediate problem, since the participant is nonetheless correctly identified. It does however reveal that although ToM is required, it is insufficient to understand the benefits of deception. In the discussion it is argued that some form of strategic reasoning is necessary in addition to ToM.

Development of Strategy Use. Another approach is to assess the development of strategy use over time. Figures 1 and 2 show, for each participant, the relative frequencies of the different strategies for four bins of five games each. As participants gradually learn to play the game, and get to know their synthetic opponent, the use of first- and second-order strategies is expected to increase, while the use of nonsensical strategies is expected to either decrease or be absent altogether.
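The binning used for Figures 1 and 2 can be sketched as follows. This Python sketch is an illustration, not the code that produced the figures. The function name `binned_frequencies` and the example data are hypothetical.

```python
def binned_frequencies(used_by_game, n_bins=4):
    """Relative frequency of a strategy per bin of consecutive games.

    used_by_game[g] holds one boolean per relevant round of game g
    (e.g., the rounds in which the participant guessed first).
    """
    bin_size = len(used_by_game) // n_bins
    frequencies = []
    for b in range(n_bins):
        games = used_by_game[b * bin_size:(b + 1) * bin_size]
        rounds = [used for game in games for used in game]
        frequencies.append(sum(rounds) / len(rounds))
    return frequencies

# Hypothetical participant who adopts the strategy gradually over 20 games,
# with four relevant rounds per game.
example = ([[False] * 4] * 5 + [[True, False, False, False]] * 5
           + [[True, True, False, False]] * 5 + [[True, True, True, True]] * 5)
freqs = binned_frequencies(example)
print(freqs)
```

With 20 games and four bins, each bin covers five games (1:5, 6:10, 11:15, 16:20), and a learning participant would show rising frequencies across the bins, as in this made-up example.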

Figures 1 and 2 show that nonsensical strategies are almost non-existent. It is however interesting to notice that four out of seven participants did use the strategy in the last 10 games they played. The use of first-order strategies was substantial for most participants, except for participant 2. Also interesting is participant 5, who shows a sharp decrease in using the strategy in the last 10 games. Finally, the use of second-order strategies was substantial, although to a lesser extent than that of first-order strategies. If only the last bin is considered, all participants except participant 2 use first-order and second-order strategies substantially, and use the nonsensical strategy only occasionally, if at all.

Table 6

Relative Frequencies and Maximum Likelihood Estimates of First-Order, Second-Order, and Nonsensical Strategies for Each Participant

| Participant | Strategy | Relative frequency | Estimate (SE)ᵃ | z valueᵃ | p valueᵃ |
|---|---|---|---|---|---|
| 1 | First-order | .825 | .825 (.042)* | 19.419 | < .001 |
| 1 | Second-order | .413 | .412 (.055)* | 7.495 | < .001 |
| 1 | Nonsensical | .013 | .013 (.012) | 1.006 | .314 |
| 2 | First-order | .013 | .013 (.012) | 1.006 | .314 |
| 2 | Second-order | .088 | .088 (.032)* | 2.770 | .006 |
| 2 | Nonsensical | .038 | .038 (.021) | 1.766 | .077 |
| 3 | First-order | .363 | .362 (.054)* | 6.745 | < .001 |
| 3 | Second-order | .363 | .363 (.054)* | 6.745 | < .001 |
| 3 | Nonsensical | .038 | .038 (.021) | 1.766 | .077 |
| 4 | First-order | .638 | .637 (.054)* | 11.861 | < .001 |
| 4 | Second-order | .400 | .400 (.055)* | 7.303 | < .001 |
| 4 | Nonsensical | .025 | .025 (.017) | 1.432 | .152 |
| 5 | First-order | .550 | .550 (.056)* | 9.888 | < .001 |
| 5 | Second-order | .088 | .088 (.032)* | 2.770 | .006 |
| 5 | Nonsensical | .025 | .025 (.017) | 1.432 | .152 |
| 6 | First-order | .238 | .238 (.048)* | 4.992 | < .001 |
| 6 | Second-order | .050 | .050 (.024)* | 2.052 | .040 |
| 6 | Nonsensical | .013 | .013 (.012) | 1.006 | .314 |
| 7 | First-order | .375 | .375 (.054)* | 6.930 | < .001 |
| 7 | Second-order | .213 | .213 (.046)* | 4.650 | < .001 |
| 7 | Nonsensical | .013 | .013 (.012) | 1.006 | .314 |

ᵃMaximum likelihood estimation. The test statistic and corresponding p value indicate whether estimating the parameter results in a better prediction of the observed data than when fixing the parameter to zero. *p < .05.


[Figure 1: four panels, (a) to (d), showing the relative frequencies of strategies for participants 1 to 4, respectively. Each panel plots relative frequency (0.0 to 1.0) against binned games (1:5, 6:10, 11:15, 16:20) for the first-order, second-order, and nonsensical strategies.]

Figure 1. Relative frequencies of first-order, second-order, and nonsensical strategies for 4 bins of 5 games each (participants 1 to 4). Relative frequencies of using a first-order strategy are conditional on the participant being the first one to guess the sum of coins. Relative frequencies of using a second-order strategy or nonsensical strategy are conditional on the participant being the second one to guess the sum of coins.


[Figure 2: three panels, (a) to (c), showing the relative frequencies of strategies for participants 5 to 7, respectively. Each panel plots relative frequency (0.0 to 1.0) against binned games (1:5, 6:10, 11:15, 16:20) for the first-order, second-order, and nonsensical strategies.]

Figure 2. Relative frequencies of first-order, second-order, and nonsensical strategies for 4 bins of 5 games each (participants 5 to 7). Relative frequencies of using a first-order strategy are conditional on the participant being the first one to guess the sum of coins. Relative frequencies of using a second-order strategy or nonsensical strategy are conditional on the participant being the second one to guess the sum of coins.
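The binned, conditional relative frequencies plotted in Figures 1 and 2 could be computed along the following lines. This is a sketch only. The record format (game number, whether the participant guessed first, and the strategy label that was scored, if any) and the function name binned_frequencies are hypothetical, since the scoring data themselves are not shown here.

```python
from collections import defaultdict


def binned_frequencies(records, bin_size=5):
    """Relative frequency of each strategy per bin of games.

    records: iterable of (game_number, guessed_first, strategy), where
    strategy is "first-order", "second-order", "nonsensical", or None.
    A first-order strategy is only counted against observations in which
    the participant guessed first; the other two strategies only against
    observations in which the participant guessed second.
    """
    used = defaultdict(lambda: defaultdict(int))
    opportunities = defaultdict(lambda: defaultdict(int))
    for game_number, guessed_first, strategy in records:
        bin_index = (game_number - 1) // bin_size  # games 1 to 5 form bin 0
        if guessed_first:
            eligible = ("first-order",)
        else:
            eligible = ("second-order", "nonsensical")
        for label in eligible:
            opportunities[bin_index][label] += 1
        if strategy in eligible:
            used[bin_index][strategy] += 1
    return {
        bin_index: {
            label: used[bin_index][label] / count
            for label, count in per_label.items()
        }
        for bin_index, per_label in sorted(opportunities.items())
    }


# Hypothetical observations for illustration only.
example = [
    (1, True, "first-order"),
    (2, True, None),
    (3, False, "second-order"),
    (4, False, "nonsensical"),
    (6, False, None),
]
print(binned_frequencies(example))
```

Conditioning each strategy on the observations in which it could have been used keeps the frequencies comparable across bins, even when the participant guessed first more often in some bins than in others.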


Discussion

The main objective of this second experiment was to propose and assess a method to determine someone’s order of reasoning on the basis of the strategies used in the n-coin game. Maximum likelihood estimates of the most trivial model conceivable (i.e., a model in which the probability of using each strategy is itself the estimated parameter) reliably determined the participants’ order of reasoning. As argued before, the assumption that the probability of using a certain strategy is equal to the probability that one is able to use the corresponding order of reasoning does of course not hold. Nonetheless I argue that, since the use of a certain strategy requires the corresponding order of reasoning, determining whether someone uses such a strategy is sufficient to conclude that she or he is able to use the corresponding order of reasoning, provided that she or he does not use nonsensical strategies as well. It is therefore tentatively concluded that the proposed method, although trivial, seems successful in determining someone’s order of reasoning.

Although the proposed method and model are sufficient to determine someone’s order of reasoning, they do not explain how the strategies used come about. To allow for such an explanation, maximum likelihood estimation might still be used as a method, but the trivial model should be replaced by a more sophisticated one. Such a model might describe how the expected utility of each possible strategy is determined in each round of a game. These expected utilities depend on the order of reasoning of the participant, the expected order of reasoning of the opponent, and the second-order expected order of reasoning of the participant (i.e., which order of reasoning the participant expects the opponent to attribute to the participant). Which of these three dependencies apply to the strategies used by a participant determines her or his order of reasoning. Since the strategies someone uses might also depend on, for instance, a suspected regularity in the number of coins the opponent takes in her or his hand, a practical suggestion is to randomize the number of coins a participant and the opponent receive (see Footnote 4).

Footnote 4. Although in the current study the synthetic player received a random number of coins in her hand, the participant was free to choose a number of coins to take in her or his hand. Since the current study does not intend to explain how the strategies of participants come about, but solely intends to determine their order of reasoning, this has no implications for the model used.

Careful observation of the results reveals that first- and second-order reasoning do not provide a strategic advantage in isolation. Indeed, a deceptive strategy (which requires first-order reasoning) only pays off with certainty if the opponent does not anticipate deception. And even then it only pays off in the long term, since it results in a certain short-term loss. Vice versa, a strategy of anticipating deception (which requires second-order reasoning) may only pay off if the opponent uses a deceptive strategy. To gain a strategic advantage it is thus essential to understand that whether deception or anticipating deception pays off greatly depends on how the other player deploys her or his strategies. To illustrate this issue a few examples are given. One participant, as pointed out in the results section, used deception very infrequently, but did anticipate deception by the opponent. One explanation might be that this participant was not able to use higher-order reasoning, but used second-order strategies simply to counter the sometimes ‘strange behavior’ (i.e., first-order guesses) of the opponent. An alternative and more likely explanation, however, is that the participant was perfectly able to use higher-order reasoning, but failed to understand how it could benefit her or him. Yet another participant was observed to frequently use first- and second-order strategies, and also to explain these strategies correctly, but without being able to work out at what point these strategies actually increased the strategic advantage. The participant was found to anticipate deception very frequently, whereas the opponent never used deception. In this way one also fails to use higher-order reasoning to one’s benefit, and it may even become disadvantageous. Lack of the additional strategic reasoning that is required to use higher-order reasoning successfully may eventually lead to a decrease in the use of such strategies. One particular example is participant 5, who started out using deception, but greatly decreased the use of this strategy the more she or he played the game (see Figure 2(a)). Being unable to think of when to apply first- and second-order strategies, and thus not being able to take advantage of these strategies (or even using them to a disadvantage), might pose a threat to the actual use of the strategies. It might thus be argued that simply being able to use ToM does not provide a strategic advantage in isolation, since one first needs to realize that it can actually be used to a strategic advantage. Moreover, to realize this, one probably needs to understand how the first-order and second-order strategies need to be used. As a final remark, the additional strategic reasoning that is required might explain the behavioral gap that was described in the introduction. Nonetheless, using solely the behavioral responses in the n-coin game has a few important advantages. Most importantly, it does not depend on the verbal skills of participants. Moreover, participants of the first experiment seemed to be demotivated by the frequent interviews, whereas participants of the second experiment indicated that they were highly motivated. Of course, this difference might partly be attributed to the age difference between the two samples. However, it seemed that the uninterrupted play really motivated participants to beat the opponent. Naturally, a game comes into its own when it is played, not when it is talked about. The play behavior also allows for an analysis of the development of strategy use over time. This might give valuable additional insight into children’s development of strategic reasoning. Finally, if only play behavior is required, the n-coin game is most easily administered.

General discussion

The results of both experiments strongly suggest that the n-coin game is suitable for testing children’s ToM. The first experiment showed that young children are capable of learning to play the game in a short period of time, of contriving and using strategies that require ToM, and of expressing their strategic reasoning verbally. Since the ultimate goal is to assess ToM independent of verbal expressions, in the second experiment solely the play behavior was investigated. The second experiment showed that the order of ToM of adults can be accurately assessed using solely the play behavior. Also, unlike the children in the first experiment, the adults did not need to be actively encouraged to reason about the game in strategic terms.

In order to understand whether children’s ToM can be accurately assessed using a treatment similar to the one the adults received, it might only be necessary to create a game environment more suitable for children. Indeed, the first experiment already showed that the n-coin game is suitable in principle, whereas the second experiment showed the conditions of the game under which strategic behavior may naturally arise (e.g., a strictly standardized environment, a clear incentive, continuous play, an opponent that can be beaten, and a strategic advantage of each higher-order strategy over the one below it).
