
On the use of classical change scores in individual change assessment


Academic year: 2021


Tilburg University

On the use of classical change scores in individual change assessment

Gu, Z.

Publication date: 2020

Document Version

Publisher's PDF, also known as Version of record

Link to publication in Tilburg University Research Portal

Citation for published version (APA):

Gu, Z. (2020). On the use of classical change scores in individual change assessment. Gildeprint.

General rights

Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights.

• Users may download and print one copy of any publication from the public portal for the purpose of private study or research.
• You may not further distribute the material or use it for any profit-making activity or commercial gain
• You may freely distribute the URL identifying the publication in the public portal

Take down policy

If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately and investigate your claim.


Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, micro-filming, and recording, or by any information storage and retrieval system, without written permission of the author. Printing was financially supported by Tilburg University.


On the Use of Classical Change Scores in Individual Change Assessment

Dissertation submitted to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, prof. dr. K. Sijtsma, to be defended in public before a committee appointed by the Doctorate Board, in the Aula of the University on Friday 6 March 2020 at 13:30 by


Supervisor: Prof. dr. K. Sijtsma

Co-supervisor: Dr. W.H.M. Emons


Table of Contents

Chapter 1: Introduction

Chapter 2: Review of Issues About Classical Change Scores: A Multilevel Modeling Perspective on Some Enduring Beliefs

2.1 Introduction

2.2 Classical Test Theory Model for Change Scores

2.3 Five Beliefs About Change Scores

2.4 A Multilevel Change-Score Framework

2.5 Negative Beliefs About Classical Change Scores Revisited: Change Scores Can Be Useful

2.6 Discussion

2.7 Appendices

Chapter 3: Estimating Change-Score Reliability

3.1 Introduction

3.2 Theory

3.3 Method

3.4 Results

3.5 Discussion

3.6 Appendices

Chapter 4: Statistically Reliable Change: A Unified Treatment from the Classical Psychometric Perspective

4.1 Introduction

4.2 Theory

4.3 Method

4.4 Results

4.5 Discussion

4.6 Appendices

Chapter 5: Precision and Sample Size Requirements for Regression-Based Norming Methods for Change Scores

5.1 Introduction

5.2 Norming Methods for Change Scores

5.3 Method

5.4 Results

5.5 Discussion

Chapter 6: Epilogue

References

Summary


Chapter 1: Introduction

In psychological and clinical research and practice, one often uses psychological tests or questionnaires to monitor a person’s progress, such as decrease in anxiety level or depression level, and to provide personalized recommendations. For example, in psychological and psychiatric research, researchers often use the Social Phobia and Anxiety Inventory (SPAI; Turner, Beidel, Dancu, & Stanley, 1989) to assess a person’s change in anxiety level in social settings as a function of treatment for social phobia (Beidel, Turner, & Cooley, 1993). In clinical practice, clinicians often use the Patient Health Questionnaire (PHQ; Spitzer, Kroenke, Williams, & the Patient Health Questionnaire Primary Care Study Group, 1999) to monitor the progression of a patient’s depression level so as to provide effective treatment (Lowe, Unutzer, Callahan, Perkins, & Kroenke, 2004).

This dissertation concerns the simplest, yet commonest practice in change assessment at the individual level, which is inferring a person’s change from a change score obtained from a pretest-posttest design. In the past half-century, polarized opinions about the usefulness of change scores have created an uncomfortable dilemma for applied psychologists and clinicians. From time to time, researchers are taught by some that change scores should not be used at all (e.g., Bereiter, 1963; Cronbach & Furby, 1970; Linn & Slinde, 1977; Lord, 1963; O’Connor, 1972), but they are also taught by others that change scores may be useful (e.g., Overall & Woodward, 1975; Rogosa & Willett, 1983; Williams & Zimmerman, 1996; Zimmerman & Williams, 1982a, 1982b). As a result, empirical researchers may avoid change scores (e.g., Denney, Rapport, & Chung, 2005; Finney, Moos, & Mewborn, 1980; Kim & Camilli, 2014; Raaijmakers, 2016; Sandell & Wilczek, 2016; Son & Morrison, 2010; B. J. Williams & Kaufmann, 2012) or they may faithfully apply them (e.g., Arana, Miracco, Galarregui, & Keegan, 2017; de Boer, Boon, Verheij, Donker, & Vermeiren, 2017; Esteso Orduña et al., 2017; Jacobson, Follette, & Revenstorf, 1984; Jacobson & Truax, 1991; Mallorquí-Bagué et al., 2017; Rizvi, Hughes, Hittman, & Vieira Oliveira, 2017; Singh & O’Brien, 2017).

change-score reliability, the methods for computing the reliable change index (RCI; e.g., Jacobson et al., 1984; Jacobson & Truax, 1991), and the interpretation of change scores by means of normed scores. This is the logic of the organization of this dissertation.

In this dissertation, from the CTT perspective, I provide answers to the following questions:

1. What are the major issues concerning the use of change scores? Are change scores useful after all?

2. How can change-score reliability be estimated? Is there an alternative approach to estimating change-score reliability that is better than the traditional one documented in psychometrics textbooks (e.g., Allen & Yen, 2002; Crocker & Algina, 2008; Furr & Bacharach, 2014; Lord & Novick, 1968)?

3. When assessing individual change in pretest-posttest settings, one often uses a method called the reliable change index (RCI; Jacobson et al., 1984; Jacobson & Truax, 1991), in which change scores are employed. How can the theory of the RCI be understood from the CTT perspective? Can the existing RCI methods be improved further?

4. Concerning the interpretation of change scores by means of regression-based norming methods in practice, is there a unified framework for norming change scores? What are the precision and sample size requirements for norming change scores?

Outline of the dissertation

In Chapter 2, I discuss a CTT-based multilevel modeling framework for change scores. Using this framework, I systematically scrutinize important concepts relevant to change scores, including change-score reliability, measurement precision of change scores, the correlation between the pretest and the posttest scores, and the correlation between the pretest and the change scores, and I explain the interrelationships among these concepts. The multilevel framework convincingly refutes the following five negative beliefs about change scores:

- Belief 1: Change scores are unreliable.

- Belief 2: Change-score reliability tends to be lower than the reliability of separate test scores; hence, use of change-score reliability is problematic if the reliability of either test score is low.

- Belief 3: If change-score reliability is low, then change measurement at the individual level is rendered useless.

- Belief 4: Change-score reliability and the correlation between the pretest and the posttest scores cannot both be high.

- Belief 5: Change scores are inappropriate, because they typically correlate negatively with pretest scores.

Perhaps the most widespread negative belief is that change scores are unreliable (i.e., Belief 1). The multilevel framework shows that change-score reliability is not intrinsically low, and that its value should be estimated anew in any empirical study and therefore should be treated as an empirical issue.

The next chapter discusses how one can estimate change-score reliability in empirical studies. Most psychometric textbooks (e.g., Allen & Yen, 2002, p. 210; Crocker & Algina, 2008, p. 149; Furr & Bacharach, 2014, p. 155; Lord & Novick, 1968, p. 76) recommend one method, which I refer to as the traditional method. The traditional method is defined as follows. Let $r_{11'}$ and $r_{22'}$ denote the reliabilities of the pretest scores and the posttest scores, respectively. Let $S_1^2$ and $S_2^2$ denote the sample variances of the pretest scores and the posttest scores, respectively. Let $r_{12}$ denote the sample correlation between the pretest scores and posttest scores. Then the change-score reliability of a sample is

$$r_{DD'} = \frac{r_{11'} S_1^2 + r_{22'} S_2^2 - 2 r_{12} S_1 S_2}{S_1^2 + S_2^2 - 2 r_{12} S_1 S_2}. \tag{1.1}$$

This method, albeit being used for at least half a century, is not the best method statistically speaking. Chapter 3 proposes an alternative method, referred to as the item-level method, which, as the simulation results suggested, has smaller bias and greater precision than the traditional method in various test settings.
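As a minimal sketch, the traditional method in Equation (1.1) can be computed directly from the five sample quantities. The function name and the input values below are illustrative assumptions and do not come from the dissertation; the function takes standard deviations rather than variances.

```python
def traditional_change_score_reliability(r11, r22, s1, s2, r12):
    """Change-score reliability according to Equation (1.1).

    r11, r22: reliabilities of the pretest and posttest scores
    s1, s2:   sample standard deviations of pretest and posttest scores
    r12:      sample correlation between pretest and posttest scores
    """
    cov12 = r12 * s1 * s2
    numerator = r11 * s1 ** 2 + r22 * s2 ** 2 - 2 * cov12
    denominator = s1 ** 2 + s2 ** 2 - 2 * cov12  # variance of the change scores
    if denominator <= 0:
        raise ValueError("the change-score variance must be positive")
    return numerator / denominator


# Hypothetical sample values, chosen only for illustration.
estimate = traditional_change_score_reliability(
    r11=0.85, r22=0.80, s1=6.0, s2=7.0, r12=0.40
)
print(round(estimate, 3))
```

The denominator is the sample variance of the change scores, so the estimate is undefined when pretest and posttest scores are perfectly correlated with equal variances.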


borrowing information from other persons in the sample. Chapter 4 provides a unified treatment of the RCI from the CTT perspective, discusses four equations for computing the SEM, and shows that the equation based on the item-level method (discussed in Chapter 3) is the most suitable for computing the RCI. In addition, Chapter 4 reports a generalization of the RCI to assessing profile change, which I refer to as $\mathrm{RCI}_{\text{profile}}$.


Chapter 2: Review of Issues About Classical Change Scores: A Multilevel Modeling Perspective on Some Enduring Beliefs

Abstract

Change scores obtained in pretest-posttest designs are important for evaluating treatment effectiveness and for assessing change of individual test scores in psychological research. However, over the years the use of change scores has raised much controversy. In this chapter, from a multilevel perspective, we provide a structured treatise on several persistent negative beliefs about change scores and show that these beliefs originated from the confounding of the effects of within-person change on change-score reliability and between-person change differences. We argue that psychometric properties of change scores, such as reliability and measurement precision, should be treated at suitable levels within a multilevel framework. We show that, if examined at the suitable levels within such a framework, the negative beliefs about change scores can be refuted convincingly. Finally, we summarize the conclusions about change scores to dispel the myths and to promote the potential and practical usefulness of change scores.

Keywords: change score, classical test theory, measurement precision, negative beliefs about change scores, reliability

Based on Gu, Z., Emons, W.H.M., & Sijtsma, K. (2018). Review of issues about classical change scores: a multilevel modeling perspective on some enduring beliefs. Psychometrika,


2.1 Introduction

Change scores obtained in pretest–posttest designs are important for evaluating treatment effectiveness (e.g., Jacobson, Follette, & Revenstorf, 1984; Jacobson & Truax, 1991; Ogles, Lunnen, & Bonesteel, 2001; Wise, 2004) but several authors have advised against the use of change scores. In their seminal paper “How We Should Measure ‘Change’: Or Should We?”, Cronbach and Furby (1970) claimed that raw change scores were rarely useful. Although several studies have suggested that Cronbach and Furby were too pessimistic (e.g., Overall & Woodward, 1975; Rogosa & Willett, 1983; Zimmerman & Williams, 1982a, 1982b), the bad reputation of change scores was eventually established, with a few pessimistic beliefs about change scores promoted by various researchers (e.g., Bereiter, 1963; Denney, Rapport, & Chung, 2005; Finney, Moos, & Mewborn, 1980; Kim & Camilli, 2014; Linn & Slinde, 1977; Lord, 1963; O’Connor, 1972; Raaijmakers, 2016; Sandell & Wilczek, 2016; Son & Morrison, 2010; Williams & Kaufmann, 2012).

We argue that these pessimistic beliefs are often based on incomplete, unsound, or incorrect reasoning, and that there is a future for the use of change scores. This chapter provides a structured treatise on five frequently reported and widespread beliefs about change scores in pretest–posttest designs. As a result of these beliefs, the abundant literature on change scores lacks consensus about the usefulness of change scores. Conclusions about change scores, even when they are not altogether wrong, have cultivated pessimistic beliefs that discouraged the application of change scores.


the usefulness of change scores because they ignored the difference between the within-person and the between-person levels. Throughout this chapter, we discuss change scores in the clinical setting, but change scores are also relevant in other research areas, including educational measurement, neuropsychology, and quality-of-life research. For example, in educational measurement learning progress may be monitored in longitudinal designs involving repeated measurement to assess students’ change.

In what follows, we first review the change-score literature in the context of classical test theory (CTT; Lord & Novick, 1968, pp. 55-172) and summarize five incomplete, unsound, or incorrect beliefs that have been firmly established for several decades and that have discouraged the use of change scores in practice. Next, we critically examine the five beliefs in the context of a multilevel framework. Finally, we present conclusions and thoughts on the correct use of change scores to rehabilitate the reputation of change scores.

2.2 Classical Test Theory Model for Change Scores

According to CTT, an observed test score for person $i$, denoted by $X_i$, can be decomposed into a true score component, denoted by $T_i$, and a random measurement error component, denoted by $E_i$ (Lord & Novick, 1968, pp. 27-52), so that

$$X_i = T_i + E_i. \tag{2.1}$$

Let $X_{i1}$ and $X_{i2}$ be the pretest score and the posttest score for person $i$; then the vector containing the pretest score and the posttest score can be written as

$$\begin{pmatrix} X_{i1} \\ X_{i2} \end{pmatrix} = \begin{pmatrix} T_{i1} \\ T_{i2} \end{pmatrix} + \begin{pmatrix} E_{i1} \\ E_{i2} \end{pmatrix}. \tag{2.2}$$

For person $i$, her change score is defined as

$$D_i = X_{i2} - X_{i1} = (T_{i2} - T_{i1}) + (E_{i2} - E_{i1}). \tag{2.3}$$

Reliability of change scores in the population, denoted by $\rho_{DD'}$, can be derived as

$$\rho_{DD'} = \frac{\rho_{11'}\sigma_1^2 + \rho_{22'}\sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2}, \tag{2.4}$$

where $\rho_{11'}$ and $\rho_{22'}$ denote the reliability of the pretest and the posttest scores, $\sigma_1^2$ and $\sigma_2^2$ denote the variance of the pretest and the posttest scores, and $\rho_{12}$ denotes the correlation between the pretest and the posttest scores (see, e.g., Linn & Slinde, 1977; Lord & Novick, 1968; Stanley, 1967). One may notice that Equation (2.4) assumes $\sigma_{E_1E_2} = 0$; that is, uncorrelated measurement errors between the pretest and the posttest.
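The steps behind Equation (2.4) can be sketched as follows. The sketch uses only the CTT definition of reliability as the ratio of true-score variance to observed-score variance, so that $\sigma^2_{T_j} = \rho_{jj'}\sigma_j^2$, together with the assumption of uncorrelated errors, under which the covariance of the true scores equals the covariance of the observed scores. The symbols $\sigma^2_D$ and $\sigma^2_{T_D}$ (observed and true change variance) are notation introduced here for illustration.

```latex
\begin{align*}
\sigma^2_{D} &= \sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2, \\
\sigma^2_{T_D} &= \sigma^2_{T_1} + \sigma^2_{T_2} - 2\sigma_{T_1 T_2}
  = \rho_{11'}\sigma_1^2 + \rho_{22'}\sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2, \\
\rho_{DD'} &= \frac{\sigma^2_{T_D}}{\sigma^2_{D}}
  = \frac{\rho_{11'}\sigma_1^2 + \rho_{22'}\sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2}
         {\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2}.
\end{align*}
```

If the errors were correlated, $\sigma_{T_1 T_2}$ would differ from $\rho_{12}\sigma_1\sigma_2$ by $\sigma_{E_1E_2}$, and the second line would no longer hold as written.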


under identical circumstances. Measurement precision is not necessarily the same for each individual. In practice, however, to estimate $\sigma^2_{E_i}$, one relies on the error variance of the test score across persons, denoted by $\sigma^2_E$, obtained from a single administration, which is equal to the average of the error variances within persons (Lord & Novick, 1968, p. 35). Consequently, in the practice of test use, measurement precision is assumed to be the same for each individual, but this is not an a priori assumption in CTT. The error variance of the test score across persons, $\sigma^2_E$, is expressed by means of the standard error of measurement (SEM) of test score $X$, that is, $\mathrm{SEM} = \sigma_E = \sigma_X\sqrt{1 - \rho_{XX'}}$.

Measurement precision of the change scores, estimated by $\mathrm{SEM}_D$, is often obtained by assuming equal SEM at pretest and posttest (Jabrayilov, Emons, & Sijtsma, 2016), which results in

$$\mathrm{SEM}_D = \sqrt{2}\,\sigma_E = \sqrt{2}\,\sigma_{X_1}\sqrt{1 - \rho_{X_1X_1'}}. \tag{2.5}$$

In practice, $\mathrm{SEM}_D$ is assumed to be the same for each individual.
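A minimal sketch of the SEM of a test score and of Equation (2.5) follows. The function names and the numerical inputs are assumptions made for illustration; the second function relies on the equal-SEM assumption stated above.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def sem_change(sd_pretest, reliability_pretest):
    """SEM of the change score, Equation (2.5).

    Assumes the SEM is the same at pretest and posttest.
    """
    return math.sqrt(2.0) * sem(sd_pretest, reliability_pretest)


# Hypothetical pretest standard deviation and reliability.
print(round(sem(10.0, 0.84), 3))
print(round(sem_change(10.0, 0.84), 3))
```

Because the two error components add up in the change score, the SEM of the change score is larger than the SEM of either test score by a factor of $\sqrt{2}$ under this assumption.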

2.3 Five Beliefs About Change Scores

We review five beliefs about change scores that have been accepted as being true for several decades. In later sections, we critically examine these beliefs by means of a multilevel framework so as to promote the usefulness of change scores.

Belief 1: Change scores are unreliable.

Historically, one of the main reasons why researchers (e.g., Bereiter, 1963; Cronbach & Furby, 1970; O’Connor, 1972) advised against using change scores is the widely accepted belief that change scores inevitably have low reliability. For example, Angoff claimed that “one highly disturbing characteristic of score changes is their extremely low reliability” (Angoff, 1984, p. 55). Although one might argue that in actual test use the reliability of change scores is an empirical issue, for a large part this belief was based on analytical results derived from Equation (2.4) (e.g., Linn & Slinde, 1977; Lord, 1963). Linn and Slinde (1977) assumed equal observed-score standard deviations at pretest and posttest (i.e., $\sigma_1 = \sigma_2$) because “the standard deviations of the pre- and post-measures are often of relatively similar magnitude” (ibid., p. 122), and assumed equal test score reliabilities ($\rho_{11'} = \rho_{22'} = \rho$), resulting in a simplified version of Equation (2.4), which is

$$\rho_{DD'} = \frac{\rho - \rho_{12}}{1 - \rho_{12}}. \tag{2.6}$$


Based on Equation (2.6), Linn and Slinde (1977, p. 123) further presented numerical examples to show that $\rho_{DD'}$ is discouragingly low when $\rho_{12}$ is large. For example, when $\rho_{12} = .5$ and $\rho = .8$, we find $\rho_{DD'} = .6$, a value smaller than the reliability of the pretest and posttest scores. Linn and Slinde acknowledged that Equation (2.6) was a special case, but this did not keep other researchers (e.g., Denney et al., 2005; Finney et al., 1980; Gjerustad & von Soest, 2012; Gold et al., 2013; Guo, Tompkins, Justice, & Petscher, 2014; Holahan & Moos, 1981; R. L. Linn & Haug, 2002; Parker & Dabros, 2012; Raaijmakers, 2016; Roohr, Liu, & Liu, 2016; Stevenson, Heiser, & Resing, 2013) from adopting the conclusion that the reliability of change scores is low.
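The numerical example can be reproduced with a short sketch of Equation (2.6). The function name and the grid of pretest-posttest correlations are illustrative assumptions; only the case with a correlation of .5 and a common reliability of .8 is taken from the text.

```python
def simplified_change_score_reliability(rho, rho12):
    """Equation (2.6): equal standard deviations and equal reliabilities."""
    if rho12 >= 1.0:
        raise ValueError("the pretest-posttest correlation must be below 1")
    return (rho - rho12) / (1.0 - rho12)


# Common test-score reliability of .8, several pretest-posttest correlations.
for rho12 in (0.0, 0.3, 0.5, 0.7):
    value = simplified_change_score_reliability(0.8, rho12)
    print(f"rho12 = {rho12:.1f} -> change-score reliability = {value:.2f}")
```

Under these two restrictive assumptions, the change-score reliability decreases as the pretest-posttest correlation increases, which is the pattern that the numerical examples were meant to show.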

This conclusion has greatly discouraged the use of change scores. Many researchers believe that, given that low reliability is a characteristic of change scores, using change scores is inadvisable, because low change-score reliability suggests that observed change scores are not highly correlated with true change scores, and hence, it makes no sense to infer true change from observed change. Because of this line of reasoning, Draheim, Hicks, and Engle (2016, p. 140) concluded that change scores “…in general have such low reliability that some researchers have advised against using them in any circumstance.”

Belief 2: Change-score reliability tends to be lower than the reliability of separate test scores; hence, use of change-score reliability is problematic if the reliability of either test score is low.

This belief has been widely adopted since the 1960s, when Lord (1963, p. 32) explicitly stated: “It is well known that the difference between two fallible measures is frequently much more fallible than either.” Other authors made similar statements (Cronbach & Furby, 1970; Linn & Haug, 2002; B. J. Williams & Kaufmann, 2012). The belief was fostered because researchers tried to derive psychometric properties of change scores from

$$\rho_{DD'} = \frac{(\rho_{11'} + \rho_{22'})/2 - \rho_{12}}{1 - \rho_{12}}, \tag{2.7}$$

which is obtained by introducing the assumption $\sigma_1 = \sigma_2$ (see, e.g., R. H. Williams &


separate test scores is low, then change-score reliability should definitely be lower, and therefore, change scores are not useful. Because of this line of reasoning, several researchers consider Belief 2 a fact, witnessed for example by B. J. Williams and Kaufmann (2012, pp. 886-887), who stated that “it has long been known (e.g., Lord, 1963) that difference scores generally have lower reliabilities than their component scores.”
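The reasoning behind Belief 2 can be made explicit with a short worked step based on Equation (2.7). The sketch holds only under the assumption $\sigma_1 = \sigma_2$ on which Equation (2.7) rests; the symbol $\bar{\rho}$ for the average of the two test-score reliabilities is notation introduced here for illustration.

```latex
\begin{align*}
\bar{\rho} &= \frac{\rho_{11'} + \rho_{22'}}{2}, \\
\bar{\rho} - \rho_{DD'}
  &= \bar{\rho} - \frac{\bar{\rho} - \rho_{12}}{1 - \rho_{12}}
   = \frac{\rho_{12}\,(1 - \bar{\rho})}{1 - \rho_{12}} \;\geq\; 0
   \quad \text{for } 0 \leq \rho_{12} < 1 .
\end{align*}
```

Under equal standard deviations and a nonnegative pretest-posttest correlation, the change-score reliability therefore cannot exceed the average reliability of the two test scores, with equality when $\rho_{12} = 0$ or $\bar{\rho} = 1$. The inequality says nothing about cases in which $\sigma_1 \neq \sigma_2$.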

Belief 3: If change-score reliability is low, then change measurement at the individual level is rendered useless.

Mellenbergh (1996) made a sharp distinction between test score reliability and measurement precision. The former concept is a group characteristic, uninformative of the precision with which individuals are measured, and the latter concept provides the statistical precision of individual test scores. A high reliability does not imply precise individual-score measurement, and Sijtsma and van der Ark (2015) warned that a common misguided advice in applied research was that, in order to measure individuals precisely, the reliability of a test score should be at least .8 or .9. This misguided advice has been extended to change measurement. For example, Streiner and Norman (2008, p. 285) argued that, because both pretest and posttest scores contained measurement error, change scores run the risk of loss of precision, and the authors advised against using change scores if the reliability was below .5. The authors further claimed that “reliability is a necessary precondition for the appropriate application of change scores” (ibid., p. 285). Different from the first two beliefs, this one attacks change scores from another angle by suggesting that change-score reliability, which is a group characteristic, also is informative about measurement precision for individuals.

Belief 4: Change-score reliability and the correlation between the pretest and the posttest scores cannot both be high.

This belief suggests that only two situations can happen: (1) Change-score reliability, $\rho_{DD'}$, is high, whereas the correlation between pretest and posttest scores, $\rho_{12}$, is low. A low correlation suggests that pretest and posttest measure different attributes, but the change score that reflects the difference between the two incomparable entities is reliable; and (2) $\rho_{DD'}$ is low, whereas $\rho_{12}$ is high, suggesting when pretest and posttest measure the same attribute,


of change scores is only .33, and if the correlation is reduced to 0, then the reliability of change scores becomes .8.

According to Bereiter (1963, pp. 9-11), for practical use, pretest and posttest must measure the same attribute (that is, $\rho_{12}$ must be high) so that change scores can be derived by means of taking the difference between pretest and posttest scores. Meanwhile, change scores must be reliable (that is, $\rho_{DD'}$ must be high), because low change-score reliability may suggest that observed change scores are not highly correlated with true change scores, and if $\rho_{DD'}$ is low, one should not infer true change from observed change. Thus, Belief 4 raises a fundamental question about the usefulness of change scores, because Belief 4 asserts that change scores either cannot be interpreted meaningfully or cannot be used reliably. This belief, according to Bergman (2001, p. 39), introduced “the acute problem of how to measure and interpret change.” Several researchers (e.g., Embretson, 1991; Linn & Slinde, 1977) warned against the use of change scores, and recent studies echoed their concerns (e.g., Guo et al., 2014; Sandell & Wilczek, 2016; Son & Morrison, 2010).

Belief 5: Change scores are inappropriate, because they typically correlate negatively with pretest scores.

The correlation between the change score and the pretest score equals

$$\rho_{D1} = \frac{\rho_{12}\sigma_2 - \sigma_1}{\sqrt{\sigma_1^2 + \sigma_2^2 - 2\rho_{12}\sigma_1\sigma_2}} \tag{2.8}$$

(Linn & Slinde, 1977). According to Linn and Slinde (p. 122), change score “typically has a negative correlation with the pretest,” because $\rho_{12}\sigma_2$ typically is smaller than $\sigma_1$, given that $\sigma_1 = \sigma_2$ often happens and that $\rho_{12} < 1$. Although Linn and Slinde identified this negative correlation as one of the major defects of change scores, the authors did not explicitly explain why negative correlation was a defect and why change scores should not be used because of this defect. Nevertheless, because of this negative correlation, many researchers (e.g., Finney et al., 1980; Fiszdon & Johannesen, 2010; Guo et al., 2014; Holahan & Moos, 1981; Kerckhoff, 1986; Kim & Camilli, 2014; Li, Cohen, Bottge, & Templin, 2016; Sandell & Wilczek, 2016; Son & Morrison, 2010) suggested that using change scores was problematic.
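Equation (2.8) can be evaluated with a small sketch to see when the correlation between change and pretest scores is negative. The function name and all numerical inputs are illustrative assumptions, not values reported in the text.

```python
import math


def change_pretest_correlation(sd1, sd2, rho12):
    """Correlation between change score and pretest score, Equation (2.8).

    sd1, sd2: standard deviations of pretest and posttest scores
    rho12:    correlation between pretest and posttest scores
    """
    var_change = sd1 ** 2 + sd2 ** 2 - 2 * rho12 * sd1 * sd2
    if var_change <= 0:
        raise ValueError("the change-score variance must be positive")
    return (rho12 * sd2 - sd1) / math.sqrt(var_change)


# Equal standard deviations: the numerator is sd1 * (rho12 - 1), hence negative.
print(change_pretest_correlation(1.0, 1.0, 0.5))
# A posttest standard deviation large enough that rho12 * sd2 exceeds sd1.
print(change_pretest_correlation(1.0, 2.5, 0.5))
```

The sign of the correlation depends only on whether $\rho_{12}\sigma_2$ is smaller or larger than $\sigma_1$, so a negative value follows from the equal-variance assumption rather than from change scores as such.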


restrictive assumptions that have been made to gauge the psychometric properties of change scores create confusion between the two distinctive levels of analyzing change and therefore give rise to the five beliefs.

In the next section, we discuss change scores from a multilevel change-score perspective that explicitly distinguishes the two levels of change.

2.4 A Multilevel Change-Score Framework

We present a multivariate multilevel latent variable model to replace the CTT model for change scores. Level 1 pertains to the bivariate distribution of pretest and posttest scores for individuals, and level 2 pertains to between-person differences in the bivariate distribution at level 1.

Level 1 model. At level 1, the joint distribution of pretest and posttest scores of one individual across hypothetical independent replications of the same test obtained under identical conditions is modeled. To understand this model, we consider the following thought experiment. Suppose we could establish the pretest and the posttest scores of a person $i$ infinitely many times under the same conditions; then we would obtain a bivariate distribution of pretest and posttest scores for this particular person. This hypothetical distribution is the propensity distribution (Lord & Novick, 1968, p. 30), which for change measurement is bivariate. Let $X_{i1}$ be the pretest score, $X_{i2}$ be the posttest score, $D_i$ be the change score, $\theta_{i1}$ be the true pretest score, $\delta_i$ be the true change, and let $\varepsilon_{i1}$ and $\varepsilon_{i2}$ be the random measurement errors pertaining to the pretest and the posttest scores, respectively. In the level 1 model, $X_{i1}$, $X_{i2}$, $\varepsilon_{i1}$, and $\varepsilon_{i2}$ are random variables, following a bivariate distribution with unknown parameters $\delta_i$, $\theta_{i1}$ and the covariance matrix $\Sigma_\varepsilon$. Thus, the vector of observed pretest and posttest scores for person $i$ is modeled as

$$\begin{pmatrix} X_{i1} \\ X_{i2} \end{pmatrix} = \begin{pmatrix} \theta_{i1} \\ \theta_{i1} + \delta_i \end{pmatrix} + \begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \end{pmatrix}. \tag{2.9}$$

We assume that measurement errors for person $i$ are bivariate normally distributed as

$$\begin{pmatrix} \varepsilon_{i1} \\ \varepsilon_{i2} \end{pmatrix} \sim N(\mathbf{0}, \Sigma_\varepsilon), \quad \text{where} \quad \Sigma_\varepsilon = \begin{pmatrix} \sigma^2_{\varepsilon_1} & \sigma_{\varepsilon_1\varepsilon_2} \\ \sigma_{\varepsilon_1\varepsilon_2} & \sigma^2_{\varepsilon_2} \end{pmatrix}. \tag{2.10}$$

In order for $\Sigma_\varepsilon$ to be a valid variance-covariance matrix, $\Sigma_\varepsilon$ must be positive semidefinite, that


by $\sigma_{\varepsilon_1\varepsilon_2}$. In a sense, part of the error variance here can be conceived as a systematic but construct-irrelevant source of variance. The errors are referred to as random errors because they are assumed to be randomly drawn from the bivariate normal distribution (Equation 2.10) for each person. This idea of correlated random errors is similar to the framework of the general linear model for longitudinal data, where random errors are correlated across time points for each subject (Diggle, Heagerty, Liang, & Zeger, 2013, p. 55). In CTT, all systematic variations are absorbed in the true scores, and therefore, errors do not correlate. We notice that R. H. Williams and Zimmerman (1977) and Zimmerman and Williams (1982a) discussed $\sigma_{\varepsilon_1\varepsilon_2} \neq 0$ within the CTT framework. However, it appears they did not realize that in CTT systematic error is considered part of the true score and not part of the random error, and therefore $\sigma_{\varepsilon_1\varepsilon_2} \equiv 0$ in the CTT model for change.
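The thought experiment behind the level 1 model can be sketched as a small simulation for a single person. The sketch assumes the bivariate normal error distribution of Equation (2.10); the function names, the seed, and all parameter values are hypothetical and serve only as illustration. The validity check uses the fact that a symmetric 2x2 matrix is positive semidefinite when both variances are nonnegative and the squared covariance does not exceed the product of the variances.

```python
import math
import random


def is_valid_error_covariance(var_e1, var_e2, cov_e12):
    """Check that the 2x2 error covariance matrix is positive semidefinite."""
    return var_e1 >= 0 and var_e2 >= 0 and cov_e12 ** 2 <= var_e1 * var_e2


def draw_replications(theta1, delta, var_e1, var_e2, cov_e12, n, seed=0):
    """Draw n hypothetical (pretest, posttest) replications for one person.

    theta1 is the true pretest score and delta the true change, as in
    Equation (2.9); the errors follow Equation (2.10).
    """
    if not is_valid_error_covariance(var_e1, var_e2, cov_e12):
        raise ValueError("error covariance matrix is not positive semidefinite")
    rng = random.Random(seed)
    sd1, sd2 = math.sqrt(var_e1), math.sqrt(var_e2)
    corr = cov_e12 / (sd1 * sd2) if sd1 > 0 and sd2 > 0 else 0.0
    corr = max(-1.0, min(1.0, corr))  # guard against rounding error
    scores = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e1 = sd1 * z1
        # Second error shares the component z1, which induces the covariance.
        e2 = sd2 * (corr * z1 + math.sqrt(1.0 - corr * corr) * z2)
        scores.append((theta1 + e1, theta1 + delta + e2))
    return scores


# Hypothetical person: true pretest score 20, true change 5.
draws = draw_replications(20.0, 5.0, var_e1=4.0, var_e2=9.0, cov_e12=2.0, n=5)
for pretest, posttest in draws:
    print(round(pretest, 2), round(posttest, 2), round(posttest - pretest, 2))
```

Across many replications, the observed change scores of this person scatter around the true change, and the spread of that scatter is what measurement precision at the within-person level refers to.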

One may notice that the formulation of the model in Equations (2.9) and (2.10) may imply that the true posttest score is multidimensional, but in this chapter for two reasons we do not consider explicitly the dimensionality of the true and observed posttest scores. First, the dimensionality of the true and observed posttest scores should be regarded as an empirical issue. For example, change can happen on one dimension (e.g., a patient’s anxiety level may diminish), but due to treatment new dimensions may appear (e.g., the patient’s introversion level may also diminish and her self-image may become more positive). Therefore, whether a second dimension appears after the treatment is an empirical question. Second, even though Equations (2.9) and (2.10) can be conceived as a multidimensional model, the resulting data are unidimensional; that is, the dimensions of pretest and change cannot be distinguished in empirical data produced by the model (see Reckase, 2009, pp. 194-201). To find the dimensions (i.e., pretest and change), we need posttest items that load differently on the dimensions, that is, some posttest items only measuring change, and others only measuring pretest status. Most instruments do not have items that ask specifically about pretest conditions at posttest.

Level 2 model. The level 2 model deals with the random selection of persons from the population (Lord & Novick, 1968, pp. 32-34). Let parameter $\mu_{T_1}$ denote the population mean of true pretest scores, let parameter $\mu_{\Delta}$ denote the population mean of true change, and let $u_1$ and $u_2$ denote the disturbance terms representing the difference between the true scores for person $i$ and the corresponding population means:


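$$
\begin{pmatrix} T_{i1} \\ T_{i1} + \Delta_i \end{pmatrix} = \begin{pmatrix} \mu_{T_1} \\ \mu_{T_1} + \mu_{\Delta} \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \tag{2.11}
$$

with

$$
\begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim N(\mathbf{0}, \Sigma_u), \tag{2.12}
$$

and

$$
\Sigma_u = \begin{pmatrix} \sigma^2_{T_1} & \sigma_{T_1\Delta} + \sigma^2_{T_1} \\ \sigma_{T_1\Delta} + \sigma^2_{T_1} & \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{T_1} \end{pmatrix}. \tag{2.13}
$$

Appendix A explains how variance-covariance matrix $\Sigma_u$ can be derived. Matrix $\Sigma_u$ summarizes the population variance of true scores at pretest ($\sigma^2_{T_1}$), between-person differences in true change ($\sigma^2_{\Delta}$), and association between true pretest scores and magnitude of true change ($\sigma_{T_1\Delta}$). We also assume sampling independence, which means that between-person disturbances are uncorrelated. One may notice that, even without treatment, attributes can be unstable over time, and due to volatility of the attribute, $\sigma^2_{\Delta}$ may capture both the treatment effect and the change, which the multilevel model does not distinguish.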

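The level 1 model and the level 2 model are often combined (e.g., Raudenbush & Bryk, 2002, pp. 19-21), which results in the multilevel change-score model,

$$
\begin{pmatrix} X_{i1} \\ X_{i2} \end{pmatrix} = \begin{pmatrix} \mu_{T_1} \\ \mu_{T_1} + \mu_{\Delta} \end{pmatrix} + \begin{pmatrix} u_1 + E_{i1} \\ u_2 + E_{i2} \end{pmatrix}, \tag{2.14}
$$

with

$$
\begin{pmatrix} u_1 + E_{i1} \\ u_2 + E_{i2} \end{pmatrix} \sim N(\mathbf{0}, \Sigma) \tag{2.15}
$$

and

$$
\Sigma = \begin{pmatrix} \sigma^2_{E_1} + \sigma^2_{T_1} & \sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma_{E_1E_2} \\ \sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma_{E_1E_2} & \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma^2_{E_2} \end{pmatrix}. \tag{2.16}
$$

It may be noted that $\Sigma$ is a linear combination of $\Sigma_E$ at level 1 and $\Sigma_u$ at level 2.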


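and let $\sigma_{T_1T_2} = \sigma_{T_1\Delta} + \sigma^2_{T_1}$ and $\sigma^2_{T_2} = \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{T_1}$ in Equation (2.13), so that we can rewrite the multilevel change-score model as:

$$
\begin{pmatrix} X_{i1} \\ X_{i2} \end{pmatrix} = \begin{pmatrix} T_{i1} \\ T_{i2} \end{pmatrix} + \begin{pmatrix} E_{i1} \\ E_{i2} \end{pmatrix}, \quad \text{where } \begin{pmatrix} E_{i1} \\ E_{i2} \end{pmatrix} \sim N(\mathbf{0}, \Sigma_E), \text{ and } \Sigma_E = \begin{pmatrix} \sigma^2_{E_1} & \sigma_{E_1E_2} \\ \sigma_{E_1E_2} & \sigma^2_{E_2} \end{pmatrix},
$$

and

$$
\begin{pmatrix} T_{i1} \\ T_{i2} \end{pmatrix} = \begin{pmatrix} \mu_{T_1} \\ \mu_{T_2} \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \end{pmatrix}, \quad \text{where } \begin{pmatrix} u_1 \\ u_2 \end{pmatrix} \sim N(\mathbf{0}, \Sigma_u), \text{ and } \Sigma_u = \begin{pmatrix} \sigma^2_{T_1} & \sigma_{T_1T_2} \\ \sigma_{T_1T_2} & \sigma^2_{T_2} \end{pmatrix},
$$

which is a multivariate empty model. The difference between the multilevel change-score model and the multivariate empty model is that the former model allows us to explicitly disentangle and thus examine important parameters, such as $\sigma_{T_1\Delta}$ and $\sigma^2_{\Delta}$. As an aside, one may notice that although the parameters in the multilevel change-score model cannot be estimated empirically, this multilevel change-score framework enables us to correctly examine the five common beliefs.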
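As a minimal numerical sketch of how the two levels combine, the following Python snippet builds the level 1 error covariance matrix $\Sigma_E$ and the level 2 true-score covariance matrix $\Sigma_u$ and adds them element by element, which reproduces the entries of $\Sigma$ in Equation (2.16). The function names are hypothetical, and the parameter values are illustrative only (they match the $\sigma^2_{\Delta} = .64$ rows of Table 2.3); in practice these parameters cannot be observed directly.

```python
def level1_cov(var_e1, var_e2, cov_e1e2):
    """Error covariance matrix Sigma_E (level 1)."""
    return [[var_e1, cov_e1e2],
            [cov_e1e2, var_e2]]


def level2_cov(var_t1, var_delta, cov_t1_delta):
    """True-score covariance matrix Sigma_u (level 2), Equation (2.13)."""
    cov_t1_t2 = cov_t1_delta + var_t1
    var_t2 = var_delta + 2 * cov_t1_delta + var_t1
    return [[var_t1, cov_t1_t2],
            [cov_t1_t2, var_t2]]


def add_matrices(a, b):
    """Element-wise sum of two matrices stored as nested lists."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# Illustrative values only (assumed, not estimated from data).
sigma_e = level1_cov(var_e1=0.1, var_e2=0.1, cov_e1e2=0.03)
sigma_u = level2_cov(var_t1=1.0, var_delta=0.64, cov_t1_delta=0.12)
sigma = add_matrices(sigma_e, sigma_u)

for row in sigma:
    print([round(value, 2) for value in row])
```

With these values, the observed pretest variance is 1.10, the observed posttest variance is 1.98, and the pretest-posttest covariance is 1.15.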

2.5 Negative Beliefs About Classical Change Scores Revisited: Change Scores Can Be Useful

Negative belief 1: Change scores are unreliable. Given the multilevel change-score framework, reliability of change scores, $\rho_{DD'}$, can be derived as

$$
\rho_{DD'} = \frac{\sigma^2_{\Delta}}{\sigma^2_{D}} = \frac{\sigma^2_{\Delta}}{\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}}. \tag{2.17}
$$

See Appendix B for the derivation of $\sigma^2_{D}$. Equation (2.17) shows that because $\sigma^2_{E_1}$, $\sigma^2_{E_2}$, and $\sigma_{E_1E_2}$ do not influence $\sigma^2_{\Delta}$ and vice versa, as long as $\sigma^2_{\Delta}$ is sufficiently larger than $\sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}$, reliability $\rho_{DD'}$ can be high. Why do many researchers still believe that reliability is intrinsically low? The cause lies in the assumptions researchers made. One may recall that Equations (2.6) and (2.7) assume observed pretest and posttest scores have equal standard deviations ($\sigma_1 = \sigma_2$), an assumption that several researchers make (Linn & Slinde, 1977; Lord, 1956; Stanley, 1967). Although a few researchers (e.g., R. H. Williams & Zimmerman, 1977, 1996; Zimmerman & Williams, 1982a, 1982b) pointed out that $\sigma_1 = \sigma_2$ produces low reliability, they failed to identify the real problem. That is, by assuming $\sigma_1 = \sigma_2$, which appears


impose additional constraints on the true change variance and the error variances. To see this, one may recall the variance of the observed pretest score and that of the posttest score defined in Equation (2.16) in the multilevel change-score framework, and notice that by assuming $\sigma_1 = \sigma_2$, one implicitly assumes that $\sigma^2_{E_1} + \sigma^2_{T_1} = \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma^2_{E_2}$, resulting in

$$
\sigma^2_{E_1} = \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{E_2}. \tag{2.18}
$$

Equation (2.18) shows that if between-person differences in change exist (i.e., $\sigma^2_{\Delta} > 0$), which is a necessary condition for positive change-score reliability (Equation 2.17), then the assumption $\sigma_1 = \sigma_2$ likely causes $\sigma^2_{E_1} \neq \sigma^2_{E_2}$, which contradicts the assumption $\sigma_{E_1} = \sigma_{E_2}$ that is often made in the literature (e.g., Lord, 1956; Rogosa, Brandt, & Zimowski, 1982; Willett, 1988). If one assumes both $\sigma_1 = \sigma_2$ and $\sigma_{E_1} = \sigma_{E_2}$, the direct consequence is that they force $\sigma^2_{\Delta} + 2\sigma_{T_1\Delta} = 0$, which in turn forces $\sigma_{T_1\Delta} = (-1/2)\sigma^2_{\Delta}$, hence $\sigma_{T_1\Delta} \leq 0$. This means that by making the two assumptions, researchers tacitly assume that $\sigma_{T_1\Delta} > 0$ cannot happen. In the literature, $\sigma_{T_1\Delta} > 0$ refers to a fan-spread pattern (Collins, 1996b; Hertzog, von Oertzen, Ghisletta, & Lindenberger, 2008; Raykov, 1993; see Figure 2.1). In the social and behavioral sciences, fan-spread patterns occur frequently. For example, in reading, the Matthew effect hypothesizes that the gap between able readers and poor readers in terms of reading development increases exponentially (e.g., Bast & Reitsma, 1997; Stanovich, 1986). Therefore, although it seems that assumptions $\sigma_1 = \sigma_2$ and $\sigma_{E_1} = \sigma_{E_2}$ apply to the pretest and posttest only, they actually impose restrictions on inter-individual parameters $\sigma^2_{\Delta}$ and $\sigma_{T_1\Delta}$, from which between-person differences in change are inferred.
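To make the hidden constraints concrete, the sketch below evaluates Equation (2.18) for a hypothetical fan-spread situation (positive covariance between true pretest score and true change). The function names and the numerical values are assumptions chosen for illustration; they are not taken from an empirical study.

```python
def forced_pretest_error_variance(var_delta, cov_t1_delta, var_e2):
    """Equation (2.18): pretest error variance implied by equal observed variances."""
    return var_delta + 2 * cov_t1_delta + var_e2


def forced_pretest_change_covariance(var_delta):
    """Covariance implied when observed variances AND error variances are equal."""
    return -0.5 * var_delta


# Hypothetical fan-spread scenario: true change covaries positively with pretest.
var_delta, cov_t1_delta, var_e2 = 0.64, 0.12, 0.1

var_e1 = forced_pretest_error_variance(var_delta, cov_t1_delta, var_e2)
print(round(var_e1, 2))   # pretest error variance forced by sigma_1 = sigma_2

forced_cov = forced_pretest_change_covariance(var_delta)
print(forced_cov)         # covariance forced by adding sigma_E1 = sigma_E2
```

In this scenario, equal observed variances force the pretest error variance to be almost ten times the posttest error variance (.98 versus .10), and adding the equal-error-variance assumption forces the covariance to be negative (-.32), which rules out the assumed fan-spread pattern.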


The multilevel framework has clarified that the assumption $\sigma_1 = \sigma_2$ imposes constraints on parameters both at the group level ($\sigma^2_{\Delta}$ and $\sigma_{T_1\Delta}$) and at the individual level ($\sigma^2_{E_1}$ and $\sigma^2_{E_2}$), and it is these additional constraints on the parameters that make the reliability of change scores appear always to be low. However, from both a theoretical and an empirical perspective, there is no need to make these additional restrictions. Change scores are not intrinsically unreliable, and whether change scores can be used is an empirical issue. For example, in a recent empirical study on the measurement of task-switching ability, Hughes, Linck, Bowles, Koeth, and Bunting (2014, p. 713) estimated the reliability of a few change measures without assuming $\sigma_1 = \sigma_2$, and the estimated reliabilities were greater than .8.

Even when their reliability is low, change scores are not entirely useless. For example, Overall and Woodward (1975) showed that low change-score reliability may result in high power for detecting mean changes. Change scores also can be used as a dependent variable in regression analysis, provided that large samples are collected to guarantee enough statistical power (Allison, 1990).

Researchers should not avoid using change scores simply because it is commonly believed that change-score reliability is low. For a particular study, researchers are advised to first estimate the reliability of change scores and then decide, based on the estimated reliability, whether the change scores are useful for that particular study. We summarize our conclusions to replace Belief 1 as follows:

Result 1: Change-score reliability is not intrinsically low, and its value in a particular application is an empirical issue.

Negative belief 2: Change-score reliability tends to be lower than the reliability of separate test scores; hence, use of change-score reliability is problematic if the reliability of either test score is low. One may recall that this belief is based on two assumptions, $\sigma_1 = \sigma_2$ and $\sigma_{E_1E_2} = 0$. The equality assumption $\sigma_1 = \sigma_2$ is problematic, because it also imposes constraints on change parameters both at the group level ($\sigma^2_{\Delta}$ and $\sigma_{T_1\Delta}$) and at the individual level ($\sigma^2_{E_1}$ and $\sigma^2_{E_2}$). Assumption $\sigma_{E_1E_2} = 0$ is part of the CTT framework, which might not be favorable in situations where, for example, a patient tries to recall her previous answers to the pretest items and tries to give answers to the posttest items consistent with the previous answers.


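$$
\rho_{11'} = \frac{\sigma^2_{T_1}}{\sigma^2_{T_1} + \sigma^2_{E_1}}, \tag{2.19}
$$

and

$$
\rho_{22'} = \frac{\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta}}{\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{E_2}}. \tag{2.20}
$$

Notice that $\sigma^2_{T_1}$, $\sigma^2_{\Delta}$, and $\sigma_{T_1\Delta}$ are from Level 2, and $\sigma^2_{E_1}$ and $\sigma^2_{E_2}$ are from Level 1. Reliabilities $\rho_{11'}$ and $\rho_{22'}$ thus combine the two levels. One may notice that equations (2.19) and (2.20) are not to be used to estimate the reliabilities at pretest and posttest, because $\sigma^2_{\Delta}$, $\sigma^2_{T_1}$, $\sigma_{T_1\Delta}$, $\sigma^2_{E_1}$, and $\sigma^2_{E_2}$ cannot directly be observed. Comparing equations (2.19) and (2.17), one can show that, as long as the true change variance $\sigma^2_{\Delta}$ is large enough, reliability $\rho_{DD'}$ is not necessarily smaller than $\rho_{11'}$ (see Appendix C). For example, if we have a homogeneous population at pretest (i.e., $\sigma^2_{T_1} = 0$) and persons change in different magnitudes, then $\rho_{11'} = 0$ and $\rho_{DD'} > 0$.

To investigate whether $\rho_{DD'}$ is likely to be larger or smaller than $\rho_{22'}$, we derived the following equation (see Appendix C),

$$
\rho_{DD'} - \rho_{22'} = \frac{\left(-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + 2\sigma_{T_1\Delta}\right) + \sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right)}{\left(\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{E_2}\right)}. \tag{2.21}
$$

The result $\rho_{DD'} > \rho_{22'}$ is possible under particular conditions (Appendix C). Table 2.1 presents numerical examples where $\rho_{DD'}$ can be larger than $\rho_{11'}$ and where $\rho_{DD'}$ can be larger than both $\rho_{11'}$ and $\rho_{22'}$. To facilitate understanding, in Table 2.1 we also present the correlation between observed change scores and observed pretest scores based on the multilevel model,

$$
\rho_{D1} = \frac{\sigma_{T_1\Delta} + \sigma_{E_1E_2} - \sigma^2_{E_1}}{\sqrt{\left(\sigma^2_{T_1} + \sigma^2_{E_1}\right)\left(\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)}}, \tag{2.22}
$$

(see Appendix D for the derivation of the numerator in Equation 2.22) and the correlation between true pretest score and true change score,

$$
\rho_{T_1\Delta} = \frac{\sigma_{T_1\Delta}}{\sqrt{\sigma^2_{T_1}\sigma^2_{\Delta}}}. \tag{2.23}
$$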


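Table 2.1. The relationship among $\rho_{DD'}$, $\rho_{11'}$, and $\rho_{22'}$.

| $\sigma_{E_1E_2}$ | $\sigma_{T_1\Delta}$ | $\rho_{T_1\Delta}$ | $\rho_{D1}$ | Reliability $\rho_{11'}$ | Reliability $\rho_{22'}$ | Reliability $\rho_{DD'}$ |
|---|---|---|---|---|---|---|
| .03 | -.12 | -.1 | -.14 | .91 | .96 | .91 |
| .05 | -.12 | -.1 | -.13 | .91 | .96 | .94 |
| .07 | -.12 | -.1 | -.11 | .91 | .96 | .96 |
| .03 | -.61 | -.5 | -.51 | .91 | .93 | .91 |
| .05 | -.61 | -.5 | -.50 | .91 | .93 | .94 |
| .07 | -.61 | -.5 | -.49 | .91 | .93 | .96 |

Note. $\sigma^2_{E_1} = \sigma^2_{E_2} = .1$, $\sigma^2_{T_1} = 1$, $\sigma^2_{\Delta} = 1.5$. To facilitate understanding, $\rho_{T_1\Delta}$ and $\rho_{D1}$ are also reported.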
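The reliabilities in Table 2.1 follow directly from Equations (2.17), (2.19), and (2.20). The sketch below recomputes them for the three rows with covariance between true pretest score and true change equal to -.12; the function name `reliabilities` is hypothetical, and the parameter values are the ones stated in the table note.

```python
def reliabilities(var_t1, var_delta, cov_t1_delta, var_e1, var_e2, cov_e1e2):
    """Return (pretest, posttest, change-score) reliabilities.

    Implements Equations (2.19), (2.20), and (2.17), respectively.
    """
    var_t2 = var_t1 + var_delta + 2 * cov_t1_delta   # true posttest variance
    rel_pre = var_t1 / (var_t1 + var_e1)
    rel_post = var_t2 / (var_t2 + var_e2)
    rel_change = var_delta / (var_delta + var_e1 + var_e2 - 2 * cov_e1e2)
    return rel_pre, rel_post, rel_change


# Parameter values from the note to Table 2.1.
table_rows = []
for cov_e1e2 in (0.03, 0.05, 0.07):
    values = reliabilities(var_t1=1.0, var_delta=1.5, cov_t1_delta=-0.12,
                           var_e1=0.1, var_e2=0.1, cov_e1e2=cov_e1e2)
    table_rows.append(tuple(round(v, 2) for v in values))
    print(cov_e1e2, table_rows[-1])
```

Rounded to two decimals, the change-score reliability rises from .91 to .96 as the error covariance grows, while the pretest and posttest reliabilities stay at .91 and .96.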


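Table 2.2 The relationship among $\rho_{DD'}$, $\rho_{11'}$, and $\rho_{22'}$, when MID is taken into account.

| $\sigma_{E_1E_2}$ | $\sigma_{T_1\Delta}$ | $\rho_{T_1\Delta}$ | $\rho_{D1}$ | Reliability $\rho_{11'}$ | Reliability $\rho_{22'}$ | Reliability $\rho_{DD'}$ |
|---|---|---|---|---|---|---|
| .03 | -.04 | -.11 | -.20 | .91 | .91 | .50 |
| .05 | -.04 | -.11 | -.18 | .91 | .91 | .58 |
| .07 | -.04 | -.11 | -.15 | .91 | .91 | .70 |
| .03 | -.19 | -.51 | -.47 | .91 | .88 | .50 |
| .05 | -.19 | -.51 | -.47 | .91 | .88 | .58 |
| .07 | -.19 | -.51 | -.47 | .91 | .88 | .70 |

Note. $\sigma^2_{E_1} = \sigma^2_{E_2} = .1$, $\sigma^2_{T_1} = 1$, $\sigma^2_{\Delta} = .14$. To facilitate understanding, $\rho_{T_1\Delta}$ and $\rho_{D1}$ are also reported.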

Table 2.2 shows that when taking MID into account, change-score reliability is smaller than pretest and posttest reliabilities, because in general change observed in clinical settings is quite small (Norman et al., 2003): the variance of the change scores ranges from .5 SD of the pretest scores (i.e., MID) to 1 SD. Thus, empirical studies in which true change is small are likely to report that change-score reliability is lower than pretest and posttest reliabilities. We now use Equation (2.21) as an example to explain why small $\sigma^2_{\Delta}$ is likely to result in $\rho_{DD'} - \rho_{22'} < 0$. The denominator of Equation (2.21) is always positive (see Appendix C), so we focus on the sign of the numerator. Notice that $-\sigma_{T_1}\sigma_{\Delta} \leq \sigma_{T_1\Delta} \leq \sigma_{T_1}\sigma_{\Delta}$, meaning that the upper and lower bounds of $\sigma_{T_1\Delta}$ are controlled by the size of $\sigma^2_{\Delta}$, while keeping $\sigma^2_{T_1}$ constant. When $\sigma^2_{\Delta}$ goes to 0, $\sigma_{T_1\Delta}$ goes to 0, and the first term of the numerator of Equation (2.21), $\left(-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + 2\sigma_{T_1\Delta}\right)$, becomes $\left(-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2}\right)\sigma^2_{T_1}$, which is smaller than or equal to 0. Also when $\sigma^2_{\Delta}$ goes to 0, the second term of the numerator, $\sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right)$, becomes 0. Thus, when $\sigma^2_{\Delta}$ goes to 0, the numerator becomes a non-positive number, which explains why small $\sigma^2_{\Delta}$ likely results in $\rho_{DD'} < \rho_{22'}$.
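As a quick numerical check of this argument, the following sketch computes the difference between change-score reliability and posttest reliability in two ways: directly from Equations (2.17) and (2.20), and from the closed form in Equation (2.21). The function names are hypothetical; the first case uses the parameter values of the first row of Table 2.2, and the second case is the limiting situation without any variance in true change.

```python
def gap_direct(var_t1, var_delta, cov_t1_delta, var_e1, var_e2, cov_e1e2):
    """rho_DD' - rho_22' computed from Equations (2.17) and (2.20)."""
    var_t2 = var_t1 + var_delta + 2 * cov_t1_delta
    rel_change = var_delta / (var_delta + var_e1 + var_e2 - 2 * cov_e1e2)
    rel_post = var_t2 / (var_t2 + var_e2)
    return rel_change - rel_post


def gap_closed_form(var_t1, var_delta, cov_t1_delta, var_e1, var_e2, cov_e1e2):
    """rho_DD' - rho_22' computed from the closed form in Equation (2.21)."""
    error_term = -var_e1 - var_e2 + 2 * cov_e1e2
    numerator = (error_term * (var_t1 + 2 * cov_t1_delta)
                 + var_delta * (2 * cov_e1e2 - var_e1))
    denominator = ((var_delta + var_e1 + var_e2 - 2 * cov_e1e2)
                   * (var_t1 + var_delta + 2 * cov_t1_delta + var_e2))
    return numerator / denominator


common = dict(var_t1=1.0, var_e1=0.1, var_e2=0.1, cov_e1e2=0.03)
cases = [
    dict(var_delta=0.14, cov_t1_delta=-0.04),  # first row of Table 2.2
    dict(var_delta=0.0, cov_t1_delta=0.0),     # limiting case: no true change variance
]

gaps = []
for case in cases:
    direct = gap_direct(**common, **case)
    closed = gap_closed_form(**common, **case)
    gaps.append((direct, closed))
    print(round(direct, 3), round(closed, 3))
```

Both routes give the same negative difference in each case, in line with the explanation that a small true change variance pushes the numerator of Equation (2.21) below zero.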


Result 2: Change-score reliability is not intrinsically lower than the reliability of the separate test scores, and its value in a particular application is an empirical issue.

Negative belief 3: If change-score reliability is low, then change measurement at the individual level is rendered useless. In the multilevel change-score framework, measurement precision of change ($\sigma_{E_D}$) can be derived from Equation (2.17) as

$$
\sigma_{E_D} = \sqrt{\sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}}. \tag{2.24}
$$

Because $\sigma^2_{E_1}$, $\sigma^2_{E_2}$, and $\sigma_{E_1E_2}$ are parameters at the individual level in the variance-covariance matrix $\Sigma_E$, $\sigma_{E_D}$ also is a parameter at the individual level. Whether individual change can be measured precisely is determined completely by $\sigma_{E_D}$ and is independent of the reliability of change scores. Imagine that after treatment no one has changed, and therefore, because true change is zero, change-score reliability is zero. In this case, as long as the measurement errors are small, individuals may still be measured precisely. Collins (1996a, p. 290) argued that: “it is possible for a measure to show poor reliability, even to have a reliability of 0, and yet to be a highly precise measure of change.” Similarly, Overall and Woodward (1975, p. 86) observed that “the t test [for paired observations] can be calculated entirely from difference scores that have zero reliability,” which “provide[s] a superior basis for rejection of the null hypothesis” (ibid., p. 85). Hence, for individual decisions, the precision, rather than the reliability, decides whether an individual’s change score is useful. Sijtsma and van der Ark (2015) noted that measurement practitioners often ignored this conclusion. We summarize our conclusions to replace Belief 3 as follows:
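The separation between precision and reliability can be illustrated with a few lines of Python. In this sketch (hypothetical function names, illustrative error parameters equal to those in the table notes), the standard error of the change score from Equation (2.24) stays the same while change-score reliability from Equation (2.17) moves from 0 to above .9 as the variance of true change increases.

```python
import math


def change_score_sem(var_e1, var_e2, cov_e1e2):
    """Equation (2.24): measurement precision (standard error) of the change score."""
    return math.sqrt(var_e1 + var_e2 - 2 * cov_e1e2)


def change_score_reliability(var_delta, var_e1, var_e2, cov_e1e2):
    """Equation (2.17): reliability of the change score."""
    return var_delta / (var_delta + var_e1 + var_e2 - 2 * cov_e1e2)


# Illustrative error parameters; only the true change variance varies.
errors = dict(var_e1=0.1, var_e2=0.1, cov_e1e2=0.03)
sem = change_score_sem(**errors)

results = []
for var_delta in (0.0, 0.14, 1.5):
    rel = change_score_reliability(var_delta, **errors)
    results.append((var_delta, round(rel, 2), round(sem, 3)))
    print(results[-1])
```

The precision is identical in all three scenarios because it depends only on level 1 parameters, whereas reliability depends on how much persons differ in true change.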

Result 3: Measurement precision rather than change-score reliability determines whether an individual’s change score is useful for making individual decisions.

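Negative belief 4: Change-score reliability and the correlation between the pretest and the posttest scores cannot both be high. Arguments supporting this belief are based on Equation (2.6), which involves the problematic assumption $\sigma_1 = \sigma_2$. In fact, $\rho_{DD'}$ and $\rho_{12}$ can be high simultaneously. To see this, from Equation (2.16) we first derive

$$
\rho_{12} = \frac{\sigma^2_{T_1} + \sigma_{T_1\Delta} + \sigma_{E_1E_2}}{\sqrt{\left(\sigma^2_{T_1} + \sigma^2_{E_1}\right)\left(\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{E_2}\right)}}. \tag{2.25}
$$

According to equations (2.17) and (2.25), $\rho_{DD'}$ and $\rho_{12}$ are influenced by five quantities, which are $\sigma^2_{\Delta}$, $\sigma_{T_1\Delta}$, $\sigma^2_{E_1}$, $\sigma^2_{E_2}$, and $\sigma_{E_1E_2}$.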


variance of a variable or the loading of a latent variable is fixed to be 1 to identify the scale. Table 2.3 presents a summary showing how $\rho_{DD'}$ and $\rho_{12}$ change when one of the quantities changes, keeping the other four quantities fixed.

Table 2.3 shows that, when $\sigma^2_{\Delta}$ increases while keeping the remaining four quantities fixed, $\rho_{DD'}$ increases and $\rho_{12}$ decreases. When true change-score variance is small, such as in clinical settings where MID is used, $\rho_{DD'}$ and $\rho_{12}$ are not simultaneously high; see Condition I in Table 2.3. However, this result does not justify the acceptance of the unreliability-invalidity dilemma in each and every situation, because if $\sigma^2_{\Delta}$ is sufficiently high, which might be the case in some research fields, $\rho_{DD'}$ and $\rho_{12}$ can be high simultaneously; see Condition I in Table 2.3. For Condition II ($\sigma_{T_1\Delta}$ changes), Condition III ($\sigma^2_{E_1}$ changes), Condition IV ($\sigma^2_{E_2}$ changes), and Condition V ($\sigma_{E_1E_2}$ changes), Table 2.3 shows that when $\sigma^2_{\Delta}$ is high, $\rho_{DD'}$ and $\rho_{12}$ can be high simultaneously. However, this result is unlikely to be observed in clinical settings where $\sigma^2_{\Delta}$ typically is rather small; see all the conditions in Table 2.3 where $\sigma^2_{\Delta} = .14$. Thus, we have illustrated that Belief 4 is based on the problematic assumption $\sigma_1 = \sigma_2$ and on empirical observations, for example, in clinical settings, where true change-score variance typically is small. Whether $\rho_{DD'}$ and $\rho_{12}$ can be simultaneously high is an empirical issue and therefore should not be accepted as if it were always true. Finally, we summarize the conclusions to replace Belief 4 as follows:
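The pattern described for Condition I of Table 2.3 can be recomputed from Equations (2.17) and (2.25). The sketch below uses hypothetical function names and the Condition I parameter values from the table (true pretest variance 1, covariance between true pretest score and true change .12, error variances .1, error covariance .03), and varies only the variance of true change.

```python
import math


def change_score_reliability(var_delta, var_e1, var_e2, cov_e1e2):
    """Equation (2.17): reliability of the change score."""
    return var_delta / (var_delta + var_e1 + var_e2 - 2 * cov_e1e2)


def pretest_posttest_correlation(var_t1, var_delta, cov_t1_delta,
                                 var_e1, var_e2, cov_e1e2):
    """Equation (2.25): correlation between observed pretest and posttest scores."""
    covariance = var_t1 + cov_t1_delta + cov_e1e2
    var_pre = var_t1 + var_e1
    var_post = var_t1 + var_delta + 2 * cov_t1_delta + var_e2
    return covariance / math.sqrt(var_pre * var_post)


# Condition I of Table 2.3: only the true change variance varies.
fixed = dict(var_e1=0.1, var_e2=0.1, cov_e1e2=0.03)
condition_one = []
for var_delta in (0.14, 0.64, 1.04):
    rel = change_score_reliability(var_delta, **fixed)
    corr = pretest_posttest_correlation(1.0, var_delta, 0.12, **fixed)
    condition_one.append((var_delta, round(rel, 2), round(corr, 2)))
    print(condition_one[-1])
```

As the variance of true change grows from .14 to 1.04, change-score reliability rises from .50 to .88 while the pretest-posttest correlation falls from .90 to .71, which matches the Condition I rows.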

Result 4: The unreliability-invalidity dilemma is not an intrinsic characteristic of change scores. Whether change-score reliability and the correlation between the pretest and the posttest scores can be high simultaneously is an empirical issue.

Negative belief 5: Change scores are inappropriate, because they typically correlate negatively with pretest scores. One may recall that this belief was based on the assumption $\sigma_1 = \sigma_2$ (e.g., Linn & Slinde, 1977), which is problematic, implying that Linn and Slinde's


paired samples t test is most powerful when every person in the sample changes by the same magnitude, which implies zero change-score reliability (Overall & Woodward, 1975).

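The pretest score often is also considered as a predictor of change. For example, researchers may hypothesize that the effectiveness of a treatment depends on pretest status. In this case, $\rho_{D1}$ should not be at the center of attention in the first place (Lord, 1963; Rogosa et al., 1982). Because of measurement errors, $\rho_{D1}$ gives a negatively biased estimate of $\rho_{T_1\Delta}$. The relation between $\rho_{D1}$ and $\rho_{T_1\Delta}$ can be derived by means of the multilevel change-score framework in which

$$
\rho_{D1} = \frac{\sigma_{T_1\Delta} + \sigma_{E_1E_2} - \sigma^2_{E_1}}{\sqrt{\left(\sigma^2_{T_1} + \sigma^2_{E_1}\right)\left(\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)}}. \tag{2.26}
$$

It also follows that

$$
\rho_{T_1\Delta} = \frac{\sigma_{T_1\Delta}}{\sigma_{T_1}\sigma_{\Delta}}. \tag{2.27}
$$

Combining equations (2.26) and (2.27), the next equation, which is

$$
\rho_{D1} = \frac{\sigma_{T_1}\sigma_{\Delta}\rho_{T_1\Delta} + \sigma_{E_1E_2} - \sigma^2_{E_1}}{\sqrt{\left(\sigma^2_{T_1} + \sigma^2_{E_1}\right)\left(\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)}}, \tag{2.28}
$$

demonstrates that caution must be exercised when inferring $\rho_{T_1\Delta}$ from $\rho_{D1}$. For example, suppose $\rho_{T_1\Delta} > 0$ and $\sigma_{T_1}\sigma_{\Delta}\rho_{T_1\Delta} + \sigma_{E_1E_2} - \sigma^2_{E_1} < 0$, then $\rho_{D1} < 0$, and in this case $\rho_{T_1\Delta}$ and $\rho_{D1}$ have opposite signs. Hence, one cannot conclude that $\rho_{T_1\Delta} < 0$ simply because $\rho_{D1} < 0$. Suppose $\rho_{T_1\Delta} = 0$, then $\rho_{D1} < 0$ as long as $\sigma_{E_1E_2} = 0$ or $\sigma_{E_2} - \sigma_{E_1} < 0$, because $\sigma_{E_1E_2} - \sigma^2_{E_1} \leq \sigma_{E_1}\sigma_{E_2} - \sigma^2_{E_1}$. In Table 2.3, we show how $\rho_{T_1\Delta}$ and $\rho_{D1}$ are influenced by $\sigma^2_{\Delta}$, $\sigma_{T_1\Delta}$, $\sigma^2_{E_1}$, $\sigma^2_{E_2}$, and $\sigma_{E_1E_2}$. This numerical example suggests that $\rho_{D1}$ can be much smaller than $\rho_{T_1\Delta}$ and that $\rho_{D1}$ and $\rho_{T_1\Delta}$ can have opposite signs, and thus drawing an inference about $\rho_{T_1\Delta}$ based on $\rho_{D1}$ is unwarranted.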
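A small sketch can show the sign reversal numerically. The function names below are hypothetical; the parameter values are those of the first row of Condition II in Table 2.3, where the covariance between true pretest score and true change is positive but smaller than the pretest error variance minus the error covariance.

```python
import math


def true_correlation(var_t1, var_delta, cov_t1_delta):
    """Equation (2.27): correlation between true pretest score and true change."""
    return cov_t1_delta / math.sqrt(var_t1 * var_delta)


def observed_correlation(var_t1, var_delta, cov_t1_delta,
                         var_e1, var_e2, cov_e1e2):
    """Equation (2.26): correlation between observed change and observed pretest."""
    numerator = cov_t1_delta + cov_e1e2 - var_e1
    denominator = math.sqrt((var_t1 + var_e1)
                            * (var_delta + var_e1 + var_e2 - 2 * cov_e1e2))
    return numerator / denominator


# First row of Condition II in Table 2.3.
rho_true = true_correlation(var_t1=1.0, var_delta=0.14, cov_t1_delta=0.04)
rho_observed = observed_correlation(var_t1=1.0, var_delta=0.14, cov_t1_delta=0.04,
                                    var_e1=0.1, var_e2=0.1, cov_e1e2=0.03)

print(round(rho_true, 2), round(rho_observed, 2))
```

Here the true correlation is positive (about .11) while the observed correlation is negative (about -.05), so a negative observed correlation by itself says little about the sign of the true correlation.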

As an aside, it is possible to statistically correct for the negative bias in $\rho_{D1}$, but this requires an unbiased estimate of reliability. In addition, when it is necessary to control for $\rho_{D1}$


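Table 2.3 A summary table of $\rho_{DD'}$, $\rho_{12}$, $\rho_{T_1\Delta}$, and $\rho_{D1}$, given various values of $\sigma^2_{\Delta}$, $\sigma_{T_1\Delta}$, $\sigma^2_{E_1}$, $\sigma^2_{E_2}$, and $\sigma_{E_1E_2}$.

| Condition | $\sigma^2_{\Delta}$ | $\sigma^2_{T_1}$ | $\sigma_{T_1\Delta}$ | $\sigma^2_{E_1}$ | $\sigma^2_{E_2}$ | $\sigma_{E_1E_2}$ | $\rho_{E_1E_2}$ | $\rho_{11'}$ | $\rho_{22'}$ | $\rho_{DD'}$ | $\rho_{12}$ | $\rho_{T_1\Delta}$ | $\rho_{D1}$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| I. $\sigma^2_{\Delta}$ changes | .14 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .93 | .50 | .90 | .32 | .09 |
| | .64 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .95 | .82 | .78 | .15 | .05 |
| | 1.04 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .96 | .88 | .71 | .12 | .04 |
| II. $\sigma_{T_1\Delta}$ changes | .14 | 1 | .04 | .1 | .1 | .03 | .30 | .91 | .92 | .50 | .89 | .11 | -.05 |
| | .14 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .93 | .50 | .90 | .32 | .09 |
| | .14 | 1 | .19 | .1 | .1 | .03 | .30 | .91 | .94 | .50 | .91 | .51 | .22 |
| | .64 | 1 | .08 | .1 | .1 | .03 | .30 | .91 | .95 | .82 | .77 | .10 | .01 |
| | .64 | 1 | .24 | .1 | .1 | .03 | .30 | .91 | .95 | .82 | .81 | .30 | .18 |
| | .64 | 1 | .40 | .1 | .1 | .03 | .30 | .91 | .96 | .82 | .86 | .50 | .36 |
| III. $\sigma^2_{E_1}$ changes | .14 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .93 | .50 | .90 | .32 | .09 |
| | .14 | 1 | .12 | .2 | .1 | .03 | .21 | .83 | .93 | .37 | .86 | .32 | -.07 |
| | .14 | 1 | .12 | .3 | .1 | .03 | .17 | .77 | .93 | .29 | .83 | .32 | -.19 |
| | .64 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .95 | .82 | .78 | .15 | .05 |
| | .64 | 1 | .12 | .2 | .1 | .03 | .21 | .83 | .95 | .73 | .75 | .15 | -.05 |
| | .64 | 1 | .12 | .3 | .1 | .03 | .17 | .77 | .95 | .65 | .72 | .15 | -.13 |
| IV. $\sigma^2_{E_2}$ changes | .14 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .93 | .50 | .90 | .32 | .09 |
| | .14 | 1 | .12 | .1 | .2 | .03 | .21 | .91 | .87 | .37 | .87 | .32 | .08 |
| | .14 | 1 | .12 | .1 | .3 | .03 | .17 | .91 | .82 | .29 | .85 | .32 | .07 |
| | .64 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .95 | .82 | .78 | .15 | .05 |
| | .64 | 1 | .12 | .1 | .2 | .03 | .21 | .91 | .90 | .73 | .76 | .15 | .05 |
| | .64 | 1 | .12 | .1 | .3 | .03 | .17 | .91 | .86 | .65 | .74 | .15 | .05 |
| V. $\sigma_{E_1E_2}$ changes | .14 | 1 | .12 | .1 | .1 | .01 | .10 | .91 | .93 | .44 | .89 | .32 | .05 |
| | .14 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .93 | .50 | .90 | .32 | .09 |
| | .14 | 1 | .12 | .1 | .1 | .05 | .50 | .91 | .93 | .58 | .92 | .32 | .14 |
| | .64 | 1 | .12 | .1 | .1 | .01 | .10 | .91 | .95 | .78 | .77 | .15 | .03 |
| | .64 | 1 | .12 | .1 | .1 | .03 | .30 | .91 | .95 | .82 | .78 | .15 | .05 |
| | .64 | 1 | .12 | .1 | .1 | .05 | .50 | .91 | .95 | .86 | .79 | .15 | .08 |

Note. We fixed $\sigma^2_{T_1} = 1$, which is to identify the scale. For Conditions II, III, IV, and V, we considered two situations regarding $\sigma^2_{\Delta}$: $\sigma^2_{\Delta} = .14$ and $\sigma^2_{\Delta} = .64$. The former situation ($\sigma^2_{\Delta} = .14$) is consistent with MID and common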


Finally, if the purpose is to understand each individual’s change trajectory, $\rho_{D1}$ and $\rho_{T_1\Delta}$ are inappropriate. Instead, one should resort to longitudinal models because change trajectories require change measurement at more than two time points (e.g., Raudenbush, 2001; Rogosa et al., 1982). We summarize our conclusions to replace Belief 5 as follows:

Result 5: A negative correlation between the observed change score and the observed pretest score does not entail a negative correlation between the true change score and the true pretest score. Change scores are not necessarily inappropriate, and the usefulness of change scores depends on the specific research question envisaged.

2.6 Discussion

We presented a multilevel perspective that was helpful for analyzing the negative beliefs about change scores and for suggesting that change scores can be useful after all. Properly distinguishing between within-person change at the individual level and between-person differences in change at the group level is of utmost importance for understanding the psychometric properties of change measurement. For example, one should not confuse measurement precision, which pertains to the individual level, with reliability, which pertains to the group level, and one should not impose assumptions at one level without examining the consequences at the other level. In addition, we notice that our two-level framework also corresponds to Glaser's (1963) distinction between norm-referenced measurement and criterion-referenced measurement (see also Crocker & Algina, 2008, pp. 105-212). Norm-referenced measurement requires that persons be distinguished from each other, and therefore, reliability should be closely examined. Criterion-referenced measurement focuses on whether an individual person meets a certain requirement (e.g., a minimum score of 60 out of 100), and therefore, measurement precision is important.


For practical use of change scores, we advise researchers to focus on the psychometric properties of change scores that are relevant to their research questions. If the purpose is to study the inter-individual differences in change, reliability of change scores is important. If the purpose is to make decisions on each individual person, then attention should be paid to measurement precision. Not every research question concerning change scores requires high change-score reliability, and therefore, researchers should not automatically dismiss change scores if reliability is low. Two comments concerning estimating change-score reliability and interpreting change scores are in order.

First, we need better methods for modeling and estimating change-score reliability. It is likely that in pretest–posttest studies, a person’s measurement errors between two measurement occasions are correlated. Because correlated measurement errors are not assumed in the CTT framework, a multilevel framework is preferable. Error covariance $\sigma_{E_1E_2}$ cannot be estimated in pretest–posttest designs, and hence, a better approach to modeling change is to use longitudinal models based on at least three measurement occasions (Rogosa et al., 1982). In addition, given that the traditional CTT approach to estimating change-score reliability (Equation 2.4) may be inadequate in cases where measurement errors are correlated (e.g., R. H. Williams & Zimmerman, 1977; Zimmerman & Williams, 1982a), better methods for estimating change-score reliability are needed. Studying such models deserves full attention but is beyond the scope of this chapter.

Second, we need methods for properly interpreting change scores. To interpret change scores, substantial meaning must be assigned to raw change scores so that, for example, clinicians can provide meaningful feedback to a patient about her progress. The process of assigning meaning based on reference to a score distribution is known as norming (Angoff, 1984, p. 39). As far as we know, little research is available about norms for individual change, possibly because researchers’ pessimistic beliefs about the use of change scores have prohibited the development of such knowledge. We believe that norming change scores is important for future research, where practical guidelines can be provided for users who intend to interpret change scores.


2.7 Appendices

2.7.1 Appendix A

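Equation (2.11) models the true pretest score $T_{i1}$ and the true posttest score $T_{i1} + \Delta_i$. Alternatively, we may model the true pretest score $T_{i1}$ and the true change $\Delta_i$ (rather than the true posttest score) as follows:

$$
\begin{pmatrix} T_{i1} \\ \Delta_i \end{pmatrix} = \begin{pmatrix} \mu_{T_1} \\ \mu_{\Delta} \end{pmatrix} + \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}, \tag{2.A1}
$$

with

$$
\begin{pmatrix} v_1 \\ v_2 \end{pmatrix} \sim N(\mathbf{0}, \Sigma_v), \tag{2.A2}
$$

where $\Sigma_v = \begin{pmatrix} \sigma^2_{T_1} & \sigma_{T_1\Delta} \\ \sigma_{T_1\Delta} & \sigma^2_{\Delta} \end{pmatrix}$. $\sigma^2_{T_1}$ denotes the variance of the true pretest score. $\sigma_{T_1\Delta}$ denotes the covariance between the true pretest score and true change. $\sigma^2_{\Delta}$ denotes the variance of true change.

Given equations (2.A1) and (2.A2), we can derive the variance of the true posttest score, denoted by $\sigma^2_{T_2}$, and the covariance between the true pretest score and the true posttest score, denoted by $\sigma_{T_1T_2}$, as follows:

$$
\sigma^2_{T_2} = \sigma^2_{T_1} + 2\sigma_{T_1\Delta} + \sigma^2_{\Delta}, \tag{2.A3}
$$

and

$$
\sigma_{T_1T_2} = \sigma^2_{T_1} + \sigma_{T_1\Delta}. \tag{2.A4}
$$

Note that in Equation (2.13) the covariance matrix $\Sigma_u$ can be expressed as

$$
\Sigma_u = \begin{pmatrix} \sigma^2_{T_1} & \sigma_{T_1T_2} \\ \sigma_{T_1T_2} & \sigma^2_{T_2} \end{pmatrix}, \tag{2.A5}
$$

and given equations (2.A3) and (2.A4), we thus can derive

$$
\Sigma_u = \begin{pmatrix} \sigma^2_{T_1} & \sigma^2_{T_1} + \sigma_{T_1\Delta} \\ \sigma^2_{T_1} + \sigma_{T_1\Delta} & \sigma^2_{T_1} + 2\sigma_{T_1\Delta} + \sigma^2_{\Delta} \end{pmatrix}.
$$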
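One way to check this derivation numerically is to note that the vector of true pretest and true posttest scores is a linear transformation of the vector of true pretest score and true change, so that its covariance matrix is $A \Sigma_v A^\top$ with $A = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$. This matrix formulation is not used in the appendix itself; it is offered here only as a sketch, with hypothetical helper functions and illustrative parameter values.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Transpose of a matrix stored as nested lists."""
    return [list(column) for column in zip(*a)]


# Illustrative values: var(T1) = 1, cov(T1, Delta) = .12, var(Delta) = .64.
var_t1, cov_t1_delta, var_delta = 1.0, 0.12, 0.64
sigma_v = [[var_t1, cov_t1_delta],
           [cov_t1_delta, var_delta]]

# Maps (T1, Delta) to (T1, T1 + Delta).
A = [[1, 0],
     [1, 1]]

sigma_u = matmul(matmul(A, sigma_v), transpose(A))

# Entries predicted by Equations (2.A3) and (2.A4).
expected = [[var_t1, var_t1 + cov_t1_delta],
            [var_t1 + cov_t1_delta, var_t1 + 2 * cov_t1_delta + var_delta]]

print(sigma_u)
print(expected)
```

Both printed matrices agree up to floating-point rounding, which is the content of Equations (2.A3) to (2.A5).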


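2.7.2 Appendix B

Let $\sigma^2_1$ denote the variance of the observed pretest score, let $\sigma^2_2$ denote the variance of the observed posttest score, and let $\sigma_{12}$ denote the covariance between the pretest score and the posttest score. Then, the variance of change scores $\sigma^2_{D}$ is

$$
\sigma^2_{D} = \sigma^2_2 - 2\sigma_{12} + \sigma^2_1. \tag{2.B1}
$$

According to Equation (2.16),

$$
\sigma^2_2 = \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma^2_{E_2}, \tag{2.B2}
$$

$$
\sigma_{12} = \sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma_{E_1E_2}, \tag{2.B3}
$$

and

$$
\sigma^2_1 = \sigma^2_{E_1} + \sigma^2_{T_1}. \tag{2.B4}
$$

Thus, replacing the terms on the right-hand side of Equation (2.B1) with equations (2.B2), (2.B3), and (2.B4), we obtain

$$
\sigma^2_{D} = \sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2},
$$

which is the denominator of Equation (2.17).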
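The substitution can be spelled out term by term. The following worked step is a sketch that uses only Equations (2.B1) to (2.B4); the terms in the true pretest variance and in the covariance between true pretest score and true change cancel, which leaves the denominator of Equation (2.17).

```latex
\begin{aligned}
\sigma^2_{D}
  &= \sigma^2_2 - 2\sigma_{12} + \sigma^2_1 \\
  &= \left(\sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma^2_{E_2}\right)
     - 2\left(\sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma_{E_1E_2}\right)
     + \left(\sigma^2_{E_1} + \sigma^2_{T_1}\right) \\
  &= \sigma^2_{\Delta}
     + \left(2\sigma_{T_1\Delta} - 2\sigma_{T_1\Delta}\right)
     + \left(\sigma^2_{T_1} - 2\sigma^2_{T_1} + \sigma^2_{T_1}\right)
     + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2} \\
  &= \sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}.
\end{aligned}
```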


2.7.3 Appendix C

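Here, it suffices to show that it is possible to obtain positive values for $\rho_{DD'} - \rho_{11'}$ and $\rho_{DD'} - \rho_{22'}$ theoretically. Whether $\rho_{DD'} - \rho_{11'} > 0$ and $\rho_{DD'} - \rho_{22'} > 0$ are observed in empirical studies is irrelevant.

$\rho_{DD'} - \rho_{11'}$ can be derived as follows. Given equations (2.17) and (2.19),

$$
\rho_{DD'} - \rho_{11'} = \frac{\sigma^2_{\Delta}\sigma^2_{E_1} - \sigma^2_{T_1}\left(\sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)}{\left(\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + \sigma^2_{E_1}\right)}. \tag{2.C1}
$$

According to the Cauchy–Schwarz inequality, $\sigma_{E_1E_2} \leq \sigma_{E_1}\sigma_{E_2}$, which implies that

$$
\sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2} \geq \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1}\sigma_{E_2} = \left(\sigma_{E_1} - \sigma_{E_2}\right)^2 \geq 0.
$$

This means that the denominator of Equation (2.C1) is positive, except in degenerate cases in which it equals 0, for example when $\sigma_{E_1} = \sigma_{E_2} = \sigma_{T_1} = \sigma_{\Delta} = 0$. The numerator of Equation (2.C1) can be positive as well, as long as (for example) $\sigma^2_{\Delta}$ is high enough (keeping everything else constant) so that $\sigma^2_{\Delta}\sigma^2_{E_1} > \sigma^2_{T_1}\left(\sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)$. Thus, we have shown that $\rho_{DD'} - \rho_{11'} > 0$ is possible.

$\rho_{DD'} - \rho_{22'}$ can be derived as follows. Given equations (2.17) and (2.20),

$$
\begin{aligned}
\rho_{DD'} - \rho_{22'}
  &= \frac{\sigma^2_{\Delta}}{\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}} - \frac{\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta}}{\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{E_2}} \\
  &= \frac{\sigma^2_{\Delta}\left(\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{E_2}\right) - \left(\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta}\right)\left(\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)}{\left(\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{E_2}\right)} \\
  &= \frac{\left(-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + 2\sigma_{T_1\Delta}\right) + \sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right)}{\left(\sigma^2_{\Delta} + \sigma^2_{E_1} + \sigma^2_{E_2} - 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + \sigma^2_{\Delta} + 2\sigma_{T_1\Delta} + \sigma^2_{E_2}\right)}.
\end{aligned} \tag{2.C2}
$$

We first examine the denominator of Equation (2.C2). According to the Cauchy–Schwarz inequality,

$$
-\sigma_{E_1}\sigma_{E_2} \leq \sigma_{E_1E_2} \leq \sigma_{E_1}\sigma_{E_2},
$$

and

$$
-\sigma_{T_1}\sigma_{\Delta} \leq \sigma_{T_1\Delta} \leq \sigma_{T_1}\sigma_{\Delta},
$$

and thus, it can be proven that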


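We now examine whether the numerator of Equation (2.C2), $\left(-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + 2\sigma_{T_1\Delta}\right) + \sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right)$, can be positive. According to the Cauchy–Schwarz inequality, $-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2}$ cannot be positive:

$$
-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2} \leq -\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1}\sigma_{E_2} = -\left(\sigma_{E_1} - \sigma_{E_2}\right)^2 \leq 0. \tag{2.C3}
$$

Therefore, the numerator of Equation (2.C2) can be positive due to the following three sufficient conditions:

1) $\sigma^2_{T_1} + 2\sigma_{T_1\Delta} < 0$ and $\sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right) > 0$;
2) $\sigma^2_{T_1} + 2\sigma_{T_1\Delta} < 0$, $\sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right) < 0$, but $\left(-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + 2\sigma_{T_1\Delta}\right) + \sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right) > 0$; and
3) $\sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right) > 0$, $\sigma^2_{T_1} + 2\sigma_{T_1\Delta} > 0$, but $\left(-\sigma^2_{E_1} - \sigma^2_{E_2} + 2\sigma_{E_1E_2}\right)\left(\sigma^2_{T_1} + 2\sigma_{T_1\Delta}\right) + \sigma^2_{\Delta}\left(2\sigma_{E_1E_2} - \sigma^2_{E_1}\right) > 0$.

The three conditions can happen given suitable values for the parameters. Taking the first condition as an example, the numerator of Equation (2.C2) is positive if $\sigma_{T_1\Delta} \leq \left(-\tfrac{1}{2}\right)\sigma^2_{T_1}$ and $\sigma_{E_1E_2} > \tfrac{1}{2}\sigma^2_{E_1}$. Thus, we have shown that $\rho_{DD'} - \rho_{22'} > 0$ is possible, given suitable values for $\sigma^2_{T_1}$, $\sigma^2_{\Delta}$, $\sigma_{T_1\Delta}$, $\sigma^2_{E_1}$, $\sigma^2_{E_2}$, and $\sigma_{E_1E_2}$.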


2.7.4 Appendix D

The covariance between $D$ and $X_1$, denoted by $\sigma_{D1}$, is

$$
\sigma_{D1} = \sigma_{12} - \sigma^2_1. \tag{2.D1}
$$
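The remaining step can be sketched by substituting the observed covariance and the observed pretest variance from Equation (2.16), that is, Equations (2.B3) and (2.B4) of Appendix B, into Equation (2.D1). The true pretest variance cancels, which yields the numerator of Equation (2.22):

```latex
\begin{aligned}
\sigma_{D1}
  &= \sigma_{12} - \sigma^2_1 \\
  &= \left(\sigma_{T_1\Delta} + \sigma^2_{T_1} + \sigma_{E_1E_2}\right)
     - \left(\sigma^2_{E_1} + \sigma^2_{T_1}\right) \\
  &= \sigma_{T_1\Delta} + \sigma_{E_1E_2} - \sigma^2_{E_1}.
\end{aligned}
```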
