University of Groningen A captivating snapshot of standardized testing in early childhood Frans, Niek


A captivating snapshot of standardized testing in early childhood

Frans, Niek

DOI:

10.33612/diss.95431744

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from it. Please check the document version below.

Document Version

Publisher's PDF, also known as Version of record

Publication date: 2019

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):

Frans, N. (2019). A captivating snapshot of standardized testing in early childhood: on the stability and utility of the Cito preschool/kindergarten tests. Rijksuniversiteit Groningen.

https://doi.org/10.33612/diss.95431744

Copyright

Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).


Chapter 5

Evaluating stability in early childhood assessment-supported decision-making

This chapter is based on: Frans, N., Post, W. J., Oenema‐Mostert, C. E., & Minnaert, A. E. M. G. (2019). Evaluating stability in early childhood assessment‐supported decision‐making. Manuscript under revision.


Abstract

Given that assessment‐supported decisions are based on assumptions concerning what test scores say about future development, score stability is an important factor in decision‐making processes. The stability of test scores is defined as the expected relationship between current and future test scores. This chapter explores two distinct expectations about scores from the Cito Pupil Monitoring System: 1) the expectation that scores of all children grow at an equal rate, and 2) the expectation that individual children grow at their own rates. Each of these expectations leads to different conclusions from the same data. This chapter focuses on the following research question: “How well do predictions made under the assumption of individual growth rates perform in comparison to predictions made under the assumption of equal growth rates?” This question was investigated based on scores from mathematics and language tests administered to 911 children in the period from kindergarten through third grade. The results indicate that short‐term growth is a poor indication of subsequent growth. Consequently, stagnated scores are unlikely to identify children who are at risk of developing at an overall slower pace. Although the assumption of equal growth rates provides a better indication of future ability, wide deviations from predicted scores can be expected.


Introduction

What does it say about the child, those achievement levels? What does it say about these children? (...) I mean children make peaks [draws zigzag line on the table], one time they’re a bit higher, the other time they’re a bit lower. (Mona, 1:16)

This fragment is an excerpt from an interview with one of the preschool teachers in Chapter 2, Mona, in which she discusses use of the Cito preschool test in her classroom. This test is part of the pupil monitoring system (abbreviated in Dutch to LOVS), which is administered by more than 80% of all Dutch schools (Gelderblom et al., 2016; Veldhuis & Van den Heuvel‐Panhuizen, 2014). While individual assessment is a crucial aspect in the early identification of children who are in need of intervention (Bornstein & Lansford, 2013), the issue raised by Mona is often described as being characteristic of early‐childhood testing. For example, based on a meta‐analysis on academic assessment between preschool and second grade, La Paro and Pianta (2000) conclude that “instability or change may be the rule rather than the exception during this period” (p. 476). As a result of the rapid and discontinuous development at this age, as well as young children’s behavior within the testing situation and differences in their home context, early test scores are innately unstable (Bordignon & Lam, 2004; Nagle, 2000). The problem that Mona had experienced in her classroom is that unstable scores make it difficult to draw inferences concerning future academic functioning (Nagle, 2000). To address this concern, this chapter evaluates the stability of early mathematics and language scores by exploring a number of score expectations derived from the recommendations stated in the Cito test manual.

According to Cronbach (1971), the validity of all test‐based decisions is directly influenced by the predictions that can be made from the test scores.
When deciding between several courses of action, a prediction is made that one course of action will lead to a more satisfactory outcome than another would. The function of a test is to improve these predictions (Cronbach, 1971). Remediation efforts that are based on test results are taken under the assumption that the test provides a stable prediction of the child’s future abilities (Bracken & Walker, 1997). In this context, the term “stable” refers to a set of assumptions concerning the predictability of later test scores based on the current test scores (Wohlwill, 1973). Wohlwill specifies at least four different types of stability, which can be formulated as assumptions about how children develop relative to themselves (intra‐individual change) and others (inter‐individual change). Depending on personal and/or theoretical assumptions, expectations for future development can differ, even when the same information is provided. An observed lack of stability (as defined in this manner) can result from incorrect expectations concerning how scores will change over time, as well as from random deviations of test scores around their expected values (Rudinger & Rietz, 1998).


As an illustration of diverging personal expectations, consider a small task that was presented to Mona and five other teachers in the study described in Chapter 2 (Appendix C). Each teacher was given 10 sets of two scores from the preschool tests designed by Cito (Koerhuis, 2010; Lansink, 2009). Each set of scores reflected a child’s percentile rank at the middle and end of kindergarten. The scores were expressed in a manner that was familiar to these teachers: achievement levels of 20 percentile points. The teachers were asked to rank the 10 “children” from least to most problematic and to explain their reasoning. In accordance with her assumption of wide fluctuations within the scores of individual children, Mona ordered the pairs from low to high by averaging the two scores. In contrast, another teacher, Ina, ordered the children according to the progress that they had made between the middle and end of kindergarten. As Ina explained:

If he scored far at the lower end, I hope that he will score in a higher category later (...) Although it may still not be satisfactory (...), for me, it’s a big leap (Ina, 10:8)

While Mona was likely to have regarded differences in scores between the middle and end of kindergarten as random variations around the average level of a given child, Ina assumed that differences in the scores reflected actual growth in ability. A third teacher, Ria, ordered the pairs primarily according to the last score achieved, reasoning that children need to reach a certain level before advancing to first grade. As illustrated by this example, different teachers might have different personal assumptions about the relationship between scores over time. The scores that teachers perceive as “problematic” might thus differ according to their expectations with regard to future scores. The distinct expectations presented in the example reflect two separate recommendations made in the manual for the Cito preschool test (Koerhuis, 2010; Lansink, 2009).
For this test, teachers are instructed to explore 1) the progress of individual children between the last two tests and over a longer period, as well as 2) their achievement level relative to national norms. Based on these instructions, the manual specifically recommends identifying children who fail to show progress in their scores or who score in the lowest 20% (≤P20) on the mathematics and language tests as being at risk with regard to their development in the respective subject areas (Koerhuis, 2010; Lansink, 2009). The first recommendation reflects the expectation that these children will continue to develop according to their individual growth curves and, as such, will continue to show insufficient progress. The second recommendation reflects the expectation that the children will grow at an average rate, thereby continuing to score at low levels of ability. The first of these expectations assumes function stability, and the second assumes linear stability (Wohlwill, 1973). While predictions based on linear stability assume that a given child will continue to achieve the same rank score, predictions based on function stability are based on both the child’s scores and estimated rate of growth. To date, it remains unclear whether scores on these tests are stable enough to distinguish individual growth rates from random noise.

In Chapter 4, we explored the stability of early‐childhood test scores by evaluating the model fit of function stability and linear stability on data from 1402 children from kindergarten through third grade. According to our findings, the test scores for a majority of children could be described adequately based on the assumption of equal growth rates (linear stability). The assumption of individual growth rates (function stability) provided a significantly better description for the scores of only a small group of children with regard to the language test (12.1%) and the mathematics test (10.7%). The results further highlighted the difficulty of distinguishing structural differences in growth rates from random variation based on a small number of test administrations. Based on these results, we concluded that predictions should not be based on stagnations or increases in scores over a few test administrations, as such differences are more likely to reflect random fluctuations as opposed to actual differences in growth rates. In practice, teachers often have only a limited number of test scores to use as a foundation for prospective decisions. This is in contrast to Chapter 4, which describes growth over a four‐year period, using all available information. The current chapter explores a number of score expectations by applying the two recommendations from the test manual to scores previously obtained by individual children.
By examining which assumption of stability generates the best predictions, we evaluate the test‐manual recommendations from the perspective of the teacher, focusing on the following research question: “How well do predictions based on the assumption of individual growth rates (function stability) perform in comparison to predictions based on the assumption of equal growth rates (linear stability)?” The current chapter also examines the relative benefits of kindergarten scores to the accuracy of predictions. If these scores add information about a child’s growth curve or ability level, subsequent predictions should improve when these scores are included, as compared to predictions made without these scores. If kindergarten scores are indeed so unstable that the scores add more noise than information, however, the accuracy of the predictions should remain the same or decrease when these scores are included. This element of the chapter is formulated in the following research question: “To what extent does the inclusion of kindergarten test scores enhance the accuracy of subsequent predictions?” Finally, the current chapter examines the occurrence of stagnated scores and scores at or below the 20th percentile (≤P20) as indicators of overall lower growth rates and future ≤P20 scores, respectively.


Method

Sample

The sample consists of a subsample from the cohort of 1402 children in 59 schools throughout the Netherlands who participated in the study described in Chapter 4. The entire sample is described in detail in Appendix B. The children were tested biannually from kindergarten through third grade. Given that the current chapter aims to assess the relative benefit of kindergarten tests with regard to predictions, only children who had been tested in kindergarten were included. The first test administration in kindergarten was used, as the majority of children had been tested at that time (76% for mathematics and 77% for language). Because the previous chapter had identified bias in the older version of the kindergarten test, only children who had been tested according to the latest version were included in this study. The subsample consists of 911 children, primarily from Dutch families (93.7%) in which at least one parent had completed basic education (92.5%, at least 10 years of education; vmbo gl/tl). Given that the latest kindergarten mathematics tests were published one year later than the language tests were, the subsample of children was smaller for mathematics (n = 437) than it was for language (n = 897). The proportion of boys and girls in the sample was roughly equal (48.4% girls). A small number of children (1.5%) received special‐needs funding, and 4.9% of the children had repeated a grade at least once between kindergarten and third grade. The mean age of the children upon entering kindergarten (September 1, 2010) was 5 years and 5 months (SD = 5 months).

Instruments

The tests of the LOVS are norm‐referenced, standardized multiple‐choice tests that are typically administered biannually by the classroom teacher, either individually on the computer or in groups using a paper‐and‐pencil form.
All of the items were designed by a panel of assessment experts, teachers, and educational professionals, and all had been previously assessed according to a one‐parameter logistic model (Verhelst, Verstralen, & Eggen, 1991) on large representative samples of children in primary school. The psychometric properties of these tests have been assessed as satisfactory by an independent committee that evaluates test construction and the quality of test materials, as well as norms, reliability, and construct validity (COTAN, 2011, 2013). Reliability was assessed by determining the measurement accuracy (Verhelst, Glas, & Verstralen, 1995) of the tests, which ranged from .84 to .94. The kindergarten language instrument (Lansink & Hemker, 2012) measures several language‐comprehension and word‐recognition skills, including receptive vocabulary, comprehension of spoken language, sound and rhyme, recognition of first and last words, phonemic synthesis, and knowledge of written text. Children in grades 1–3 were assessed according to the reading‐comprehension tests (Feenstra et al., 2010), which were designed to measure a child’s comprehension of written text. The mathematics tests (Koerhuis & Keuning, 2011) were designed to measure emerging numeracy, covering performance in three categories: number sense, measurement, and geometry. The arithmetic and mathematics tests (Janssen et al., 2010) for grades 1–3 measured applied mathematics skills, including number knowledge and basic operations (addition, subtraction, multiplication, and division); ratios, fractions, and percentages; and measurement, time, and money (with the latter two being added in the second grade).

The scores are presented in two formats. The first format is an ability score that ranks both children and items along the same one‐dimensional scale. Separate scales for language and mathematics can be used to compare different tests administered in the period between first and sixth grades and to compare children’s achievement to national norms. The scale on which the kindergarten tests are scored differs from the one used for the other tests. To compare scores over time, we standardized the test scores in such a way that a score of 0 indicates the norm population average, with a standard deviation of 1. The second format is an achievement level derived from the ability scores and indicated by the Roman numerals I (80th percentile) to V (20th percentile, ≤P20). These achievement levels divide children into five groups, each with a band of 20 percentile points.

Procedure

The study draws on data collected by the pupil monitoring systems of the participating schools. Data were recovered and transferred by the schools in cooperation with the first author. Test data were collected retrospectively back to preschool from children who had started fourth grade at the time of data collection. Names, exact birthdates, and other information that could be used to identify specific schools or children were not collected, in order to guarantee the anonymity of the respondents.

Analyses

As shown in Table 5.1, four methods were used to derive predictions about subsequent scores from the available information, based on the two assumptions of stability and the amount of information used. The first two methods (i.e., last‐score and mean‐score predictions; see the upper row of Table 5.1) are based on the assumption of linear stability. The methods are reminiscent of the beliefs of Ria and Mona, respectively, and they are related to the idea that children retain their standing relative to the test norms. The two methods differ with regard to the period over which scores are expected to remain stable. Mean‐score predictions assume that all previous test scores are a reflection of the child’s true rank, while last‐score predictions assume that scores develop according to a Markovian process (i.e., all information about a future score of a given child is contained within his or her current score).


Table 5.1 Methods used to predict the next score from the information available up to this time, by stability assumption and available information.

                       Minimal information    All information
Linear stability†      Last score             Mean score
Function stability†    Last growth            Growth score

Note: †We refer the reader to Chapter 4 for an extensive discussion on these definitions. [...] represents the score for child i at time t.

The third and fourth methods (last‐growth and growth‐score; see the bottom row of Table 5.1) are based on the assumption of function stability. These predictions reflect the beliefs of Ina and the assumption that the individual growth curve of a child contains relevant information. Similar to last‐score and mean‐score predictions, last‐growth and growth‐score predictions differ with regard to the period over which scores are assumed to remain stable. Whereas growth‐score predictions assume that all previous test scores are a reflection of a child’s growth trajectory, last‐growth predictions consider only growth between the last two test administrations. An example of the mean‐score and growth‐score predictions for a single observed child is presented in the upper and lower rows of Figure 5.1, respectively. The left column displays the predictions (crosses) for the unobserved end‐of‐first‐grade test score (open point) based on the observed middle‐of‐kindergarten and middle‐of‐first‐grade scores (solid points). The error under each assumption (indicated in the upper right of each panel) is calculated by subtracting the observed score from the predicted score, with positive values indicating overestimation and negative values reflecting underestimation of the next observed score. In this example, both methods clearly underestimate the End 1 score. As indicated by the smaller error, however, the mean score provides a slightly more accurate prediction of the scores in Figure 5.1.
The second column of Figure 5.1 contains a prediction of the middle‐of‐second‐grade test score by including the observed end‐of‐first‐grade score. As indicated in this column, the growth‐score prediction underestimates the Mid 2 score, and the mean score provides an accurate prediction. These calculations were performed separately for each child and each test administration. Given that at least two observed scores are required to estimate growth, growth‐score and last‐growth predictions were available only after the middle of first grade. For each method, we assume that predictions are contained within the range of observed scores.
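The four prediction methods can be illustrated with a minimal Python sketch. The exact formulas of Table 5.1 are not spelled out in the running text, so the versions below are assumptions: test administrations are treated as equally spaced, the last‐growth prediction extrapolates the most recent difference, and the growth‐score prediction extrapolates an ordinary least‐squares line through all previous scores. The function names and the example scores are hypothetical.

```python
from statistics import mean


def last_score(scores):
    """Linear stability, minimal information: the next z-score equals the last one."""
    return scores[-1]


def mean_score(scores):
    """Linear stability, all information: the next z-score equals the mean so far."""
    return mean(scores)


def last_growth(scores):
    """Function stability, minimal information (assumed form):
    extrapolate the difference between the last two scores one step ahead."""
    return scores[-1] + (scores[-1] - scores[-2])


def growth_score(scores):
    """Function stability, all information (assumed form):
    fit a least-squares line through all previous scores, indexed 0..n-1,
    and read off its value at index n."""
    n = len(scores)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(scores)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, scores))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return y_bar + slope * (n - x_bar)


def clip(prediction, lowest, highest):
    """Restrict a prediction to the range of scores observed in the dataset."""
    return max(lowest, min(highest, prediction))


# Hypothetical standardized scores for one child at three administrations.
child = [-1.0, -0.5, 0.0]
predictions = {
    "last score": last_score(child),
    "mean score": mean_score(child),
    "last growth": last_growth(child),
    "growth score": growth_score(child),
}
```

For a child whose scores rise steadily, as in this hypothetical case, the two function‐stability methods predict further growth, whereas the two linear‐stability methods predict that the child stays at or below the current level.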


Figure 5.1. Illustration of predictions and errors made according to mean scores (upper panels) and growth scores (lower panels) for the same child at different test administrations. The available scores are indicated by solid points, the predictions based on these scores by crosses, and the next observed score by an open point.

Predictions outside of this range are restricted to the minimum or maximum observed score in the dataset. In addition, in keeping with our goal of arriving at plausible predictions from the teacher’s perspective, missing data are treated as missing throughout the entire study, and predictions are based on the data that are at the teacher’s disposal. After calculating errors for each method, child, and time point, the average prediction bias (mean error) and deviations (standard deviation of the error) of the different methods were compared. The mean absolute error was then calculated separately for each child (MAEi). Whereas the relative errors describe bias and relative deviation from the observed score, the MAEi provides a representation of the average absolute deviation from the observed values for each child according to the various prediction methods (Willmott & Matsuura, 2005). In the example presented in Figure 5.1, the MAEi is (0.95 + 0.41 + 0.47)/3 = 0.61 for the growth‐score predictions and (0.76 + 0.00 + 0.73)/3 = 0.50 for the mean‐score predictions. Given that the MAEi for each method has a folded normal distribution (e.g., Tsagris, Beneki, & Hassani, 2014) with only positive values, the median MAEi was used to compare different methods. The MAEi values with and without the kindergarten test were compared to examine the relative benefit of including kindergarten scores for purposes of prediction. Furthermore, the methods were compared for children who had been identified in the previous chapter as having extreme growth rates. Finally, the recommended identification criteria in the test manual were explored in two ways: first, by comparing the frequency of score stagnations between children with extreme growth rates and the rest of the sample; and second, by comparing the estimated probability of a score below the 20th percentile for children who had scored below the 20th percentile in kindergarten to the same probability for children who had obtained a higher score in kindergarten.
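The error measures described above can be sketched in a few lines of Python. The function names are hypothetical, and because the text reports only the sizes of the errors in the Figure 5.1 example, the sketch uses their absolute values. The per‐child values in the last step are invented purely to show how the median is taken across children.

```python
from statistics import median


def prediction_error(predicted, observed):
    """Error as defined in the text: predicted minus observed.
    Positive values indicate overestimation, negative values underestimation."""
    return predicted - observed


def mean_absolute_error(errors):
    """MAE for one child: the average size of that child's prediction errors."""
    return sum(abs(e) for e in errors) / len(errors)


# Absolute errors from the Figure 5.1 example (three predictions for one child).
growth_score_errors = [0.95, 0.41, 0.47]
mean_score_errors = [0.76, 0.00, 0.73]

mae_growth = mean_absolute_error(growth_score_errors)  # about 0.61
mae_mean = mean_absolute_error(mean_score_errors)      # about 0.50

# Methods are compared by the median MAE across children.
# The second and third values here are hypothetical, for illustration only.
median_mae = median([mae_growth, 0.50, 0.30])
```

The median is used rather than the mean because the per‐child MAE values are bounded at zero and right‐skewed, as noted in the text.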


Results

The number of observations at each test administration for mathematics and language are presented in Table 5.2, along with the means and standard deviations of the standardized scores. In all, the kindergarten mathematics test was administered to 437 children, and the kindergarten language test was administered to 897 children. No more than 3.2% of the subsequent mathematics observations were missing. The language tests were administered less consistently, with 2.8% to 24.5% of subsequent observations missing. Nevertheless, 85.6% of the children were tested on at least four occasions.

Table 5.2 Number of observations at each test administration, along with mean and standard deviation for the standardized scores, proportion of stagnated scores, and proportion of scores below the 20th percentile.

         Mathematics                                            Language
         N     Z score (SD)   Prop. stagnate   Prop. < P20     N     Z score (SD)   Prop. stagnate   Prop. < P20
Mid K    437   0.19 (1.02)    .                .14             897   0.48 (1.10)    .                .07
Mid 1    436   0.26 (1.11)    .                .13             .     .              .                .
End 1    436   0.20 (1.10)    .16              .17             677   0.33 (1.04)    .                .11
Mid 2    434   0.25 (1.08)    .21              .13             736   0.22 (1.10)    .26              .15
End 2    434   0.27 (1.08)    .08              .13             803   0.22 (1.13)    .32              .17
Mid 3    423   0.24 (1.07)    .15              .15             872   0.12 (1.12)    .31              .19
End 3    425   0.24 (1.08)    .19              .13             .     .              .                .

Note: There are no language instruments for the middle of first grade and the end of third grade. Stagnating scores can be identified only after the middle and end of first grade for language and mathematics, respectively.

The standardized scores presented in Table 5.2 indicate how children scored relative to the population. The norm population average was set to 0, and the population standard deviation was set to 1. As indicated by the positive mean scores, the children in this sample scored slightly above average on all tests. This was particularly true for the kindergarten language test, as reflected in the lower than expected proportions scoring below the 20th percentile. Finally, the table shows the proportion of children who had stagnated in or declined from their original ability scores (without standardization). In other words, it identifies children with ability scores either less than or equal to their previous ability scores. Such stagnation in scores is evidently quite common, as an average of 16% and 30% of the scores reflected stagnation at any given time for mathematics and language, respectively.

Table 5.3 Mean (SD) of prediction errors and median of the Mean Absolute Error for all children after the middle of first grade for mathematics and language, split by prediction method.

                Mathematics                        Language
                Error mean (SD)   Median MAEi      Error mean (SD)   Median MAEi
Last score      0.009 (0.743)     0.502            0.078 (0.881)     0.605
Last growth     0.007 (1.194)     0.798            0.007 (1.260)     0.810
Mean score      –0.004 (0.682)    0.473            0.183 (0.834)     0.588
Growth score    0.032 (0.877)     0.602            –0.003 (0.985)    0.675

Figure 5.2. Plot of prediction errors (y‐axis) for mathematics (upper row) and language (lower row), by method (columns) and test administration (x‐axis).

The average errors of predictions for each of the four methods are displayed in Table 5.3. As indicated in the table, the bias was small for most predictions. Nevertheless, language predictions based on linear stability (last score and mean score) generally overestimated a child’s next score. This was particularly the case with mean‐score predictions. This is a logical consequence of the overall higher scores on the kindergarten language test. As demonstrated in Figure 5.2, this bias leads to an overestimation of the first‐grade score based on the last‐score prediction method, and it influences all subsequent mean‐score predictions. As is also shown in the figure, the errors of most methods tend to decrease over time. This finding is in line with the expectation that more information enhances the accuracy of predictions, and that scores tend to be more stable at later ages. The variance of predictions is lowest for mean‐score and last‐score predictions and highest for last‐growth predictions. The median MAEi values for last‐growth predictions are also the highest by far, meaning that this method generally leads to the most inaccurate predictions. The best overall predictions (lowest MAEi values) are made according to the last‐score and mean‐score estimates. Expressed in percentile scores, the best prediction of the next mathematics score (mean‐score prediction) had a median MAEi of 13 percentile points. This result means that, for 50% of the children in this sample, the predictions were off by more than 13 percentile points on average. The last‐score method performed equally well and, for language, the predictions generated by the best methods were off by a median of 16 percentile points. The median MAEi values for the predictions of mathematics scores were 20 percentile points (last‐growth method) and 15 percentile points (growth‐score method), with values of 20 and 18 percentile points, respectively, for language‐score predictions.

Table 5.4 Median MAEi (MAD) values for the various methods, compared for predictions with and without the kindergarten tests, as well as for predictions for the linear and function groups identified in Chapter 4.

                        Without kindergarten*   With kindergarten*   Average growth   Distinct growth
Math predictions                                                     (n = 384)        (n = 53)
Last                                                                 0.50 (0.244)     0.52 (0.231)
Last growth                                                          0.80 (0.417)     0.79 (0.380)
Mean                    0.45 (0.212)            0.45 (0.216)         0.44 (0.187)     0.73 (0.187)
Growth                  0.67 (0.308)            0.57 (0.271)         0.61 (0.266)     0.56 (0.235)
Language predictions                                                 (n = 783)        (n = 114)
Last                                                                 0.59 (0.348)     0.64 (0.288)
Last growth                                                          0.82 (0.564)     0.77 (0.447)
Mean                    0.55 (0.367)            0.53 (0.350)         0.53 (0.298)     0.99 (0.356)
Growth                  0.66 (0.417)            0.86 (0.589)         0.69 (0.366)     0.57 (0.402)

Note: * These MAEi values were calculated for predictions after the end of first grade (mathematics) and middle of second grade (language), in order to obtain a fair comparison.

As shown in the first two columns of Table 5.4, the exclusion of the kindergarten tests from the mean‐score predictions had hardly any influence on the accuracy of predictions.
The mean MAEi values for both mathematics and language were virtually the same with and without the kindergarten scores. For the growth‐score predictions, however, the exclusion of the kindergarten test scores did affect predictions. While the predictions of mathematics scores became less accurate when the kindergarten test was excluded, the reverse was true for language‐score predictions, due to the biased predictions derived from the inflated kindergarten language scores. In Chapter 4 of this dissertation, we identified a group of children with a growth rate in mathematics (n = 169) and language (n = 150) that was significantly distinct from that of other children. For this group, the growth‐score predictions were slightly more accurate in comparison to those for other children in the sample, as demonstrated by the last two columns of Table 5.4. Moreover, the mean‐score predictions for this group were far less accurate in comparison to those of other children in the sample. For the majority of children in this group, however, the best predictions

(14)

5

of the next mathematics score were based on the last achieved score (lowest MAEi for 55% of children). For language, the last‐score predictions (lowest MAEi for 31%) performed closest to the growth‐score predictions (lowest MAEi for 53%).

Up to this point, we have examined the general assumptions under which the next test score may be predicted. Given that the manual specifically recommends identifying children who have either stagnated or scored in the lowest 20%, we explored the occurrence of these particular events in greater detail. Children were more likely to have had a stagnated score at least once within the period studied than they were to show no stagnation at all. As shown in Table 5.5, 60% of the children in this sample exhibited at least one stagnation in their mathematics scores, and 66% experienced stagnation in their language scores. In accordance with expectations, children with an overall extreme negative growth rate were more likely to stagnate than those with more positive growth rates. The difference in the incidence of stagnation was nevertheless too small to warrant using stagnation as a criterion for identifying children with lower than expected growth rates. The overall probability that a child would have an extreme negative growth rate was .05 for mathematics and .09 for language. A single observed stagnation changed this probability to .06 for mathematics, while leaving the probability for language unchanged at .09. Although three observed stagnations did drastically increase the probability that a child would have an extreme negative growth rate (.33 for mathematics and .17 for language), the number of children on which these probabilities were based (three and six, respectively) is too small to warrant any definitive claims.

Table 5.5 Frequency of stagnations in mathematics and language scores among children with average, extreme negative and extreme positive expected growth rates.

| Number of stagnations | Mathematics: Total | Mathematics: Average growth | Mathematics: Negative growth | Mathematics: Positive growth | Language: Total | Language: Average growth | Language: Negative growth | Language: Positive growth |
|---|---|---|---|---|---|---|---|---|
| 0 | 175 | 151 | 7 | 17 | 305 | 271 | 21 | 13 |
| 1 | 187 | 161 | 11 | 15 | 467 | 406 | 41 | 20 |
| 2 | 72 | 70 | 2 | 0 | 119 | 101 | 17 | 1 |
| 3 | 3 | 2 | 1 | 0 | 6 | 5 | 1 | 0 |
| N | 437 | 384 | 21 | 32 | 897 | 783 | 80 | 34 |

To evaluate a score in the lowest 20% (≤P20) as an identifier, we can consider the estimated probability of remaining in this score category. Given a child who has scored ≤P20 at any time, the probability of the next score also being ≤P20 would be .625 for mathematics and .421 for language. Of the 60 children who scored ≤P20 on the kindergarten mathematics test, 25.0% scored ≤P20 on every subsequent test, while 18.3% never scored ≤P20 again. For the 59 children scoring ≤P20 on the kindergarten language test, these percentages were 13.8% and 37.9%, respectively. According to these results, a score ≤P20 on the kindergarten mathematics test was a better indicator of subsequent scores ≤P20 than was a comparable score on the kindergarten language test. Given a score of ≤P20 in kindergarten, the estimated probability that any subsequent score would be ≤P20 was .514 for mathematics and .392 for language. If the kindergarten score was >P20, the estimated probability that any subsequent score would be ≤P20 was drastically lower: .083 for mathematics and .141 for language. Although scores in kindergarten can thus serve as indicators of subsequent scores ≤P20, they are far from being a flawless measure.
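The probabilities quoted around Table 5.5 can be recomputed directly from the cell counts. The following is a minimal sketch in Python, assuming the counts are transcribed as in the table; the names `TABLE_5_5`, `p_any_stagnation` and `p_negative` are illustrative and do not come from the original analysis.

```python
# Counts transcribed from Table 5.5. Keys are the number of observed stagnations;
# values are (average growth, extreme negative growth, extreme positive growth).
TABLE_5_5 = {
    "mathematics": {0: (151, 7, 17), 1: (161, 11, 15), 2: (70, 2, 0), 3: (2, 1, 0)},
    "language": {0: (271, 21, 13), 1: (406, 41, 20), 2: (101, 17, 1), 3: (5, 1, 0)},
}


def p_any_stagnation(counts):
    """Share of children with at least one stagnation."""
    total = sum(sum(row) for row in counts.values())
    return 1 - sum(counts[0]) / total


def p_negative(counts, stagnations=None):
    """P(extreme negative growth), optionally given an exact number of stagnations."""
    rows = list(counts.values()) if stagnations is None else [counts[stagnations]]
    negative = sum(row[1] for row in rows)
    total = sum(sum(row) for row in rows)
    return negative / total


for subject, counts in TABLE_5_5.items():
    print(
        subject,
        round(p_any_stagnation(counts), 2),  # at least one stagnation
        round(p_negative(counts), 2),        # overall probability
        round(p_negative(counts, 1), 2),     # given exactly one stagnation
        round(p_negative(counts, 3), 2),     # given exactly three stagnations
    )
```

With only three (mathematics) and six (language) children in the three-stagnation row, the last conditional probability rests on a single child in each subject, which is why it should not be over-interpreted.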

Discussion

The current chapter examines several predictions of test scores, as derived from differing assumptions about the stability of these scores, taking into account differing amounts of information. To approach the problem from the perspective of the teacher, this chapter explores prospective predictions from the available data. By using only the information that is available to teachers who use these tests in their decision‐making processes, the study addresses questions that these teachers face on a daily basis. Examples of such questions include: “Should I be concerned if a child’s score is below the 20th percentile?” “Should I be worried if a child’s scores stagnate?” “If a child has a low score but has grown considerably since the previous score, does this mean that the child will continue growing at this rate?” Although conclusive answers would need to consider many other factors, the results from this chapter provide empirical evidence that may help to resolve these questions. In line with the findings in Chapter 4, the results indicate that considering the growth of individual children generally does not enhance the accuracy of predictions of the next score. Moreover, the growth taking place between the last two test administrations is generally highly inaccurate as a predictor of subsequent growth. While the results in Chapter 4 showed that the assumption of individual growth curves does not significantly improve the description of the data, these results indicate that, in terms of prediction accuracy, this assumption actually performs worse than the assumption that a given child will grow according to the average growth rate. In fact, the last score obtained provides an indication of the following score that is nearly as good as (and often better than) predictions based on other more complex assumptions.
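To make the comparison of prediction rules concrete, the sketch below scores three simple rules prospectively, using only the scores available before each test, and summarizes each by its mean absolute error for one child. It is an illustration under stated assumptions rather than the chapter's actual models: the function names are hypothetical, the growth rule simply repeats the change between the last two administrations, and the score series is invented.

```python
def predict_last(history):
    """Last-score rule: the next score equals the most recent score."""
    return history[-1]


def predict_mean(history):
    """Mean-score rule: the next score equals the mean of all previous scores."""
    return sum(history) / len(history)


def predict_growth(history):
    """Growth rule (assumption): repeat the change between the last two scores."""
    if len(history) < 2:
        return history[-1]
    return history[-1] + (history[-1] - history[-2])


def mean_absolute_error(scores, rule, first_target=2):
    """Prospective MAE: each score is predicted from the scores before it only."""
    errors = [
        abs(scores[t] - rule(scores[:t]))
        for t in range(first_target, len(scores))
    ]
    return sum(errors) / len(errors)


# Hypothetical percentile scores of one child on five consecutive tests.
scores = [40, 55, 45, 50, 60]

for name, rule in [("last", predict_last), ("mean", predict_mean), ("growth", predict_growth)]:
    print(name, round(mean_absolute_error(scores, rule), 2))
```

The invented series only demonstrates the mechanics; it is not evidence for or against any of the rules. Computing one error per child in this way is what allows the rules to be compared child by child, as in the percentages of children for whom each rule gave the lowest error.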
Over a longer period, using the mean of the previous scores as an indication of a child’s ability level produces predictions that are slightly more accurate than predictions based only on the last score, although the difference between the two approaches is small. Although the last score and mean score are the most accurate indicators of the next observed score, teachers should not be surprised if actual mathematics and language scores differ from the expected scores by as much as 13 and 16 percentile points, respectively. For about 50% of the children examined in this chapter, the average prediction error was of this magnitude or smaller. Including the kindergarten test in mean‐score predictions does not affect the accuracy of these predictions. Although a small group of children have a substantially different growth rate (Chapter 4), the last score obtained continues to provide the best overall indication of the next score for many of these children (55% for mathematics and 31% for language). This conclusion corresponds to our earlier finding in Chapter 4 that structural changes in growth are difficult to detect over a short period. It further emphasizes the fact that short‐term growth is a poor indicator of subsequent growth. As such, short‐term growth does not provide a solid basis for identifying children who are at risk. Similarly, although stagnating scores occur more frequently in children with overall lower growth rates, they provide only a weak indicator of generally slower growth.
As a predictor of scores below the 20th percentile, the second recommendation (to identify children scoring below this cutoff) provides a somewhat consistent indicator. For mathematics, a child who has scored below the 20th percentile in kindergarten is far more likely to score below the 20th percentile in later grades than a child who has not, although a considerable proportion of children (18.3%) obtained higher scores at all other mathematics test administrations. Scores on the kindergarten language test tended to be substantially higher than any subsequent test scores. Only 13.8% of children in the current sample who scored below the 20th percentile on the kindergarten language test remained in this score category, as compared to 25.0% on the mathematics test. Compared to the kindergarten mathematics test, the kindergarten language test is clearly a less reliable indicator of later scores ≤P20. This result is in line with findings from other studies (e.g., Duncan et al., 2007), which report a stronger relationship between emergent mathematics skills and later mathematics achievement than between emergent and later language skills.
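The persistence figures for scores ≤P20 can be understood as simple proportions over each child's sequence of test results. The sketch below shows one way such proportions might be computed; it is an assumption-laden illustration, not the estimation procedure used in this chapter. The function names and the six flag sequences are hypothetical, with `True` meaning a score ≤P20 on that test and the first entry standing for the kindergarten test.

```python
def persistence_after_first(flags_per_child):
    """Among children flagged on the first test, return the share flagged on
    every later test and the share never flagged again."""
    flagged_first = [flags for flags in flags_per_child if flags[0]]
    always = sum(all(flags[1:]) for flags in flagged_first)
    never = sum(not any(flags[1:]) for flags in flagged_first)
    n = len(flagged_first)
    return always / n, never / n


def p_next_flagged(flags_per_child):
    """P(next score flagged | current score flagged), pooled over all
    pairs of consecutive tests."""
    pairs = [
        (current, following)
        for flags in flags_per_child
        for current, following in zip(flags, flags[1:])
    ]
    after_flag = [following for current, following in pairs if current]
    return sum(after_flag) / len(after_flag)


# Hypothetical data: one list of <=P20 flags per child, in test order.
children = [
    [True, True, True, True],
    [True, False, False, False],
    [True, True, False, True],
    [True, False, True, False],
    [False, False, True, False],
    [False, False, False, False],
]

always, never = persistence_after_first(children)
print(always, never)
print(round(p_next_flagged(children), 3))
```

Pooling over consecutive test pairs treats every transition alike; a model that accounts for the timing of tests or for measurement error could give different estimates.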
The findings of this chapter are subject to several limitations. First, while this study aimed at providing realistic estimates from the perspective of the practitioner, more accurate predictions can presumably be made with more complex models. Although this chapter compares several simple score interpretations, it is important to note that teachers may consider other information (e.g., from their own observations) that was not addressed in this chapter. As such, the results of this chapter should be seen as a practical evaluation of one possible source of information. In addition, it is important to note that each test score has a specific measurement error, which was not taken into account. Although this means that some apparent stagnations may have been the result of measurement error and some small growth rates might actually have been stagnating scores, the observed score provides the best available estimate.
While rank scores do provide an indication of future rank scores, the use of such scores as a criterion for identifying at‐risk children is inherently arbitrary. Given that test norms are designed in such a way that 20% of children are expected to score within the category identified in the test manual as “at‐risk,” scores in the lowest 20% do not necessarily reflect problematic development, as they are an inevitable result of the test design. Conversely, the use of these tests with the goal of moving children out of this “at‐risk” category inevitably leads to the inflation of norms, as has already been observed in the kindergarten language tests and older versions of the LOVS tests (Keuning et al., 2014). These inflated scores are unlikely to have been caused by sampling bias, as the bias is situated primarily in the older tests.
The results of this chapter indicate that basing predictions on individual growth in scores over a short period is likely to generate inaccurate conclusions. Although these instruments do provide an indication of future performance, they do not provide a definite criterion for identifying children as being developmentally at risk. The inclusion of evidence such as the results presented in this chapter in professional development for current and future teachers could encourage a much‐needed critical appraisal of the limitations of interpreting test scores.