Educational Research and Evaluation

An International Journal on Theory and Practice

ISSN: 1380-3611 (Print) 1744-4187 (Online) Journal homepage: http://www.tandfonline.com/loi/nere20

The perspective of “limited malleability” in educational effectiveness: treatment effects in schooling

Jaap Scheerens

To cite this article: Jaap Scheerens (2018): The perspective of “limited malleability” in educational effectiveness: treatment effects in schooling, Educational Research and Evaluation, DOI: 10.1080/13803611.2017.1455286

To link to this article: https://doi.org/10.1080/13803611.2017.1455286


Published online: 02 Apr 2018.


The perspective of “limited malleability” in educational effectiveness: treatment effects in schooling

Jaap Scheerens

University of Twente, Enschede, The Netherlands

ABSTRACT

In this article, several ways to adjust gross school effects are discussed to set the stage for estimating treatment effects in schooling. Although it is quite hazardous to hypothesize realistic benchmarks for results from meta-analyses, because of the dependency of effect sizes on subject matter area, grade level, and study characteristics, a treatment effect of d = .20 can be considered as educationally meaningful. Commonly addressed effectiveness-enhancing school factors show much inconsistency across meta-analyses and frequently small effects (in the order of d = .10). An analysis of the conceptual and methodological foundations of defining and measuring treatment effects in schooling indicates that flaws in the measurement of treatments could be an additional cause of small effect sizes in educational effectiveness research. Results of this review are analysed from the perspective of “limited malleability” in educational effectiveness.

KEYWORDS

Educational effectiveness; effect sizes; absolute and relative effects of schooling

Introduction

The effects of students’ country, educational jurisdiction, school, teacher, and educational programmes can be considered treatment effects. But since the influence of being assigned to a certain unit would be global and unspecified, it is not normally called a treatment. Treatment effects are specific aspects of schooling and teaching, or educational programmes, that may be associated with, or intended to improve, student outcomes. Treatment effects include such diverse variables as student–staff ratios, school resources, administrative arrangements, teacher qualifications, and experience, in addition to specific educational interventions or programmes. Educational research often separates hypothetical causes of performance differences into “given”, “contextual”, “endogenous”, or simply “prior” conditions, on the one hand, and malleable factors, or treatments, on the other hand. This distinction corresponds to the difference between “gross” and “net” or “value-added” school effects. In this article, the first two interpretations (gross and net school effects) are referred to as “unit effects”, and the term “treatment” will be reserved for the influence of malleable factors. Treatments defined in this way include both intervention programmes and policies and practices hypothesized (or believed) to enhance educational performance.

© 2018 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited, and is not altered, transformed, or built upon in any way.

CONTACT Jaap Scheerens j.scheerens@utwente.nl University of Twente, Enschede, The Netherlands https://doi.org/10.1080/13803611.2017.1455286


In the most commonly used models to estimate “net” or “value-added” school effects or treatments, school effects or treatments are residual effects after adjusting for prior conditions such as students’ socioeconomic status (SES), intelligence, prior achievement, aptitude, other student characteristics, and sometimes school factors. Generally, adjustment factors have stronger impacts on student performance than malleable factors (e.g., Organisation for Economic Co-operation and Development [OECD], 2005; Timmermans, 2012). Several contributions in this special issue provide confirming evidence at both the school level and the level of national education systems. At the school level, the most frequently applied value-added models, which show strong effects for prior achievement and general intelligence, leave limited scope for treatments to improve student performance. The effect size for schools, teachers, and programmes will necessarily be small given that most of the variation is accounted for by the variables indexing initial conditions. This perspective is challenged by value-added models based on progress in educational achievement, which indicate large and very large school effects (Raudenbush, 1989; Timmermans and Van der Werf, this special issue).

Given the importance of how “net” school effects are measured, Part 1 of this paper presents an overview of the most often-used value-added models, two of which are based on adjusted student performance status and two on progress. In the second part, various facets of the “state of the art” of establishing treatment effects in educational effectiveness research are discussed, which have implications for interpreting substantive results from meta-analyses. In the concluding part, implications for educational policy and practice and further research are discussed, and a resume is given of the way the four research papers in this special issue speak to the theme of limited malleability.

Part 1: gross, net, relative, and absolute school effects

In this part, distinctions are made between gross and net, and relative and absolute school effects. Relative gross school effects are the share of total variation in student achievement attributed to schools without adjustments. “Net” school effects represent the share of variation attributable to schools after adjusting for prior achievement and other student background characteristics. Approaches that try to separate the effect of a certain period of going to school, in comparison to not going to school, attempt to estimate effects of schooling that do not depend on between-school variance. This is why these effects are considered absolute rather than relative. Examples are the assessment of the effect of summer learning and comparing the achievement of same-age children that attend a higher or a lower grade.

In Part 2 of the paper, school treatment effects are discussed. School treatment effects are the effects of modelled “treatments”, such as programmes or a combination of effectiveness-enhancing factors, after adjusting for prior differences between students.

Empirical estimates of gross school effects by means of variance decomposition

The most common way to express the contribution of schools is to express the between-school variation as a percentage of the total variation in student achievement. This corresponds to the intra-class correlation (r or rho), which is a proportion rather than a percentage. As well as the proportion of the variation in student performance attributable to schools, the intra-class correlation has the useful interpretation of the expected correlation in performance of two students randomly selected from the same school (Snijders & Bosker, 2012, p. 16). According to Scheerens and Bosker’s (1997, p. 79) meta-analysis, without considering other factors, schools, on average, account for 9% of the variation in student achievement; an intra-class correlation of 0.09. This percentage varies by grade level and subject matter. Between-school differences tend to be larger for mathematics and science than for reading literacy. It should be noted that between-school differences expressed as percentages of variance or the corresponding intra-class correlations are relative measures of educational output quality.
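As a worked illustration of this variance decomposition, the sketch below fits an intercept-only (“null”) multilevel model and computes the intra-class correlation as the between-school variance divided by the total variance. It is a minimal sketch under assumed data, not a reproduction of any of the cited analyses; the file and column names (students.csv, score, school) are hypothetical.

```python
# Minimal sketch: share of achievement variance lying between schools (the intra-class
# correlation), estimated from an intercept-only multilevel model.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")                    # hypothetical columns: score, school
null_model = smf.mixedlm("score ~ 1", df, groups=df["school"]).fit()

between_school = null_model.cov_re.iloc[0, 0]       # variance of the school intercepts
within_school = null_model.scale                    # residual (student-level) variance
icc = between_school / (between_school + within_school)
print(f"Intra-class correlation (school level): {icc:.2f}")
```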

Intra-class correlations can also be expressed as the most commonly used effect size, Cohen’s d (Tymms, 2004). Cohen’s d effect size is about twice the correlation coefficient. Often, it is not clear which measure is being used in educational effectiveness studies.
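For reference, one standard conversion between a correlation-type measure r and Cohen’s d (a textbook formula, not necessarily the one Tymms, 2004, proposes) is

\[ d = \frac{2r}{\sqrt{1 - r^{2}}}, \]

which reduces to d ≈ 2r when r is small, in line with the rule of thumb that d is roughly twice the correlation coefficient.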

Relative measures of school effects can also be used to assess the relative contribution of schools versus other aggregated units, such as countries, jurisdictions, classes, and teachers. School differences tend to be weaker than differences between classrooms and teachers. Kyriakides and Luyten (2009) reported percentages of variance in student achievement in Cyprus attributed to students at 70%, 18% for classrooms, and 13% for schools. Opdenakker and Van Damme (2000) found that 13% of the variance was between schools, 18% between teachers, 15% between classes, and 54% between students. They point to the fact that omitting intermediary levels (classes, teachers) in multilevel analyses tends to overestimate the effect of the next level above, in this case schools. Hattie (2009, p. 18) provided the following comparative effect sizes (d coefficients) for students, teachers, and schools: student 0.44, teacher 0.49, and school 0.23. He concluded that teachers/classrooms matter more than schools. This pattern is usually found in other studies. International studies include a variance component (or an effect size) for the country level. The OECD (2005) presented a decomposition of total between-student variation, based on Programme for International Student Assessment (PISA) 2000 data, as shown in Table 1.

The total number of OECD countries in PISA 2000 was 27. The overall pattern Table 1 shows is that a sizeable amount of variation lies between countries, although less than that between schools.

In a study comparing 11 Australian and British educational systems in math achievement progress, Tymms, Merrell, and Wildy (2005) found more variance associated with the systems (13%) than with the classroom (7%). However, the classroom and the year group, taken together, were associated with more variance (20%) than the system. On the basis of data from the Trends in International Mathematics and Science Study (TIMSS), Kyriakides (2006) found that 20% of the variance was associated with countries.

Table 1. Percentage of variance in student performance in reading and mathematical and scientific literacy in OECD countries. Results from PISA 2000, cited from OECD, 2005, p. 116.

              Percentage at country level   Percentage at school level   Percentage at student level
Reading                    8                            15                            57
Mathematics               16                            31                            54


Relative “net” school effects in terms of adjusted student performance status

As noted above, between-school differences and intra-class correlations concern relative differences between schools. These summary statistics do not measure the effect of schooling but rather the relative variation between schools (Scheerens, 2007, Chapter 5).

The variance decomposition methods to estimate relative school effects are more realistic if adjustments are made by including a baseline that accounts for initial differences between students. Preferably, initial differences between students are accounted for by adjusting for same-domain prior achievement, but many other variables can be, and are, used to adjust for initial differences between students, such as socioeconomic background, aptitude, and general intelligence. When adjusting for prior achievement, the percentage of between-school variation of total variation and the intra-class correlations can be interpreted as the extent to which schools matter. Scheerens and Bosker’s (1997) meta-analysis, which included studies that used varying types of adjustment, found that the average gross school effects declined from 9% to 4%. There is a variety of adjustment models:

• “Cross-sectional” adjustment for student background conditions, such as aptitude, socioeconomic status (SES), gender, ethnicity status, and other student variables.
• Models estimating school effects net of same-domain prior achievement.
• Adjustment for prior achievement as well as other student background characteristics. This approach is often denoted as “contextualized value added” when classroom or school composition variables, such as average SES, are included (Thomas, Gana, & Muñoz Chereau, 2016).

Controlling for same-subject prior achievement provides the simplest value-added interpretation: the contribution of the school net of students’ prior subject knowledge and skills. More complex models control for prior achievement scores in other subject areas. The Education Value-Added Assessment System (EVAAS) model analyses students’ scores in all subjects, controlling for students’ prior test scores in all subjects across all grades and years (Wright, White, Sanders, & Rivers, 2010).
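As a concrete, deliberately simplified illustration of a value-added model, the sketch below regresses achievement on same-domain prior achievement and SES, with a random intercept per school; the estimated school intercepts are then the “net” school effects. This is a minimal sketch, not the EVAAS model or any specific study’s specification, and the data file and column names are hypothetical.

```python
# Minimal value-added sketch: school effects as random intercepts after adjusting
# for prior achievement and SES. Not the EVAAS specification.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")     # hypothetical columns: score, prior_score, ses, school
model = smf.mixedlm("score ~ prior_score + ses", df, groups=df["school"])
fit = model.fit()

# Empirical Bayes estimates of the school intercepts: the "value-added" school effects.
school_effects = {school: re.iloc[0] for school, re in fit.random_effects.items()}
print(sorted(school_effects.items(), key=lambda kv: kv[1], reverse=True)[:5])   # top 5 schools
```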

In one contribution to this special issue, the sole use of socioeconomic status of students as an adjustment variable is criticized (Marks, this issue). Adjusting for prior achievement is a more justifiable measure but, failing the availability of prior achievement, adjusting for students’ intelligence or general aptitude is justifiable. The stronger the correlations between the adjustment variables and student achievement, the smaller the value-added school effects (Marks, 2015).

In a review article, Detterman (2016) concludes:

Over the last 50 years in developed countries, evidence has accumulated that only about 10% of school achievement can be attributed to schools and teachers while the remaining 90% is due to characteristics associated with students. Teachers account for from 1% to 7% of total variance at every level of education. For students, intelligence accounts for much of the 90% of variance associated with learning gains. (p. 1)

Value-added school effects tend to be small, and they are also not particularly stable. Thomas, Peng, and Gray (2007) analysed school data over a period of 11 years in the English Lancashire district. They concluded that there was some stability in school effects. Still, when schools were categorized as average or over- or underachieving, there were many switches, and over a period of 11 years 50% of the schools had changed category. Contrary to expectations, high value-added schools did not continuously add value over the medium term. Even less stability was found in a Dutch study, where it appeared that of the highest scoring secondary schools, only 15% were still in the top category 3 years afterwards (Vermeer & Van der Steeg, 2011). Marks (2015) reported cohort correlations of school effects in five domains ranging from a very low 0.10 to 0.30 for primary schools and from 0.16 to 0.50 for secondary schools. The lack of stability undermines the argument that once schools that substantially add value are identified, they can serve as a model to improve other schools.

School effects in terms of student progress

In the gross and value-added school effects models discussed above, the dependent variable is student achievement. When student achievement and prior achievement are measured on the same scale, gain scores can be calculated as well as students’ predicted achievement, and these can be used as the criterion variable. Gain scores can be compared to simply adjusting for prior achievement. Based on work by Raudenbush (1989), multilevel random effects models estimate individual students’ growth curves with high levels of reliability. Various studies show substantially larger effect sizes for growth than those obtained from adjusted student performance levels: Rowan, Correnti, and Miller (2002) reported Cohen’s d effect sizes of .53 for reading and .51 for mathematics in growth models. Guldemond and Bosker (2009) found intra-class correlations of 0.3 to 0.5 in growth models. Dumay, Coe, and Anumendem (2014) found that, on average, across cohorts, 74% of the total slope variance was accounted for by the school-level slope variance, whereas residual gain scores analysis showed a proportion attributable to schools of just 16% on average, across cohorts. Anumendem, De Fraine, Onghena, and Van Damme (2017) showed considerable discrepancies between intra-class correlations expressed as adjusted performance status and growth; according to one of their models, this discrepancy is as high as 0.18 for performance versus 0.66 for the growth model. Palardy (2008) reported more modest discrepancies in the intra-class correlations for performance status and growth (.20 versus .23 in the full sample, .13 versus .25 and .07 versus .14 among low- and high-scoring schools, respectively). School effects obtained from growth models are only weakly associated with students’ socioeconomic status (Dumay et al., 2014; Guldemond & Bosker, 2009). Moreover, the school effects obtained from growth models were less stable than school effects obtained by adjusting for initial performance (Dumay et al., 2014). Large school effects were also found in the growth models analysed by Timmermans and Van der Werf in this special issue.
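To make the contrast with the status models concrete, the sketch below fits a simple growth-curve model on long-format data (one row per student per measurement wave) with a random intercept and a random slope on the wave indicator at the school level; the school-level slope variance is the quantity that growth-based school effects build on. It is a minimal sketch under assumed data (a full model would also nest students within schools), not the specification used by Raudenbush (1989) or the studies cited above, and the file and column names are hypothetical.

```python
# Minimal growth-curve sketch: between-school variance in achievement growth rates.
import pandas as pd
import statsmodels.formula.api as smf

long = pd.read_csv("scores_long.csv")    # hypothetical columns: student, school, wave, score
model = smf.mixedlm(
    "score ~ wave",                      # average growth per measurement wave
    long,
    groups=long["school"],
    re_formula="~wave",                  # random intercept and random slope per school
)
fit = model.fit()
print(fit.cov_re)                        # variances of school intercepts and growth rates (slopes)
```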

Effects of schooling versus non-schooling

Another approach to rule out that effects of schooling reflect external influences utilizes a counterfactual that represents “non-schooling”. Because such estimates do not depend on between-school variation, they are sometimes referred to as absolute effects of schooling. Examples include evaluating the effects of “summer learning” and specific applications of the regression discontinuity design. The application of the regression discontinuity design in educational effectiveness research capitalizes on the fact that, in many countries, being assigned to a school or grade level depends on a specific age level. If the cut-off date that determines what grades students should be placed in is strictly followed, the effect of one year of schooling can be assessed by adjusting the difference in achievement between two (or more) grades in a row for the effect of age. This adjusted difference reflects the gap between the oldest students in the lower grade and the youngest in the upper grade (Cahan & Davis, 1987; Luyten, 2006). In this case, the effect of schooling is zero if the difference in achievement between grades is completely accounted for by the age of the students (Scheerens, 2007, pp. 173–174). In this approach, the “net” effect of schooling is the learning gain between adjacent grade levels that is not explained by age. Applying these methods yielded effect sizes (Cohen’s d) for one year of schooling in the order of .99 (reading speed), .84 (mathematics), and .33 (when the summer learning approach was applied) (Scheerens, 2007, Chapter 5). In another application of the regression discontinuity design, Kyriakides and Luyten (2009) found effect sizes of .87 for both mathematics and reading. Luyten, Merrell, and Tymms (2017) concluded that the net effect of schooling is usually less than half of the between-grade discrepancy effect sizes and varies between d = .20 and d = .50, depending on grade level and subject-matter area. Such effect sizes are in the range of small to medium according to Cohen’s benchmarks. On the basis of these methods, it is possible to compute the cumulative effect of schooling across grade levels; Luyten et al. (2017) reported cumulative effect sizes over six grades of primary schooling of .90 (general ability), 1.35 (mental arithmetic), 1.62 (reading), and 1.92 (general math). Ideally, the school effect, established by means of this application of the regression discontinuity design, would take care of all confounding influences from external factors by controlling for age. This assumption can be checked by including additional student-level co-variables or interaction terms in the analyses (Perry, 2017), which would generally result in lower estimates of the school effect. This is demonstrated in the study by Luyten (2006), in which gender and number of books at home were taken into account. This caused the magnitude of the school effects to decrease, although the reduction was relatively small.
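The between-grades logic described above translates into a very small regression: achievement regressed on age and a dummy for the upper grade, where the dummy coefficient is the age-adjusted effect of one year of schooling. The sketch below is a minimal illustration of that idea, not a reproduction of the cited analyses by Cahan and Davis (1987) or Luyten (2006); the file and column names are hypothetical, and standardizing by the overall score standard deviation is a simplification.

```python
# Minimal between-grades regression discontinuity sketch: the effect of one year of
# schooling as the grade-dummy coefficient, net of age.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("two_adjacent_grades.csv")   # hypothetical columns: score, age_months, upper_grade (0/1)
fit = smf.ols("score ~ age_months + upper_grade", df).fit()

one_year_effect = fit.params["upper_grade"]   # achievement gain attributable to schooling, net of age
d = one_year_effect / df["score"].std()       # expressed as a Cohen's d-type effect size
print(f"Effect of one year of schooling: d = {d:.2f}")
```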

Part 2: treatment effects in schooling

How much difference schools, classes, or educational systems make to student achievement was discussed in the previous sections. The next question is to what extent these effects of belonging to a certain unit, or of being treated or not by schooling, can be “explained” by more specific treatments, that is, malleable factors active at specific unit levels. As stated in the introduction, treatments are both intervention programmes and factors in educational policy and practice that are amenable to influence by agents in the educational domain and hypothesized to enhance educational performance. Of these two kinds of treatments, the emphasis in this article will not be on intervention programmes but rather on effectiveness-enhancing factors in educational practice. This implies that, when discussing methodological issues and study characteristics, there will be a certain emphasis on correlational studies that exploit variance in field settings.

The discrepancy between unit (system, school, classroom) effects and treatment effects is illustrated by Hanushek (2011) with regard to teacher effects:


Literally hundreds of research studies have focused on the importance of teachers for student achievement. Two key findings emerge. First, teachers are very important. No other measured aspect of schools is nearly as important in determining student achievement. Second, it has not been possible to identify any specific characteristics of teachers that are reliably related to student outcomes. (p. 467)

Similarly, Rivkin, Hanushek, and Kain (2005) concluded that “teachers have powerful effects on reading and mathematics achievement, though little of the variance in teacher quality is explained by observable variables, such as education or experience” (p. 449). Similar conclusions had been reached by Nye, Konstantopoulos, and Hedges (2004). Rockoff, Jacob, Kane, and Staiger (2011) used a broad set of teacher characteristics, which included, in addition to experience and formal qualifications, personality traits as predictors of overall teacher effects. They found limited effects for observable teacher measures. Brandsma (1993) found that only 10% of the student variation was explained by the malleable school variables included in his study on Dutch primary schools. In contrast, Baumert et al. (2010) found sizable effects for teachers’ “pedagogical content knowledge” (d = 0.30) on mathematics achievement net of prior knowledge of mathematics, reading literacy test scores, mental ability, and SES (which had no effect).

The general point is that, although we have ample evidence that “teachers and schools matter” in terms of variance components, the question of how they matter is more complex. In a subsequent section, we will look at empirical evidence on the size of treatments and malleable conditions based on meta-analyses.

Assuming an average effect of a year’s schooling in the order of d = .40 to d = .50 (Hattie, 2009; Luyten et al., 2017), individual treatment variables in schooling that show effects in the order of d = .20 can be considered important, while effect sizes around d = .10 would be small. Cohen’s typology would denote these effects as small to very small. The What Works Clearinghouse (2008) in the US considers programmes effective if they produce effect sizes of at least .25. However, average effects may hide considerable variation, as estimates of school effects vary between subject-matter areas and grade levels. Moreover, study characteristics may also influence average effect sizes (see the next section).
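To relate this benchmark to the variance decompositions discussed in Part 1, a worked conversion (using the standard d-to-r formula, not a figure from the cited sources) is

\[ r = \frac{d}{\sqrt{d^{2} + 4}}, \qquad d = .20 \;\Rightarrow\; r \approx .10 \;\Rightarrow\; r^{2} \approx .01, \]

so an educationally meaningful treatment effect of d = .20 corresponds to roughly 1% of explained variance in student achievement, and d = .10 to about 0.25%.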

Before looking further at substantive results in terms of effect sizes for effectiveness-enhancing malleable variables, some methodological issues are addressed. These methodological issues may at least partially explain inconsistencies in treatment effect estimates across meta-analyses.

Methodological issues

Study characteristics and treatment effect sizes

Cheung and Slavin (2016) showed that higher effect sizes are associated with researcher-constructed tests rather than independently designed standardized tests, small- rather than large-scale studies, quasi-experimental rather than experimental designs, elementary rather than secondary schools, and published rather than unpublished work (see Table 2). Explanations for these differences are a better control of treatments in smaller scale studies, selection bias in quasi-experiments (e.g., in the sense that subjects in treatment groups may be more motivated), and researcher-made instruments more tailored to the treated group. Publication bias occurs because journals are more likely to reject studies with small or statistically insignificant effects and authors are reluctant to write up and submit articles with small educational effects.

Table 2. Study characteristics affecting effect sizes, after Cheung and Slavin (2016, pp. 287–289).

Study characteristic            Effect sizes bigger          Effect sizes smaller
Independence of instruments     Researcher-made: .40         Independent: .20
Scale of the study              < 250: .30                   > 250: .16
Study design                    Quasi-experimental: .23      Experimental: .16
Grade level                     Elementary: .20              Secondary: .17
Publication status              Published: .30               Grey literature: .16

Educational effectiveness research frequently uses “correlational”, non-experimental research designs, given the difficulties of administering true and quasi-experimental designs. Longitudinal studies can separate the effects of malleable conditions from “given” antecedents, such as prior achievement, but in cross-sectional studies, such as international assessment studies, this is not possible. The advantage of using school effectiveness models in non-experimental causal modelling is that effects of specific school effectiveness-enhancing factors can be identified. In this way, correlational studies could be seen as having potentially high practical value, as they depend on ongoing educational practice. The major problem is low internal validity, as it is extremely difficult to rule out all disturbing internal and external conditions in such studies. When considering the distinction between experimental and correlational as a study characteristic that might influence effect sizes, low internal validity might create upward as well as downward bias in effect-size estimation.

Relatively underutilized theories and conceptual models

In the field of educational effectiveness, there is broad consensus about the strategies and malleable conditions that matter. This is illustrated in Table 3, which shows lists of effectiveness-enhancing factors from different sources. The first source is research reviews on educational effectiveness (Hopkins, Stringfield, Harris, Stoll, & Mackay, 2014; Muijs et al., 2014; Reynolds et al., 2014), the second source is components of Comprehensive School Reform (CSR) Programs in the US, and the third source is behavioural categories that are part of teacher observation instruments in the Measuring Effective Teaching (MET) study (Kane, McCaffrey, Miller, & Staiger, 2013).

The way these sets of factors developed over time is an eclectic process, involving common-sense thinking about school organization and teaching, supported by accumulating empirical research evidence, but rarely driven by theory or conceptual models. The influence of more established theories can be discerned in teaching strategies originating from behaviourism or cognitive theory (constructivism), or in organization and planning models (Scheerens, 2013a, 2015). Several conceptual multilevel educational effectiveness models have been around for about three decades (Scheerens, 1992; Stringfield & Slavin, 1992) and were also updated in more recent versions (Creemers & Kyriakides, 2008; Scheerens, 2016). What these models show is a choice of factors, very much along the lines of what is illustrated in Table 3, ordered across aggregation levels (system, school, classroom/teachers, students), a distinction between malleable variables and background conditions at each level, assumptions on how factors interact, and sometimes reference to more general underlying dimensions of the specific factors. A basic notion is that malleable conditions at a higher organizational level facilitate malleable factors at lower levels, and an example of a more general underlying dimension is the issue of alignment versus loose coupling of factors within and across levels. Causal modelling in which moderators and mediators are distinguished and explored by means of path analytic techniques is a beginning of empirically testing some of these model assumptions (Desimone & Hill, 2017; Scheerens, 2013b). Best practices in the development of school improvement programmes in which effectiveness-enhancing school and teaching conditions are combined are the earlier cited Comprehensive School Reform (CSR) Programs in the United States and applications of the dynamic model developed by Creemers and Kyriakides (2008). Still, consistently thinking through how combinations of factors are expected to stimulate student learning is a relatively underdeveloped area in this field. On the other hand, we see theoretically ungrounded and empirically unsupported ideas getting a great deal of publicity. In particular, there are currently high expectations of the effects of school organization and management on student performance, due to several reports by McKinsey (e.g., Mourshed, Chijioke, & Barber, 2010).

Table 3. Effectiveness-enhancing factors, from reviews, CSR, and observation of effective teaching.

Sources: research reviews, Scheerens (2014); CSR (Borman, Hewes, Overman, & Brown, 2003); MET observation categories (Kane, Kerr, & Pianta, 2014, p. 246).

Effective leadership
  CSR: Integrative school management; shared leadership
Academic focus
  CSR: Measurable goals for student academic achievement
A positive orderly climate
  MET: The majority of students were on task throughout the class.
High expectations
  CSR: Benchmarks for student achievement
Monitoring progress
  MET: The teacher used formative assessment effectively to be aware of the progress of all students.
Parental involvement
  CSR: Meaningful involvement of parents and the local community
Staff professional development
  CSR: High-quality and continuous teacher and staff professional development
Effective teaching (time)
  MET: An appropriate amount of time was devoted to each part of the lesson.
Opportunity to learn
  MET: Content communicated through direct and non-direct instruction by the teacher is consistent with deep knowledge and fluency with the mathematics concepts of the lesson. The lesson allowed students to engage with or explore important concepts in mathematics (instead of focusing on techniques that may only be useful in exams).
Structuring and scaffolding
  MET: The lesson was well organized and structured.
Teaching learning strategies
  MET: The lesson included an investigative or problem-based approach to important concepts in mathematics.
Pupil involvement/active teaching
  MET: The classroom environment encouraged students to generate ideas, questions, conjectures, and/or propositions that reflected engagement or exploration with important mathematics concepts.
Further CSR component: Employs proven methods for student learning, teaching, and school management that are founded on scientifically based research and effective practices and have been replicated successfully in schools.

Few standardized instruments to measure effectiveness-enhancing conditions

Despite the fact that the list of input and process factors considered to enhance student achievement has been remarkably consistent for about four decades, there are few standardized measures. This lack of standardized instruments is a weakness (e.g., Muijs & Brookman, 2016; Scheerens, 1992). The dominant practice is the development of new instruments for each new study. Two examples of standardized instruments are the Hallinger scale, which measures instructional leadership and has been applied in many contexts and studied in about a hundred research studies (Hallinger & Wang, 2015), and the International Comparative Analysis of Learning and Teaching (ICALT) instrument, which measures teaching (Van de Grift, Chung, Maulana, Lee, & Helms-Lorenz, 2017).

Possibly reactive and confounded measurement of factors

In contrast to the attention given to the outcome measures, the quality of the instruments that measure treatment variables is rather neglected (lack of conceptual rationales, few standardized instruments, and each study reinventing the wheel).

Self-reports from teachers are vulnerable to social desirability and confounding of process and outcome measures. This is illustrated by the variable “high expectations of student performance”. This variable can be understood as a positive and optimistic attitude that stimulates student engagement and performance. On the other hand, it may simply reflect teachers’ knowledge about actual student performance (Brophy, 1983). Furthermore, little attention is given to the interdependence of factors when several are included in one study. Because of multi-collinearity, the estimate of the effect sizes is dependent on the set of other factors included in the model (Scheerens, 2014).

Results from meta-analyses

Empirical results on the effects of treatments in schooling comprise a research literature denoted as school or educational effectiveness research. As already indicated above, reviews of educational effectiveness are generally consistent in listing which treatments and effectiveness-enhancing factors are effective. These include instructional leadership and cooperation between teachers, while a range of other variables are more closely associated with teaching and learning: opportunity to learn, time investment, achievement orientation, parental involvement, structured teaching, school and classroom climate, and frequent monitoring and evaluation.

Despite consensus on which factors matter, meta-analyses show considerable divergence in mean effect sizes for each of these factors. For example, at the school level, Marzano (2003) reports an average effect size for “opportunity to learn” of 0.88, while Hattie (2009) reports an effect size of 0.39 for a related variable, “enrichment programmes for gifted children”. For “monitoring”, Hattie reports an average effect size of .64, while Marzano reports 0.30. “School leadership” has a small effect size of around 0.10, according to Marzano (2003) and Scheerens (2007), whereas Hattie’s (2009) estimate is 0.34. The pattern across these three meta-analyses is similar for “cooperation” but with much smaller coefficients (.06, .04, and .18). At the classroom level, Seidel and Shavelson (2007) reported an effect size as low as .08 for a combination of learning time and opportunity to learn, while Hattie’s estimate was 0.36. A recent set of related meta-analyses found small effect sizes of 0.10, 0.10, 0.14, 0.10, and .12 for “learning time”, “homework”, “monitoring”, “assessment”, and “school leadership”, respectively (Hendriks, 2014, Chapter 4; Scheerens, 2012, 2014).1 The small effect size for leadership stands in sharp contrast to the optimistic accounts in the school improvement literature (e.g., Day et al., 2009).

Other evidence of treatment effects

The Measuring Teacher Effectiveness study (MET), funded by the Bill & Melinda Gates foundation, investigated different methods to assess teacher effectiveness (Kane et al., 2014). Three types of measures were investigated and compared:

(1) past performance of teachers based on value-added achievement (VAM) of students taught in previous school years;
(2) student ratings of teacher quality;
(3) observation of teachers’ teaching.

The three methods to assess teaching effectiveness are fundamentally different. VAM studies exhibit much greater predictive validity than the other two approaches. The VAM method is the most statistically defensible approach but does not reveal anything about which aspects of teaching are more effective. With respect to teaching effectiveness, this measure says about as much or as little as the studies that, by means of variance decomposition, established a relevant teacher effect, in the sense of the difference it makes to be taught by one teacher or the next. The student rating measures are subjective ratings of teachers aggregated for each teacher. The observation method is based on observers categorizing behaviours that are considered as stimulating achievement, and it therefore covers specific effectiveness-enhancing teaching factors. Results on the predictive validity of each of these three methods are relevant to our discussion on the size of effective treatments in schooling. Results indicate that the past performance measure based on earlier VAM is by far the best predictor, with standardized regression coefficients in the order of .40, consistent over subject-matter areas and school level (elementary and secondary). Coefficients for student ratings and classroom observation are considerably lower, in the order of .15 and .10, respectively. Moreover, the rating and observational coefficients are much less consistent across subject-matter areas and grade levels (Kane et al., 2014). The practical implication is that observed teacher behaviour is considered to have a predictive validity that is too low to support application of this method as the sole procedure to assess teacher effectiveness. In the context of the present analysis, these findings, particularly the relatively low predictive validity (.10) of behavioural observation of effectiveness-enhancing teaching conditions, corroborate previously discussed results, all pointing to small treatment effects.

A second interesting case is the results of meta-analyses of Comprehensive School Reform (CSR) Programs (Borman et al., 2003). CSR programmes2 are based on best-evidence syntheses of educational effectiveness research on specific effectiveness-enhancing conditions. As such, evaluations and meta-evaluations of the results of these programmes are crucial tests of the available knowledge base. According to Borman (2009):


Our various analyses suggest that… CSR schools can be expected to score between nearly one-tenth and one-seventh of a standard deviation… higher than control schools on achievement tests. The low-end estimate represents the overall CSR effect size of d = .09 for third-party studies using comparison groups, and the high-end estimate represents the effect size of d = .15 for all evaluations of the achievement effects of CSR. (p. 55)

He concludes that large-scale reform is capable of widespread, but only modest, improvements in student achievement.

Discussion

Relatively small effect sizes of treatments, long lead time of educational reforms to take effect, and equifinality of treatments representing classroom teaching

Overall effects of schooling, expressed as the cumulative effect of attending 6 years of primary school, are by no means small. Moreover, multiplier effects of school-level conditions over large numbers of students should be considered as well. But the effects of malleable aspects of schooling, or treatments, are small when compared to the influence of stable student characteristics. This is underlined in recent studies where adjustments are made for intelligence, aptitude, and/or previous achievement, as demonstrated in other contributions to this special issue. Treatment effects vary across subject-matter area and grade level, so benchmarks for interpreting effect size should ideally be specific to combinations of subject matter and grade level. Effects tend to be larger for subject matter that is more exclusively offered at school (e.g., mathematics), as compared to subjects that are supported by “the informal curriculum” of home and other societal contexts (e.g., reading). Results of meta-analyses on the effectiveness of malleable school variables and treatments vary not only because of subject and grade differences but also because of variation in study characteristics, such as externality of achievement measures, scale of the study, and study design.

Despite broad consensus on the kind of factors that “work” in schooling, explicit rationales and models, let alone theory-based conjectures, are often lacking in educational effectiveness research. Besides, there are few standardized instruments to measure the effectiveness of school programmes and interventions, and the widespread use of self-reports in those studies is likely to result in respondents providing socially desirable answers.

When taking results from meta-analyses as a basis for assessing the state of the art of treatment effects in schooling, the considerable variation in average effect sizes is not surprising, given the various sources of this variation, namely, different model specifications, study characteristics, and no standardization of treatment measures. The diversity in results of meta-analyses makes it difficult to draw overall conclusions, but the norm in school effectiveness studies is small effect sizes, below the benchmark of d = .20, which is seen as indicative of “educational significance”. Meta-analyses on effective teaching strategies tend to show higher effect sizes than school effectiveness studies (e.g., De Boer, Donker, & Van der Werf, 2014; Donker, De Boer, Kostons, Dignath-Van Ewijk, & Van der Werf, 2014; Hattie, Biggs, & Purdie, 1996), in the order of d = .40 to .80. At the same time, there are also meta-analyses on teaching factors that show very small effect sizes (Creemers & Kyriakides, 2008; Scheerens, 2007; Seidel & Shavelson, 2007). A hypothetical explanation for the difference in effect sizes between school and teaching effectiveness studies could be the larger scale of field studies that take schools and school-level strategies as a focus, in comparison to smaller scale experimental teaching effectiveness studies.

In teaching effectiveness studies, quite different approaches may show similarly sized positive effects (Hickendorff et al., 2017), sometimes interpreted as “everything works” (Hattie, 2009, p. 15). A striking example is provided by studies that find no significant differences between structured teaching strategies and constructivist strategies (Kirschner, Sweller, & Clark, 2006; Louis, Dretzke, & Wahlstrom, 2010). These kinds of results can be subsumed under the concept of equifinality, as defined in general systems theory (von Bertalanffy, 1968, p. 40). Equifinality is the principle that in open systems a given end state can be reached by many potential means. Equifinality adds to an overall impression of relative “indifference” and “bluntness” in the way schools operate in producing learning progress. If “many roads lead to Rome”, a general sense of direction might be sufficient to reach the destination, albeit slowly. If basic conditions of good schooling are in order (knowledgeable teachers, an orderly environment, a general sense of direction), it may not make much difference which specific methods are applied. This conclusion could be taken as urging the search for more basic underlying conditions of good schooling (Scheerens, 2016). Relatively small treatment effects in large-scale field studies and equifinality in teaching effectiveness should temper the high expectations associated with reform agendas expressed in some of the literature on systemic reform and school improvement (Day et al., 2009; Mourshed et al., 2010).

The contribution of the four empirical studies reported in this special issue

The review presented in this paper tends to conclude that malleability in educational effectiveness is limited. The conclusions on “limited malleability” are partly confirmed and partly challenged in the four empirical studies that make up this special issue.

The first paper in this special issue, by Marks, demonstrates small “net” school effects when adjustments are made for subject-specific and across-subject prior achievement. The outcomes of this study were placed in the perspective of research that examined the impact of heredity and intelligence on educational achievement.

In the second paper, by He, Van de Vijver, and Kulikova, a broad range of “non-educational” contextual influences was correlated with student achievement. It appeared that not all plausible educational policy levers that were considered had the expected positive correlation. This paper can be seen as innovative for studying system-level educational effectiveness, as it lays the foundation for a fuller treatment of cultural conditions that affect educational performance.

In the third paper, by Aloisi and Tymms, average performance of countries shows much stability across six waves (2000–2015) of OECD’s PISA study. In addition, the paper illustrates that in jurisdictions where relatively large changes occurred, these changes were to a large extent due to contextual and not to educationally malleable factors. Finally, the one educationally malleable variable that was studied in depth, namely, curriculum innovation, showed hardly any effect. The methodology that was used in this study has several interesting features, such as the use of growth curve analyses of country-level educational performance and the way “curriculum change” was defined and longitudinally measured in the study. Using country-level longitudinal data to measure progress in educational performance makes it possible to define and empirically address the temporal aspect of treatment measurements, in the sense of defining the initiation and the effective duration of a treatment.

In the fourth paper, by Timmermans and Van der Werf, growth curve analysis was applied to a longitudinal data set based on the monitoring of student progress in Dutch primary schools. The results appear to challenge the hypothesis of “limited malleability”, as relatively large “value-added” effect sizes were established using growth models. Similar large school effects have been found in other school effectiveness studies that used growth curve analyses. Dumay et al. (2014, p. 77) argue that the internal validity of the growth model is high, because the growth estimates are more efficient in isolating the part of the between-school variance in performance that is less associated with exogenous factors such as the school’s social composition. At the same time, they prove to be less stable, and therefore less reliable for predicting future success. Given the relatively limited set of studies on progress using growth curve modelling, there are still open questions about the impact of adding additional control variables. Moreover, high “net” school effects are “unit effects” and not treatment effects as defined in this article. High “net” unit effect sizes do not guarantee high treatment effect sizes, which is illustrated in the study by Rowan et al. (2002). These authors found net school effects in the order of d = .50 and a treatment effect for “opportunity to learn” of .10. Another facet of the specific behaviour of growth curve estimates is the way they should be interpreted. The more traditional variance decomposition models facilitate a relatively simple additive interpretation of the impact of contextual variables and treatment effects. Such interpretations could be seen as useful in practical applications such as school self-evaluations and external school evaluation. The relatively low stability of the growth curve coefficients over time, as compared to estimates based on adjusted performance status (Dumay et al., 2014; Timmermans & Van der Werf, this issue), would make this method more problematic for such practical applications. But if growth curve analysis would indeed prove the most internally valid way of assessing net effects of schooling, it might become the preferred method in educational effectiveness research.

Conclusion: making up the balance for the “limited malleability” thesis

At the end of the line, it has to be acknowledged that, in this paper, the hypothesis of “limited malleability” in educational effectiveness cannot be confirmed or rejected. The exploration that was carried out could not cover the broad range of phenomena and the sheer magnitude of relevant research that are subsumed under the thesis. As compared to a fully representative review, this paper’s status is more like a position paper and a possible prelude to future studies that focus on facets of the overall thesis. The research that was reported leads to some results that support the thesis and other results that challenge it. Notwithstanding the nuances in the summary of main results presented below, there is sufficient ground to keep the hypothesis of “limited malleability” in educational effectiveness on the table for further research and analyses, hopefully also inspiring a more realistic and efficient attitude to system-level educational reform and school improvement.

Value-added school effects tend to be small when covariance adjustments are made for prior achievement. The argument for the persuasiveness of this method is corroborated by studies that found high correlations between student background variables, such as intelligence, and educational achievement. Another supportive indicator for this conclusion about small school effects is the high stability of student achievement across time.

According to the logic of value-added analyses, the size of treatment effects depends on the share of total student-level variation that remains unexplained by previous achievement and other student background variables. This sets an upper limit for the size of the effects of specific effectiveness-enhancing (treatment) variables and is therefore a straightforward explanation for modest treatment effects, assuming strong influence of the adjustment variables.

Estimates of school effects expressed as student achievement growth over more than two points in time by means of growth curve analyses are astonishingly high. Conceptually, the coefficients yielded by this method are interesting, as they can be thought of as efficiently isolating the part of the between-school variance that is independent of exogenous variables (Dumay et al., 2014). Yet, the slope ratios on which the school effects are based may be lacking a sound statistical rationale. The evidence on how the growth effects “behave”, in terms of stability and sensitivity to student background variables, is sparse and, to the degree that it exists, not quite convincing.

Effects that are computed by comparing the learning progress of same-age students in different grades are sometimes indicated as “absolute” school effects, because they do not depend on between-school variance (and are therefore described as effects of schooling). Although they are not strictly to be seen as “value added”, one could say that the comparison is controlled for age. Effect sizes for a year of schooling are about d = .50, on average. The feasibility of this method of estimating effects of schooling for assessing treatment effects of malleable school factors depends on there being sufficient eligible students per school, which might prove to be difficult in practice.

The field of educational effectiveness research manifests a relatively strong consensus on which factors matter. Meta-analyses show strongly diverging effect sizes per factor. Effect sizes for treatment variables differ by subject matter area, grade level, study characteristics, and for school organizational as compared to teaching factors. This diversity makes it extremely difficult to draw average conclusions about effect sizes being high or low, although the logic of “narrow margins” resulting from properly conducted value-added analyses makes modest effects more plausible.

Effect sizes of school-level malleable conditions tend to be lower than those of factors that measure teaching characteristics. The fact that school effectiveness studies are more often large-scale field studies with non-experimental designs, whereas teaching effectiveness studies are frequently small-scale experimental or quasi-experimental studies, might partially explain this.

The impression that, in teaching effectiveness studies, approaches based on strongly different schools of thought (structured teaching vs. constructivism) show quite similar effect sizes can be seen as proof of “equifinality”: different methods are equally effective in attaining a fixed goal. Equifinality could be seen as indicative of schooling being relatively robust with respect to all kinds of methodological variations, conditional on the presence of basic conditions of well-functioning. Equifinality is another way of saying that educational systems are difficult to change in ways that matter, and as such it is just another manifestation of limited malleability.


Conservation and stability are also what is demonstrated in studies that compared growth or decline of country-level achievement in international data sets. The study by Aloisi and Tymms (this issue) clearly demonstrates this, as have other analyses. Empirical studies that have analysed the effects of system-level levers of educational performance represent a relatively young field, so far not showing convincing effects, as effect sizes tend to be very small and inconsistent between studies. The two studies in this special issue that carried out system-level analyses (Aloisi & Tymms, this issue; He et al., this issue) underline the importance of conditions and influences outside the educational province (such as cultural factors) that may complicate assessing the impact of educational policy measures.

Relatively small treatment effects in large-scale field studies, high stability in the performance of educational systems over time, and equifinality in teaching effectiveness call for a more conservative attitude toward educational reform and school improvement. Little is to be expected of governments turning out a continuous stream of short-lived policy initiatives and “innovations”. A similar kind of overproduction of change in organizational conditions at the school level, under the heading of “continuous improvement”, is likely to have disappointing results as well. Maintaining and incrementally improving educational performance is best served by consistent support of basic conditions of good schooling close to the primary processes of teaching and learning, in the realm of teacher training, curriculum alignment, performance monitoring, and formative assessment.

Notes

1. The effect sizes in the section “Results from meta-analyses” are d coefficients.

2. Comprehensive school reforms are based upon scientifically based research and effective practices that include an emphasis on basic academics and parental involvement so that all children can meet challenging academic content and academic achievement standards.

Acknowledgement

The author is indebted to Gary N. Marks for input and comments on previous versions of this article.

Disclosure statement

No potential conflict of interest was reported by the authors.

References

Anumendem, N. D., De Fraine, B., Onghena, P., & Van Damme, J. (2017). Growth in reading comprehension and mathematics achievement in primary school: A bivariate transition multilevel growth curve model approach. Biometrics & Biostatistics International Journal, 5(4), 00137. doi:10.15406/bbij.2017.05.00137

Baumert, J., Kunter, M., Blum, W., Brunner, M., Voss, T., Jordan, A., … Tsai, Y.-M. (2010). Teachers’ mathematical knowledge, cognitive activation in the classroom, and student progress. American Educational Research Journal, 47(1), 133–180. doi:10.3102/0002831209345157

Borman, G. D. (2009, March). National efforts to bring reform to scale in America’s high-poverty elementary and secondary schools: Outcomes and implications. Paper commissioned by the Center on Education Policy, Washington, DC, for its project on Rethinking the Federal Role in Education. Retrieved from https://files.eric.ed.gov/fulltext/ED504825.pdf

Borman, G. D., Hewes, G. M., Overman, L. T., & Brown, S. (2003). Comprehensive school reform and achievement: A meta-analysis. Review of Educational Research, 73(2), 125–230. doi:10.3102/00346543073002125

Brandsma, H. P. (1993). Basisschoolkenmerken en de kwaliteit van het onderwijs [Characteristics of primary schools and the quality of education]. Groningen: RION.

Brophy, J. (1983). Research on the self-fulfilling prophecy and teacher expectations. Journal of Educational Psychology, 75(5), 631–661.doi:10.1037/0022-0663.75.5.631

Cahan, S., & Davis, D. (1987). A between-grade-levels approach to the investigation of the absolute effects of schooling on achievement. American Educational Research Journal, 24(1), 1–12. doi:10.3102/00028312024001001

Cheung, A. C. K., & Slavin, R. E. (2016). How methodological features affect effect sizes in education. Educational Researcher, 45(5), 283–292.doi:10.3102/0013189X16656615

Creemers, B. P. M., & Kyriakides, L. (2008). The dynamics of educational effectiveness: A contribution to policy, practice and theory in contemporary schools. London: Routledge.

Day, C., Sammons, P., Hopkins, D., Harris, A., Leithwood, K., Gu, Q.,… Kington, A. (2009). The impact of school leadership on pupil outcomes (Research Report No. DCSF-RR108). London: Department for Children, Schools and Families.

De Boer, H., Donker, A. S., & Van der Werf, M. P. C. (2014). Effects of the attributes of educational interventions on students’ academic performance: A meta-analysis. Review of Educational Research, 84(4), 509–545. doi:10.3102/0034654314540006

Desimone, L. M., & Hill, K. L. (2017). Inside the black box: Examining mediators and moderators of a middle school science intervention. Educational Evaluation and Policy Analysis, 39(3), 511–536. doi:10.3102/0162373717697842

Detterman, D. K. (2016). Education and intelligence: Pity the poor teacher because student characteristics are more significant than teachers or schools. The Spanish Journal of Psychology, 19(E93), 1–11. doi:10.1017/sjp.2016.88

Donker, A. S., De Boer, H., Kostons, D., Dignath-Van Ewijk, C. C., & Van der Werf, M. P. C. (2014). Effectiveness of self-regulated learning strategies on academic performance: A meta-analysis. Educational Research Review, 11, 1–26.

Dumay, X., Coe, R., & Anumendem, N. D. (2014). Stability over time of different methods of estimating school performance. School Effectiveness and School Improvement, 25(1), 64–82. doi:10.1080/09243453.2012.759599

Guldemond, H., & Bosker, R. J. (2009). School effects on students’ progress – A dynamic perspective. School Effectiveness and School Improvement, 20(2), 255–268.doi:10.1080/09243450902883938

Hallinger, P., & Wang, W.-C. (2015). Assessing instructional leadership with the Principal Instructional Management Rating Scale. Heidelberg: Springer.

Hanushek, E. A. (2011). The economic value of higher teacher quality. Economics of Education Review, 30(3), 466–479.doi:10.1016/j.econedurev.2010.12.006

Hattie, J. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Abingdon: Routledge.

Hattie, J., Biggs, J., & Purdie, N. (1996). Effects of learning skills interventions on student learning: A meta-analysis. Review of Educational Research, 66(2), 99–136.doi:10.3102/00346543066002099

Hendriks, M. A. (2014). The influence of school size, leadership, evaluation, and time on student outcomes: Four reviews and meta-analyses (Doctoral thesis). Enschede: University of Twente.

Hickendorff, M., Mostert, T. M. M., Van Dijk, C. J., Jansen, L. L. M., Van der Zee, L. L., & Fagginger Auer, M. F. (2017). Rekenen op de basisschool: Review van de samenhang tussen beïnvloedbare factoren in het onderwijsleerproces en de rekenwiskundeprestaties van basisschoolleerlingen [Arithmetic in primary schools: Review of the association between malleable factors in the teaching and learning process and mathematics performance of primary school students]. Leiden: University of Leiden.

Hopkins, D., Stringfield, S., Harris, A., Stoll, L., & Mackay, T. (2014). School and system improvement: A narrative state-of-the-art review. School Effectiveness and School Improvement, 25(2), 257–281.


Kane, T. J., Kerr, K. A., & Pianta, R. C. (2014). Designing teacher evaluation systems: New guidance from the Measures of Effective Teaching Project. San Francisco, CA: Jossey-Bass.

Kane, T. J., McCaffrey, D. F., Miller, T., & Staiger, D. O. (2013). Have we identified effective teachers? Validating measures of effective teaching using random assignment (Research paper). Seattle, WA: Bill & Melinda Gates Foundation.

Kirschner, P. A., Sweller, J., & Clark, R. E. (2006). Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching. Educational Psychologist, 41(2), 75–86.doi:10.1207/s15326985ep4102_1

Kyriakides, L. (2006). Using international comparative studies to develop the theoretical framework of educational effectiveness research: A secondary analysis of TIMSS 1999 data. Educational Research and Evaluation, 12(6), 513–534.doi:10.1080/13803610600873986

Kyriakides, L., & Luyten, H. (2009). The contribution of schooling to the cognitive development of secondary education students in Cyprus: An application of regression discontinuity with multiple cut-off points. School Effectiveness and School Improvement, 20(2), 167–186. doi:10.1080/09243450902883870

Louis, K. S., Dretzke, B., & Wahlstrom, K. (2010). How does school leadership affect student achievement: Results from a national US survey. School Effectiveness and School Improvement, 21(3), 315–336. doi:10.1080/09243453.2010.486586

Luyten, H. (2006). An empirical assessment of the absolute effect of schooling: Regression-discontinuity applied to TIMSS-95. Oxford Review of Education, 32(3), 397–429. doi:10.1080/03054980600776589

Luyten, H., Merrell, C., & Tymms, P. (2017). The contribution of schooling to learning gains of pupils in Years 1 to 6. School Effectiveness and School Improvement, 28(3), 374–405. doi:10.1080/09243453.2017.1297312

Marks, G. N. (2015). The size, stability, and consistency of school effects: Evidence from Victoria. School Effectiveness and School Improvement, 26(3), 397–414. doi:10.1080/09243453.2014.964264

Marzano, R. J. (2003). What works in schools: Translating research into action. Alexandria, VA: Association for Supervision and Curriculum Development.

Mourshed, M., Chijioke, C., & Barber, M. (2010). How the world’s most improved school systems keep getting better. London: McKinsey and Company.

Muijs, D., & Brookman, A. (2016). Quantitative methods. In C. Chapman, D. Muijs, D. Reynolds, P. Sammons, & C. Teddlie (Eds.), The Routledge international handbook of educational effectiveness and improvement (pp. 173–201). Abingdon: Routledge.

Muijs, D., Kyriakides, L., Van der Werf, G., Creemers, B., Timperley, H., & Earl, L. (2014). State of the art – Teacher effectiveness and professional learning. School Effectiveness and School Improvement, 25(2), 231–256. doi:10.1080/09243453.2014.885451

Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237–257.doi:10.3102/01623737026003237

Opdenakker, M.-C., & Van Damme, J. (2000). Effects of schools, teaching staff and classes on achievement and well-being in secondary education: Similarities and differences between school outcomes. School Effectiveness and School Improvement, 11(2), 165–196.

Organisation for Economic Co-operation and Development. (2005). School factors related to quality and equity: Results from PISA 2000. Paris: Author.doi:10.1787/9789264008199-en

Palardy, G. J. (2008). Differential school effects among low, middle, and high social class composition schools: A multiple group, multilevel latent growth curve analysis. School Effectiveness and School Improvement, 19(1), 21–49.doi:10.1080/09243450801936845

Perry, T. (2017). Inter-method reliability of school effectiveness measures: A comparison of value-added and regression discontinuity estimates. School Effectiveness and School Improvement, 28(1), 22–38. doi:10.1080/09243453.2016.1203799

Raudenbush, S. W. (1989). The analysis of longitudinal, multilevel data. International Journal of Educational Research, 13(7), 721–740.doi:10.1016/0883-0355(89)90024-4

Reynolds, D., Sammons, P., De Fraine, B., Van Damme, J., Townsend, T., Teddlie, C., & Stringfield, S. (2014). Educational effectiveness research (EER): A state-of-the-art review. School Effectiveness and School Improvement, 25(2), 197–230. doi:10.1080/09243453.2014.885450


Rivkin, S. G., Hanushek, E. A., & Kain, J. F. (2005). Teachers, schools, and academic achievement. Econometrica, 73(2), 417–458.doi:10.1111/j.1468-0262.2005.00584.x

Rockoff, J. E., Jacob, B. A., Kane, T. J., & Staiger, D. O. (2011). Can you recognize an effective teacher when you recruit one? Education Finance and Policy, 6(1), 43–74.doi:10.1162/EDFP_a_00022

Rowan, B., Correnti, R., & Miller, R. J. (2002). What large-scale survey research tells us about teacher effects on student achievement: Insights from the Prospects Study of elementary schools. Teachers College Record, 104(8), 1525–1567. doi:10.1111/1467-9620.00212

Scheerens, J. (1992). Effective schooling: Research, theory and practice. London: Cassell.

Scheerens, J. (with Luyten, H., Steen, R., & Luyten-de Thouars, Y.). (2007). Review and meta-analyses of school and teaching effectiveness. Enschede: University of Twente, Department of Educational Organisation and Management.

Scheerens, J. (Ed.). (2012). School leadership effects revisited: Review and meta-analysis of empirical studies. Dordrecht: Springer.

Scheerens, J. (2013a). The use of theory in school effectiveness research revisited. School Effectiveness and School Improvement, 24(1), 1–38.doi:10.1080/09243453.2012.691100

Scheerens, J. (2013b). What is effective schooling: A review of current thought and practice. Washington, DC: International Baccalaureate Organization. Retrieved from http://www.ibo.org/globalassets/publications/ib-research/whatiseffectiveschoolingfinal-1.pdf

Scheerens, J. (Ed.). (2014). Effectiveness of time investments in education: Insights from a review and meta-analysis. Dordrecht: Springer.doi:10.1007/978-3-319-00924-7

Scheerens, J. (2015). Theories on educational effectiveness and ineffectiveness. School Effectiveness and School Improvement, 26(1), 10–31.doi:10.1080/09243453.2013.858754

Scheerens, J. (2016). Educational effectiveness and ineffectiveness: A critical review of the knowledge base. Dordrecht: Springer.

Scheerens, J., & Bosker, R. J. (1997). The foundations of educational effectiveness. Oxford: Pergamon.

Seidel, T., & Shavelson, R. J. (2007). Teaching effectiveness research in the past decade: The role of theory and research design in disentangling meta-analysis results. Review of Educational Research, 77(4), 454–499. doi:10.3102/0034654307310317

Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: An introduction to basic and advanced multilevel modelling (2nd ed.). London: Sage.

Stringfield, S. C., & Slavin, R. E. (1992). A hierarchical longitudinal model for elementary school effects. In B. P. M. Creemers & G. J. Reezigt (Eds.), Evaluation of educational effectiveness (pp. 35–69). Groningen: ICO.

Thomas, S. M., Gana, Y., & Muñoz Chereau, B. (2016). England: The intersection of international achievement testing and educational policy development. In L. Volante (Ed.), The intersection of international achievement testing and educational policy: Global perspectives on large-scale reform (pp. 37–57). New York, NY: Routledge.

Thomas, S. M., Peng, W. J., & Gray, J. (2010). Modelling patterns of improvement over time: Value added trends in English secondary school performance across ten cohorts. Oxford Review of Education, 33(3), 261–295.doi:10.1080/03054980701366116

Timmermans, A. C. (2012). Value added in educational accountability: Possible, fair and useful? (Doctoral dissertation). Groningen: GION Onderwijs/Onderzoek.

Tymms, P. (2004). Effect sizes in multi-level models. In I. Schagen & K. Elliot (Eds.), But what does it mean? The use of effect sizes in educational research (pp. 55–66). Slough: National Foundation for Educational Research.

Tymms, P., Merrell, C., & Wildy, H. (2015). The progress of pupils in their first school year across classes and educational systems. British Educational Research Journal, 41(3), 365–380. doi:10.1002/berj.3156

Van de Grift, W. J. C. M., Chung, S., Maulana, R., Lee, O., & Helms-Lorenz, M. (2017). Measuring teaching quality and student engagement in South Korea and The Netherlands. School Effectiveness and School Improvement, 28(3), 337–349.doi:10.1080/09243453.2016.1263215

Vermeer, N., & Van der Steeg, M. (2011). Onderwijsprestaties Nederland in internationaal perspectief [Educational achievement in The Netherlands in an international perspective] (CPB background document to CPB Policy Brief 2011/05). Den Haag: CPB.


von Bertalanffy, L. (1968). General systems theory. New York, NY: George Braziller.

What Works Clearinghouse. (2008). What Works Clearinghouse evidence standards for reviewing studies, Version 1.0. Retrieved from https://ies.ed.gov/ncee/wwc/Docs/referenceresources/wwc_version1_standards.pdf

Wright, S. P., White, J. T., Sanders, W. L., & Rivers, J. C. (2010). SAS EVAAS statistical models (SAS White Paper). Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.437.6615&rep=rep1&type=pdf
