
Faculty of Economics and Business

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)
(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions
(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and year of publication for the first reference and use the first name and et al. and year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).
(b) The title of the thesis
(c) Your name and student number
(d) Date of submission of the final version
(e) MSc in Econometrics
(f) Your track of the MSc in Econometrics

The Development and Selection of Pupils in the Dutch Primary School System

by: Mark D. Verhagen

Supervisor: Dr. J.C.M. van Ophem
Second reader: Dr. K.J. van Garderen

Abstract

This thesis addresses two enduring sources of controversy in the Dutch primary school system. First, the influence of heterogeneity among schools on pupil development is evaluated. Second, the assessment of a pupil's ability by a central test and by a teacher, which determine a pupil's level of secondary education, are compared. Additionally, explicit drivers of a mismatch between the two and a possible revision of a mismatch are analysed. By using multilevel models, pupil, class and school effects are accounted for.

As much as a quarter of the variability in pupil development is found to be due to school differences. These differences predominantly manifest themselves in a general manner, affecting all pupils within a school, and little evidence is found that some schools explicitly (dis)favour certain pupil groups. It is also found that teachers and central tests differ strongly in their evaluation of pupil ability, due to observable pupil characteristics, like test scores, but equally due to unobserved characteristics. It is found that these differences lead to structural upward and downward mismatches in the system. Given eligibility for an upward revision, better performing pupils are more likely to actually receive one, but revisions are found to be predominantly a school-level decision.

University of Amsterdam, Faculty of Economics and Business. MSc programme: Econometrics. Specialisation: Free Track.

Date of final version: August 15, 2017. Direct correspondence to: markdverhagen@gmail.com.


Statement of originality

This document is written by Mark D. Verhagen, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Educational research and the Dutch system
   2.1 Educational research since the Coleman Report
   2.2 The Dutch primary school system
   2.3 Prior literature on the Dutch system
   2.4 Design choices in educational research
3 Econometric methodology
   3.1 Statistical complexities in educational research
   3.2 Pupil development
       3.2.1 Hierarchical linear model
       3.2.2 Panel extension of the HLM
       3.2.3 Estimation technique
       3.2.4 Statistical inference
   3.3 Pupil advice
       3.3.1 Hierarchical ordered response model
       3.3.2 Estimation technique
       3.3.3 Unobserved heterogeneity in residuals
       3.3.4 Statistical inference
       3.3.5 Dichotomous hierarchical response model
4 Data & descriptive statistics
   4.1 Data
   4.2 Pupil development
   4.3 Pupil advice
   4.4 Modelling strategies
5 Results
   5.1 Pupil development
   5.2 Pupil advice
   5.3 Findings beyond the research scope
6 Concluding remarks


1 Introduction

This research addresses two hotly debated issues regarding the Dutch primary school system. First, whether differences in pupil development can be partly attributed to heterogeneity among schools. Second, how a central testing mechanism compares to a pupil's teacher when assessing a pupil's ability. Through these analyses, two distinguishing features that set the Dutch system apart from most other Western systems are evaluated. These are:

1. An early, rigid selection of pupils upon completion of primary school, which determines their level of secondary school education; and

2. The relative freedom granted to Dutch primary schools to choose their own educational approach.

This research will thus provide valuable, system-specific insights that should assist policymakers, practitioners and researchers in evaluating these two enduring sources of controversy within the Dutch system.

The need for and complexity of proper inference

The importance of an effective, well-functioning educational system is no longer debatable. Schooling has been found to strongly predict economic growth (e.g. early work by Barro, 1991; Mankiw, Romer & Weil, 1992; or more recently Hanushek and Woessmann, 2008) and, perhaps more indicative of its importance, education is represented in almost every general index of life quality (e.g. the Eurostat or UN indices). This makes the constant evaluation and improvement of an educational system's performance of central importance to any society. This necessity is perhaps even greater for school systems - like the one in the Netherlands - that are fully funded by the government, given the increased social and political dimensions.

There are various challenges that make the evaluation of an educational system tricky. The first challenge concerns educational research design in general. Pupils' test scores remain the mainstay indicators driving educational research, but they inadvertently give only a hollow picture of a child's true development. Furthermore, noise in measuring ability is problematic, whether this noise is due to a teacher's inconsistency in grading or unforeseen circumstances at the time of a test. Finally, new topics and theories are tested over time, complicating the comparability of a pupil's scores on two different tests. These problems are endogenous to educational research, but steps to minimise their consequences should be taken nonetheless.

The second challenge concerns the appropriate statistical techniques that should be used. To obtain valid inference on the plethora of aspects that characterise a school system, it must be recognised that pupils are organised into classes, which in turn aggregate into schools. Failure to acknowledge this structural fact can lead to the confounding of effects attributable to schools with effects attributable to pupils and vice versa. This complexity is often overlooked by policymakers, who use standard statistical techniques to drive decision-making.[1]

The final challenge regarding educational research is that most of its findings are generally only relevant for the specific case that is examined. Every system is unique in its characteristics and cultural aspects, making it almost impossible to transfer inference from one system to the other (as comprehensively discussed by Wimpelberg et al., 1989). Coincidentally, those characteristics that distinguish one system from another are often also its most controversial.

The Dutch case in point

The lack of homogeneity among Dutch primary schools and rigid selection practices have become sources of two growing concerns in the Netherlands.

First, there is suspicion that the relative freedom for schools to choose their own educational approach is causing large differences in school performances (Onderwijsinspectie, 2016). This has fostered a discussion on the need for stricter guidelines regarding class setup, school materials and other aspects of a school's didactic strategy (Trouw, 2017).

Second, the early selection of pupils has received criticism both within and outside of the Netherlands. Internationally, the timing of pupil selection in the Netherlands has been criticised as being too early, potentially leading to downward spirals and the waste of potential (OECD, 2007). Domestically, the debate has focussed mainly on the way in which this selection should take place: either through a central testing mechanism or the assessment of a pupil's teacher.[2] Closely related to this latter discussion is a school's decision to revise its initial assessment in the case of a better central test outcome. It has been claimed that the choice to revise is biased towards girls and pupils with a migrant background. This has become yet another source of discussion with respect to the Dutch practice of selection (Onderwijsinspectie, 2016).

[1] The annual report by the Onderwijsinspectie - the Dutch governmental institution tasked with reporting on the general quality within the educational system - drives most of the debate among policy-makers, but relies disproportionately on descriptive statistics.

[2] A recent law change in 2015 shifted importance from a pupil's performance on the central test towards a teacher's evaluation, making the former a 'second opinion' to the latter. Subsequent calls for revisions back to the old situation - among others by the Secretary of Education - illustrate that this decision remains controversial and discussion is ongoing (NRC, 2016; Volkskrant, 2016).

The current research

These two general sources of controversy are the inspiration for and focus of this research. The goal of this research is to verify whether the controversies mentioned above have legitimate grounds for discussion and, if so, whether concrete, fact-based insights or pointers for future research can be delivered to assist in resolving them.[3] On the first issue of curriculum freedom, this research will focus on attributing variability in a pupil's development to the pupil, class and school level. Additionally, covariates of this development will be examined. Finally, if sizeable differences between schools are found, this research will further examine whether these differences manifest themselves by (dis)favouring certain pupil groups over others. This yields the following research question and sub-questions:

Research focus I: How do school differences affect pupil development?

1. How much variability in pupil development can be attributed to the school, class and pupil level?
2. What are the covariates of pupil development?
3. Do differences among schools manifest themselves through (dis)favouring certain pupil groups?

As to the second issue of pupil selection, interest lies mainly in examining how pupil, class and school characteristics, including a pupil's grades, affect a pupil's score on a central test and his or her assessment of ability by a teacher. Closely related, the drivers of a mismatch between the two assessment types are evaluated. Finally, the subsequent choice to revise a mismatched assessment is examined. This yields the following research question and sub-questions:

Research focus II: How do pupil, class and school characteristics relate to the selection process into secondary education in the Netherlands?

1. How does the assessment of a pupil's ability by a central test compare to that by a teacher?
2. What drives the mismatch between a teacher's assessment and that by a central test?
3. What drives the choice to revise a mismatch of a pupil's assessment?

[3] Up to this point, discussion has failed to penetrate beyond simple descriptive statistics and gut feeling, as can be judged from the numerous opinions and commentaries that have been voiced (e.g. CBS, 2016; Onderwijsinspectie, 2016).

To answer these questions, an educational service firm has provided this research with a unique database of more than 5,000 pupils who finished their primary school career in the summer of 2017. This database consists of information on the pupil, class and school level and includes scores on various tests - including a central testing mechanism - and teacher assessments of ability. To complement this data, additional information has been obtained from public sources.

This thesis is structured as follows: Section 2 gives an introduction to the educational research field and the Dutch primary school system. In Section 3, the econometric methodology is discussed. Section 4 presents the data, some descriptive statistics and the model specifications, while Section 5 gives the results. Section 6 concludes with a summary and discussion of this research's implications and limitations.

2 Educational research and the Dutch system

The origin of empirical research within the educational branch can be traced back to the so-called 'Coleman Report', published by Coleman et al. in 1966. Commissioned by the United States Department of Health, Education and Welfare, its purpose was to give an in-depth analysis of the workings of the US educational system. A special policy interest was to determine whether structural inequality of opportunity was present in the system, specifically for low-income and minority groups. Other research interests included the influence of financial resources within a school on the development of its pupils or the effect of teacher quality, to name just a few.

To this end, the researchers obtained access to an immense database of more than 650,000 students, including - among others - their test scores, racial and social economic background and many other characteristics on the pupil, class and school level. The sheer size of the study and the breadth of its research goals make it one of the most ambitious research projects in the field to date. Its findings were, however, not without controversy. Both in its direct aftermath (e.g. Hanushek, 1971) and up to this very day (e.g. Rivkin, Hanushek & Kain, 2005; Downey, 2016), its conclusions continue to be debated and many of the report's initial questions remain unresolved.

One of these controversial findings was that factors outside of the classroom were found to be far more influential than the factors inside it. Preschool influences alone - like parental reading - accounted for more variability than school characteristics like teacher quality and school resources, a finding that was further corroborated by Ferguson et al. (1971) and Jencks et al. (1972). This result was contrary to the general belief that school characteristics were the most important drivers for the development of pupils and represented a conclusion that many found difficult to digest.

In retrospect, the Coleman Report's most important contribution to educational research is perhaps exactly the controversy which it sparked. Due to its controversial findings, the report became a catalyst for new empirical research on the educational system in the US. This research can be traced along two paths. On the one hand, the controversial implications of the report were further investigated with extensive follow-up research in different settings and contexts. On the other, it started an expansive debate regarding the validity of the statistical methods that were used in the report and within educational research in general.

A discussion of the latter is postponed to Section 3. Instead, this Section will proceed as follows: first, a general overview of the recurring themes within educational research is presented. After that, the Dutch system will be discussed, including some of the most important ways in which it differs from other systems. To conclude, a selection of the most recent research on the Dutch system is discussed and the current research will be placed within this field.

2.1 Educational research since the Coleman Report

Since the Coleman Report, an enormous body of work has emerged on the functioning of educational systems throughout the world. To give an indication of this prolific production, Haertel et al. estimated in 1983 that the number of studies performed on teacher effectiveness alone already exceeded 10,000. This production rate has only increased, as can be deduced from the various journals currently in print that are dedicated to educational research alone.[4]

In lieu of discussing a selection of research, this thesis presents a short overview of the main areas of analysis within the body of research. This decision was made simply because no overview could possibly be complete given the plethora of research available. The interested reader is referred to the more comprehensive attempts that are documented in the textbooks by Sammons (1995) or Teddlie (2000).

Educational research since the Coleman Report has focused on several focal areas. First of all, the sociological unit of the class has taken centre stage.

[4] To name just a few: The Journal of Educational Research, Educational Research, International Journal of Educational Research, American Educational Research Journal, European Educational Research Journal, etc.

Within that context, the focus has varied from the organisational structure (e.g. Harnischfeger and Wiley, 1980; Whitburn, 2001) and teaching styles and qualifications (e.g. Guarino, Hamilton, Lockwood & Rathbun, 2006; Boonen et al., 2014) to the demographic composition of the class (see the review by Thrupp, 1995). When expanding the sociological unit to the school level, research has examined the influence of size (e.g. Cotton, 1996; Humlum & Smith, 2015), the demographic composition of the school (Wu, 2013) and differences related to the denominations of schools (e.g. Raudenbush & Bryk, 1986; Opdenakker & Damme, 2006).

As pioneered by the Coleman Report, social economic status (SES) and minority background have remained another focal point (e.g. Driessen, 2002; Feinstein, 2003; Alivernini & Manganelli, 2015). White (1982) discusses more than 100 studies dealing explicitly with the impact of SES, concluding that there is little significant relationship with academic achievement. On the other hand, re-evaluations of his methods with more recent data and more advanced techniques have shown that there is indeed a significant influence (Sirin, 2005; Geiser & Santelices, 2007).

In an effort to further investigate the Coleman Report's controversial finding that school traits were of marginal importance, out-of-school effects like home-schooling (Melhuish et al., 2008) and various social circumstances at home have received attention (Fogelman & Goldstein, 1976). Some specific examples include the influence of home language (Van Laere et al., 2014) and the effect of parental involvement (e.g. Keith et al., 1998; Fan & Chen, 2001; Jeynes, 2005).

A final area concerns the phenomenon of pupil tracking, indicating the manner in which pupils are segmented into distinct educational programs or tracks. The implementation of pupil tracking varies from complete segmentation of pupils with different abilities to less stringent forms, for example where pupils remain in the same class but do different exercises. Ariga & Brunello (2007) find positive effects with respect to more segmented tracks, whilst Van Elk et al. (2011) and Hanushek & Woessmann (2006) find negative effects. To add to the confusion, Jakubowski (2009) and Pekkarinen (2009) find no effect.

A selection of important studies and their main conclusions can be found in Table 2.1. From this overview, it can be deduced that there are few clear-cut conclusions on the effect of certain characteristics of a system on its performance. These contradictory findings were once blamed on statistically invalid procedures (e.g. the re-evaluation by Alexander & Pallas, 1985), but since the development of more elaborate techniques in the eighties and a general convergence of methodologies, it seems that many findings within one research design simply do not translate beyond the system that is analysed.

Table 2.1: A selection of prior research findings

Focal area     | Significant, large effect                                                           | Significant, small effect                    | Insignificant or inconclusive
School effects | Reynolds & Creemers (1990); Tabberer (1994); Sammons et al. (1993)                  | Coleman et al. (1966); Jencks et al. (1972)  | Scheerens (1992); Creemers (1994)
Principal      | Gray (1990); United States Department of Education (1987)                          | Hallinger & Leithwood (1994)                 | Scheerens (1992); Bossert et al. (1982)
Demographics   | Coleman et al. (1966); Thrupp (1999); Wu (2013)                                     |                                              | Nuttall (1989)
Denomination   | Lee, Bryk & Smith (1993); Raudenbush & Bryk (1986)                                  | Opdenakker & Damme (2006)                    |
SES            | Sirin (2005); Geiser & Santelices (2007); Caldas & Bankston (1997); Driessen (2002) | Bryk et al. (1990)                           | White (1982)
Race           | Bankston & Caldas (1996)                                                            |                                              | Entwisle & Alexander (1992); Rivkin (2000)
School size    | Cotton (1996)                                                                       |                                              | Humlum & Smith (2015)
Pupil tracking | Ariga & Brunello (2007); Van Elk et al. (2011); Hanushek & Woessmann (2006)         |                                              | Jakubowski (2009); Pekkarinen (2009)

There is now widespread consensus that the contextual differences between educational systems are so significant that every research effort must be evaluated within its specific scope (Creemers, Reynolds & Swint, 1994; Fuller, 1994; Reynolds et al., 1994; Wimpelberg et al., 1989). This means that educational research, for all its abundance, can give many pointers to general tendencies, but must always be approached with caution: applying findings from one context to another can be reckless.

With this in mind, the Dutch primary school system will be scrutinised next, after which some of the most recent research on its workings will be discussed.

2.2 The Dutch primary school system

In the Netherlands, education is mandatory for all children from the age of 4 until the age of 16. This general period is divided into primary education, from the age of 4 through 12, and secondary education, which takes a minimum of four and a maximum of six years depending on the track a pupil enters. Both the primary and secondary systems are completely funded by the Dutch state.

Besides compulsory education, there are three distinct perspectives along which an educational system's characteristics can be described (Allmendinger, 1989; Kerckhoff, 2001). First, there is the level of 'stratification', indicating the manner in which children are selected and partitioned into separate 'tracks'.

Secondly, there is the degree of 'standardisation', which can be bisected into 'standardisation of input' and 'standardisation of output'. A high degree of standardised input indicates that the curriculum, methodologies and resources available to schools are dictated by a central authority. A high degree of standardisation of output indicates that there are many standardised tests and centrally organised rules / regulations regarding diplomas or certificates that may or may not be awarded.

Finally, there is the degree of 'relevance for the labour market'. This deals with the fitness of the system to provide the right types of future employees desired by the labour market. This latter aspect becomes more important later in a pupil's educational career and will be disregarded here, since this work deals solely with the primary school system.

The Dutch system strongly differentiates itself from other Western-style educational systems in two ways. First, there is very early and rigid selection, leading to a relatively high degree of stratification within the system. Secondly, there is a lot of freedom available to primary schools to develop their own educational philosophy and use differing techniques to achieve didactic and pedagogic goals. As such, there is little standardisation of input besides all schools being funded by the state.

Stratification in the Dutch system

When pupils move from the primary system to the secondary system, they are segmented based on an assessment of their general ability at that time. Based on this assessment, pupils are then allowed into one of the eight levels available in the Dutch secondary school system. This mechanism makes selection in the Dutch system one of the most rigorous and early among Western countries.[5]

A pupil's assessment is constructed based on a centralised testing mechanism, called a Centrale Eind Toets (CET), and through a teacher advice, which is an assessment given by a pupil's teacher. Currently, the latter is leading and the former merely serves as a second opinion. When the former indicates a higher level than that determined by the teacher, a pupil becomes eligible for a possible revision of his/her ability level.

The Dutch selection scheme is made all the more significant by the fact that the Dutch secondary system is stratified in nature. As such, the interaction between pupils from different levels is further minimised (Dronkers et al., 2014). This trend of stratification within stratification has furthermore increased in recent years (Elffers, Van de Werfhorst & Fischer, 2015).

The implications of such selection are significant. Research by Tolsma & Wolbers (2010) indicates that the first years of secondary school are most indicative of a pupil's later role in society, and Moore (2003) finds that a pupil's direct environment is one of the most important promoters of development. These findings indicate that wrongfully assigning pupils to levels below their potential can have severe consequences. Also, research by Bol et al. (2014), Korthals (2015) and Werfhorst et al. (2014) indicates that early segmentation increases inequality between pupils, and research by Dronkers (2015) and Skopek et al. (2016) shows that early selection stimulates differences among schools.

All of the above indicates that selection is one of the most fundamental aspects of the Dutch system, since this selection subsequently changes the nature of the rest of the system and thereby, implicitly, society as well.

Standardisation in the Dutch system

There is a high degree of standardisation of output present in the Dutch primary school system - as already indicated by the role of central tests in the selection scheme. In addition to the CET, many general tests are taken throughout a pupil's career. This centralised tendency has however become more or less the norm in most Western countries (Van der Werfhorst et al., 2015).[6]

[5] The Dutch system is matched only by Germany, Austria and Switzerland in this rigour (Prokic-Breuer & Dronkers, 2012). Other European countries like Sweden, France, England and Italy choose to apply selection much later, generally around the age of 15 or 16 (Hillmert & Jacob, 2010). Selection is almost non-existent in most Scandinavian countries and in the United States (Van der Werfhorst et al., 2015), where children of all 'levels' interact throughout their initial education.

With respect to input standardisation, on the other hand, the primary school system in the Netherlands is one of the least standardised among Western countries (Onderwijsinspectie, 2016). Primary schools are free to alter their methods and educational philosophies to their liking.[7] This relatively free nature of the primary system has often been touted as one of the strengths of the system, but in recent years there have been some worrisome signals. A recent report by the OI found large differences among schools in their ability to foster pupil development and pointed at the lack of standardisation of input as one of the main drivers of this result (Onderwijsinspectie, 2016).

To conclude, the Dutch system differentiates itself fundamentally from other systems. The strong focus on selection and the subsequent stratification of pupils, together with a very low standardisation of input, raises questions that are endogenous to the Dutch system and thus have to be answered through analysis from within it.

2.3 Prior literature on the Dutch system

The body of research on the Dutch system can generally be divided between studies identifying covariates of pupil development and studies identifying covariates of either CET scores or school advice.

Pupil development

Regarding the covariates of pupil development, Kloosterman et al. (2009) analyse the effect of parental education on the type of secondary education a pupil reaches. By analysing five cohorts between 1965 and 1999, they find that a parent's background has become a stronger determinant over time of how pupils develop. They conclude that higher parental education effectively gives pupils a head start with respect to others, which favours them disproportionately throughout their primary school careers. Multiple earlier studies had found similar effects in the Netherlands (Bakker and Cremers, 1994; Willemse, 1987) and also abroad (Erikson et al., 2005 for the UK; Erikson, 2007 for Sweden).

These findings indicate that, up to some degree, the Matthews effect may be present in the Netherlands. The Matthews effect represents the phenomenon where a school system disproportionately favours those who start ahead (Stanovich, 1986; Bast & Reitsma, 1997, 1998; Cain et al., 2004). Its presence in the Netherlands is explicitly researched within the Dutch system by Luyten and Ten Bruggencate (2011), who find that for pupils' reading skills, children with lower initial scores tend to fall further behind.

[6] A notable exception to this rule is the US, where there is almost no centralisation of testing. Instead, higher education institutions usually require out-of-school tests like the SATs or GRE.

[7] There are some rather broad minimum requirements in order to be eligible for government funding, but generally there are few didactic requirements.

Another correlate of growth that has been studied is class composition. Ohinata and Van Ours (2016) examine how the share of immigrant children affects the educational performance of native children. They find that the quality of the learning environment in the class decreases, but that there are no substantive negative effects on the performance of native children. Finally, Guldemond & Bosker (2009) perform analyses of growth rates between pupils of different SES. They find that rates of attainment differed significantly for low-SES pupils with respect to higher-SES pupils. Other findings include that non-native speakers struggle initially, but make up some of these deficiencies over time.

There has been little research on the explicit partitioning of pupil development over the various levels of the educational system, as this research pursues. Also, most of the above-mentioned research concerns either broad indications of development - for example the general educational level a pupil reaches - or a specific didactic area - for example reading proficiency. There has been little research that examines pupil development through actual test scores and across various didactic areas, as this one will attempt. Additionally, some of the covariates examined in this research will provide new insights for the Dutch system, for example the effect of school size, denomination and level of urbanity on development.

Pupil selection

There are three types of research on pupil selection in the Dutch system. The first type concerns covariates of central test scores, the second of school advices and the third treats both in some way.

In the research on central tests, it is found that children of Turkish and Moroccan descent - two large immigrant groups in the Netherlands - score relatively low (Driessen & Dekkers, 1997; Van de Werfhorst & Van Tubergen, 2007). Fleischmann & de Haas (2016) explain these differences by citing lower parental involvement among migrant children. Luyten & De Wolf (2011) perform a large-scale path analysis of the most popular CET: the CITO test. They analyse whether mean test scores are driven by the demographic composition of a school or by school-level characteristics. They conclude that the latter mainly drives differences between CET results. To conclude, Luyten & Bosker (2004) find that performance on cognitive tests - perhaps unsurprisingly - best predicts pupils' scores on the CET and that there is little reason to believe that social economic background or other demographic characteristics strongly influence these scores.[8]

In the research on school advice, Driessen (2006) performs an analysis of the covariates of school advice, including cognitive tests, indicators of behaviour in class, demographic characteristics and out-of-school influences. He finds that there is also little reason to believe that demographic characteristics play a strong role in determining school advice. The research by Driessen is one of the few that take into account the possibility that there may be school and class level effects present beyond the available data.

Concerning the relation between the two types, Driessen & Smeets (2007), De Boer et al. (2010) and Timmermans et al. (2013) combine both by analysing mismatches between the two.[9] The main focus of these analyses is to examine what predictive power remains for certain demographic characteristics after correcting for cognitive abilities (oftentimes, the CET is also included as an indicator). Any significant result would then constitute some sort of bias from the teacher. Since their results are largely similar, only the findings by Driessen & Smeets will be discussed briefly.

Driessen & Smeets find that high performing pupils, girls and pupils with highly educated parents generally obtain higher school advices given their cognitive abilities. Just like the other authors, they make use of linear regressions of the distance between teacher advice and cognitive abilities, simply numbering the eight advice types and using them as continuous variables in their analysis. The latter design choice is questionable, however, given the categorical nature of advice and the fact that most CETs do not linearly transpose into advice levels.[10]

To this author's knowledge, no extensive research has been done to relate pupils' performance during their school careers to both their central test scores and school advice. Up to this point, CETs have mainly been used as a predictor for teacher advice, rather than as a separate assessment of performance. Two important omissions in the current body of work are research on revisions of assessments after a mismatch has occurred and the influence of class and school effects on the two assessment forms.

A final general comment concerning the body of research on both selection and pupil development is the fact that most of the research described here - and in fact the majority of Dutch educational research - uses data that were collected in the nineties, with the exception of a couple of studies that obtained their data somewhere in the first decade of this millennium.[11] Although the system has remained relatively similar, it is debatable whether information dating back multiple decades is still relevant today.

[8] The question remains what the effects of demographic characteristics are on these cognitive scores in the first place.

[9] This research directly followed another report by the OI in 2007, citing worries of inequality in advice forming within the primary system (Driessen et al., 2007).

[10] For the most popular CET, the CITO test, the relationship

[11] Only Ohinata & Van Ours, Timmermans et al. and Fleischmann use data after 2005. The large majority resorts to data collected in the nineties.

To conclude, this research will add to the current body on the Dutch system in various ways. With respect to the research on pupil development, it will explicitly assess the drivers of development on the pupil, class and school level, enabling inference to be done on the effects of heterogeneity among schools. Furthermore, it will provide insights on new covariates of development on the pupil, class and school level not researched before within the Dutch system. This research also distinguishes itself by assessing actual in-school tests, while most other research concerns international testing schemes or other atypical tests that focus on a specific area of development. This research will also contribute to the knowledge concerning selection in the Dutch system by explicitly modelling both teacher advice and CET performance. Through this, differences between the two types can be evaluated. Also, the modelling of the drivers of a revision of advice is a new addition to the current research body, as are the application of non-linear techniques and the evaluation of class and school effects.

Perhaps most importantly, this research uses a database that is unique in its relevance, since the data enclosed within it pertain to the most recent cohort of pupils, who will leave the primary system in the summer of 2017. As such, it will give a unique view on the current state of affairs within the system. Before proceeding to the methodology underlying these research goals, a final note will be made on the design choices and general limitations of educational research.

2.4 Design choices in educational research

One of the perennial challenges to educational research is how to define the dependent variable, or in other words: what are we exactly looking to model? As beautifully and comprehensively described by Willet (1989), as much as there lies a shadow between the data and the statistical tools available - referencing the methodological discussions in the eighties - there equally lies a shadow between the data and reality.

Educational research generally focuses on a handful of standard metrics that indicate the change / development pupils go through while in the system. Essentially, this is the differential as measured by a pupil's performance on a certain test relative to another test. There are obviously many different stories to tell for both an increase and a decrease with respect to these types of measures.

Even if we can stipulate that higher scores are strictly better, there are two aspects that have to be taken into account in educational research. The first is whether there is "validity/equatability of the outcome measure over time" (Willet, 1989: 351; or similar concerns by Linn & Slinde (1977); Zimowski et al. (1982)). What is meant here is that every researcher within the educational branch should question whether available performance measures over time can reasonably be compared.

The second aspect is the measurement itself. In testing ability, inaccuracies will inadvertently arise. As described by Stanley (1971: 356): "Two sets of measurements of the same features of the same individuals will never exactly duplicate each other." Stanley touches on the intrinsic variability in testing, whether caused by errors made by the teacher while grading, luck or the fickle nature of the mind.

Both problems are endogenous to educational research and will not be easily resolved. By definition, development implies continuously learning new topics and theories, and thus no two scores will ever measure the same thing. As such, there is little hope that measurement error can be averaged out by obtaining multiple observations per test, nor will there ever be true validity / equatability. Nonetheless, both should be assessed, as will be done for this research in Section 4, when the data is presented.

First, though, the statistical complexities of educational research will be described and econometric models proposed to answer the various questions posed by this research.

3 Econometric methodology

The two focus areas of this research concern different aspects of the Dutch system. Coincidentally, the two areas also differ in the types of models that should be applied. For our analysis of development, continuous dependent variables allow for rather straightforward linear models. Our analysis of pupil advice, however, concerns categorical data and will require techniques that allow for a dependent variable with a binomial or multinomial outcome. Both analyses will furthermore require methods that take into account the hierarchical nature of the data.

This Section will start with discussing the need for hierarchical modelling. After that, a hierarchical model is proposed for continuous outcomes, coinciding with the first research focus. Then, this model will be extended to the binomial and multinomial cases, corresponding to the second research focus. Where necessary, some extra statistical theory will be touched upon.

3.1 Statistical complexities in educational research

One of the biggest obstacles to modelling educational systems has been to properly account for their nested structure. Pupils have characteristics on the individual level, but are also part of a classroom. These classrooms in turn have certain characteristics of their own, like size or teacher, but are themselves also part of schools, once again with certain traits and characteristics.

Applying classical techniques either forces the researcher to expand all information on the pupil level - by adding class and school characteristics as personal regressors - or, alternatively, to aggregate over all lower-level observations and examine classes or schools as a whole. Aggregation is obviously unattractive, since it forces the researcher to relinquish information on the aggregated levels (Aitkin & Longford, 1986). On the other hand, expanding all higher-level information on the pupil level and proceeding with standard statistical techniques like Ordinary Least Squares (OLS) would make wrongful assumptions on the data, for example independence between observations. Such an approach generally leads to two problems.

First, if pupils in the same class or school share characteristics with each other that affect our variable of interest but are not grasped by the available data, the assumption of independence would pose classical omitted variable problems or 'aggregation bias' (Burstein, 1980). This type of specification error will generally result in misspecification and inconsistent estimators (Holt, Scott & Ewings, 1980). Second, when variation resulting from various levels in the model is 'collapsed' on a single level, it becomes attributed to the level in question. Statistical tests will then generally provide incorrect indications of significance (Aitkin et al., 1981).
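To make the second problem concrete, consider the following small simulation (an illustrative sketch only, not drawn from the thesis data or code): a class-level regressor with no true effect, combined with a shared class effect in the errors, makes naive OLS standard errors far too small, while cluster-robust standard errors account for the grouping.

```python
# Sketch: naive OLS overstates significance under within-class correlation.
# All values are illustrative; this example is not part of the thesis.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
J, n = 100, 20                              # 100 classes of 20 pupils
class_id = np.repeat(np.arange(J), n)

z = rng.normal(size=J)                      # class-level regressor, true effect = 0
x = z[class_id]                             # expanded to the pupil level
eta = rng.normal(size=J)                    # shared class effect -> correlated errors
y = eta[class_id] + rng.normal(size=J * n)

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()                  # treats all 2,000 pupils as independent
robust = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": class_id})
print(naive.bse[1], robust.bse[1])          # the naive SE is several times too small
```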

Various solutions have been proposed to address multilevel heterogeneity within clustered data. Early examples are the random regression models that have been developed for both continuous (Bock, 1983) and dichotomous data (Gibbons & Bock, 1987). These models are similar to Analysis of Variance (ANOVA) models, where variation on the various levels is accounted for by adding level-specific error terms. These types of models, however, do not enable the researcher to add variables on the higher levels of the system that could otherwise serve as explanatory variables for the level-specific variability. In order to account not only for variation on the higher levels, but also to do inference on the drivers of this variability, an explicit multilevel approach is required.

Models that allow for a multilevel approach go by various names, ranging from multilevel models (Mason et al., 1983) and random coefficient models (Rosenberg, 1973) to hierarchical linear models (Strenio, Weisberg & Bryk, 1983), but all tackle the problem in essentially the same way. A comprehensive overview of the conceptual development of these models is given in the overview by Raudenbush & Bryk (1988). Over time, the models have stayed conceptually the same, but the development of numerical optimisation techniques has enabled more complex multilevel models to be estimated. An extensive treatment of this development is given by Browne and Draper (2006). In order to avoid confusing the reader with the different terminology, this research will proceed with calling these general types of models 'hierarchical models'.

Within the specific context of educational research, the importance of hierarchical models has been emphasised early on by Raudenbush (1986). Since then, it has become clear that ignoring the nested structure of an educational system has produced erroneous results in the past and quite possibly resulted in erroneous policy as well. Cronbach and Webb (1975) re-analyse a treatment effect with and without taking into account hierarchical effects and observe sign changes. Similar significant re-evaluations have been done by Willms (1984) and Alexander and Pallas (1985).

In what follows, hierarchical models for both designated research focuses will be proposed. Both require a different type of basic model, which will be discussed. Estimation techniques and statistical inference are discussed accordingly.

3.2 Pupil development

Research focus: How do school differences affect pupil development?

In the analysis of pupil development, pupils' test scores are used. Importantly, there are no further indications of ability available - for example IQ tests - beyond these test scores. As such, this omission in the data will have to be accounted for through modelling. Scores are available in various areas of didactic development, but the models proposed will be applicable to each area in the same way. Further discussion of the data is postponed to Section 4.

Within this research focus, three analyses are proposed. The first approaches the data from a longitudinal perspective and uses a pupil's scores as observations over time. This allows pupil-specific random or fixed effects to be added to the analysis, solving the absence of ability indicators under the assumption that they are time-constant. This analysis will serve to give a broad indication of the covariates of pupil scores. The second, main analysis will concern pupil development by evaluating differences in scores over time. This will equally solve the absence of ability indicators, under the same assumption. Finally, the third analysis focuses on a pupil's first-year scores. Since development is intrinsically linked to a pupil's starting point, an analysis of initial scores will complement the interpretation of development. Since initial ability itself is examined, there is no need - or clear way - to adjust for pupil-specific ability.

The three analyses are straightforward in the sense that all dependent variables are continuous and Hierarchical Linear Models (HLM) will suffice within this scope. These types of models explicitly account for every level in the educational system and will provide estimates of the variability contributed by the various levels (sub-question 1), enable inference to be done on the effect on development of characteristics on the various levels (sub-question 2) and allow these effects to be different between schools (sub-question 3).

3.2.1 Hierarchical linear model

Ignoring all higher-level heterogeneity, we could model pupil development as follows:

Level 1 (pupil): $y_i = X_i\beta + \epsilon_i$   (1)

where $y_i$ could for example represent some percentile gain over a certain time period for pupil $i$, $X_i$ would contain all available information on the pupil, including school and class characteristics, and we would assume $\epsilon_i$ to have mean zero. Estimating this model with OLS assumes pupils are fully independent of one another, implying that there is no reason to believe that two pupils in the same class would be correlated beyond the available regressors. Given that this is an unlikely assumption, we require a way to incorporate the hierarchical nature of the data.

A straightforward solution to account for school or class effects would be to include dummy variables for all the different groups a pupil could be part of. This will make the model less tractable (Raudenbush & Bryk, 1988), but more importantly it will only add variability to the intercept of the model; all other regressors will still be required to coincide.[12] A hierarchical approach allows these to differ per class or school. More importantly, a hierarchical approach will not only allow the researcher to account for level-specific variance, but to explain it as well.

The potential drawback of hierarchical models is that we add potentially unnecessary complexity and assumptions to the model. Note, however, that if there is no added value of adding a hierarchical structure, the model simply reduces to a classical linear regression model (Gelman & Hill, 2006). Statistical tests can furthermore indicate whether a hierarchy is called for.

Within the context of hierarchical models, this research will follow the notation of Raudenbush & Bryk (2002), since it gives an intuitive overview of the relationship between the various levels within the model. This notation yields the following two-level HLM for an educational system, where for notational simplicity it is assumed that $X_{ij}$ and $Z_j$ include an intercept and a single varying regressor only:

Level 1 (pupil): $y_{ij} = \beta_{0j} + X_{ij}\beta_{1j} + \epsilon_{ij}$   (2a)
Level 2 (class): $\beta_{0j} = \gamma_{00} + Z_j\gamma_{01} + \eta_{0j}$   (2b)
                 $\beta_{1j} = \gamma_{10} + Z_j\gamma_{11} + \eta_{1j}$   (2c)

where $X_{ij}$ is an individual-specific regressor and $Z_j$ is a class-specific regressor. We further assume $\epsilon_{ij} \sim N(0, \sigma^2)$, $\eta_j = (\eta_{0j}, \eta_{1j})' \sim N(\mathbf{0}, T_\eta)$, and $\mathrm{Cov}(\eta_{0j}, \epsilon_{ij}) = \mathrm{Cov}(\eta_{1j}, \epsilon_{ij}) = 0$, meaning that the error terms are uncorrelated between levels.

Finally, we allow the covariance matrix $T_\eta$ to be non-diagonal. This is important because the various elements in $\beta_j$ come 'together' in the single-level model and thus generally cannot be assumed to be independent from one another.[13] From this dependence arises the necessity to make distributional assumptions on the higher-level error terms that allow for this correlation.[14]

[12] With respect to the tractability of the model, a database with 100 schools with an average of 2 classes per year would already require 200 dummies to account for varying intercepts only. Henderson (1982) discusses the further loss of efficiency when adding cross-terms to evaluate cluster effects on slopes. The choice for dummy analysis over hierarchical modelling also depends on the scope of the research. Within this research, interest does not lie in the coefficients for one specific school, but instead in the general workings of the system. When specific schools are of interest, a dummy variable approach may be more appropriate (Rabe-Hesketh & Skrondal, 2012).

[13] Demeaning the regressor $X_{ij}$ will have a significant impact on the correlation between $\beta_0$ and $\beta_1$. As such, allowing correlation between the two error terms in the separate models for $\beta_0$ and $\beta_1$ is necessary. For an empirical example, see Skrondal

The hierarchical model specified in equations (2a)-(2c) allows both the intercept and the slope to differ between classes. Generally, the matrix $X_{ij}$ in (2a) is allowed to have $k$ regressors, all of which can be chosen to vary by group. Not all coefficients on the pupil level are required to vary, and conditions on $T_\eta$ or the structure of $Z_j$ enable the researcher to specify for each parameter whether it should be constant, nonrandomly varying or randomly varying.[15] Rewriting equations (2a)-(2c) yields the single equation:

$$y_{ij} = \beta_{0j} + X_{ij}\beta_{1j} + \epsilon_{ij} = \gamma_{00} + Z_j\gamma_{01} + \eta_{0j} + X_{ij}\gamma_{10} + X_{ij}Z_j\gamma_{11} + X_{ij}\eta_{1j} + \epsilon_{ij} \quad (3)$$

As becomes clear from this notation, pupil characteristics, $X_{ij}$, interact with coefficients that are dependent on the class characteristics, $Z_j$, and the error terms on the class level, $\eta_j$. Variance thus comes from two sources in the model, both layers contributing to the total variance. The model in this form can be dissected into a 'fixed' part and a 'random' part:

$$y_{ij} = \underbrace{\gamma_{00} + Z_j\gamma_{01} + X_{ij}\gamma_{10} + X_{ij}Z_j\gamma_{11}}_{\text{fixed}} + \underbrace{\eta_{0j} + X_{ij}\eta_{1j} + \epsilon_{ij}}_{\text{random}} \quad (4)$$
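To make this decomposition concrete, the following sketch simulates data from the model in (4); all parameter values are made up for illustration and are not taken from the thesis:

```python
# Sketch: simulate the two-level HLM of eq. (4); parameter values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
J, n = 50, 25                                   # 50 classes of 25 pupils

g00, g01, g10, g11 = 1.0, 0.5, 2.0, -0.3        # fixed part: the gamma coefficients
T_eta = np.array([[0.8, 0.2],                   # covariance of (eta_0j, eta_1j),
                  [0.2, 0.4]])                  # deliberately non-diagonal
sigma = 1.0                                     # pupil-level standard deviation

Z = rng.normal(size=J)                          # class regressor Z_j
eta = rng.multivariate_normal(np.zeros(2), T_eta, size=J)

cls = np.repeat(np.arange(J), n)                # class index per pupil
X = rng.normal(size=J * n)                      # pupil regressor X_ij

fixed = g00 + g01 * Z[cls] + g10 * X + g11 * X * Z[cls]
random = eta[cls, 0] + X * eta[cls, 1] + rng.normal(scale=sigma, size=J * n)
y = fixed + random                              # eq. (4): fixed part plus random part
```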

Since the partitioning of variance is one of our main focuses of interest, defining intercepts only on the levels reduces the HLM to a classical ANOVA (Gelman & Hill, 2006):

$$y_{ij} = \underbrace{\gamma_{00}}_{\text{fixed}} + \underbrace{\eta_{0j} + \epsilon_{ij}}_{\text{random}} \quad (5)$$

The estimate for $\gamma_{00}$ is simply the grand mean over all classes $J$, and $\eta_{0j}$ represents heterogeneity among classes. Pupil variability is then given by $\epsilon_{ij}$ around the class mean $\gamma_{00} + \eta_{0j}$. Formally, total variance in the model is given by:[16]

$$\mathrm{Var}(y_{ij}) = E\{(y_{ij} - \hat{y}_{ij})^2\} = E(\eta_{0j}^2) + 2E(\eta_{0j}\epsilon_{ij}) + E(\epsilon_{ij}^2) = T_{\eta_0} + \sigma^2 \quad (6)$$

[14] This does not yet require the assumption of normality of the error terms, but estimation by maximum likelihood, as discussed later, will. An alternative to the distributional assumptions for allowing correlation could be a factor loading approach (e.g. as applied by Bratti & Miranda, 2011).

[15] The functional relationship of the coefficient $\beta_{lj}$ of the $l$th regressor is as follows: $\beta_{lj} = Z_j\gamma_l + \eta_{lj}$. If $Z_j$ were to contain only an intercept and we force $\eta_{lj}$ to be equal to zero through constraints on $T_\eta$, we would obtain $\beta_{lj} = \gamma_{l0}$, the general effect of the $l$th regressor in the sample. Adding regressors other than the constant term in $Z_j$ would yield $\beta_{lj} = Z_j\gamma_l$, a nonrandomly varying parameter. Finally, allowing $\eta_{lj}$ to be stochastic would yield $\beta_{lj} = Z_j\gamma_l + \eta_{lj}$, a randomly varying parameter.

where the last equality follows from our assumption of independence between levels. Generally, for a random-intercept specification, we have:

$$\mathrm{Var}(y_{ij}) = T_{\eta_0} + \sigma^2 \quad (7a)$$
$$\mathrm{Cov}(y_{ij}, y_{i'j}) = T_{\eta_0} \quad \text{if } i' \neq i \quad (7b)$$
$$\mathrm{Cov}(y_{ij}, y_{ij'}) = 0 \quad \text{if } j' \neq j \quad (7c)$$
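The covariance structure in (7a)-(7c) can be written out explicitly as a block-diagonal matrix (this is elaborated in footnote [16] below); a minimal numpy sketch with illustrative values:

```python
# Sketch: the block-diagonal error covariance implied by (7a)-(7c).
import numpy as np
from scipy.linalg import block_diag

T_eta0, sigma2 = 0.8, 1.0            # between-class and pupil-level variances (made up)

def sigma_j(n):
    # Within-class block: T_eta0 + sigma^2 on the diagonal (7a),
    # T_eta0 between two pupils of the same class (7b).
    return T_eta0 * np.ones((n, n)) + sigma2 * np.eye(n)

Sigma = block_diag(*[sigma_j(n) for n in (3, 2, 4)])   # three classes of size 3, 2, 4
print(Sigma)   # zero blocks off the diagonal: pupils in different classes, eq. (7c)
```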

Based on this, we can derive the so-called intraclass correlation - the proportion of variance that can be attributed to classes:

$$\rho = \frac{\mathrm{Var}(\eta_{0j})}{\mathrm{Var}(\eta_{0j}) + \mathrm{Var}(\epsilon_{ij})} = \frac{T_{\eta_0}}{T_{\eta_0} + \sigma^2} \quad (8)$$
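As a sketch of how this quantity can be estimated in practice - here with synthetic data standing in for the confidential pupil scores, and using statsmodels rather than the thesis's own estimation technique (discussed in Section 3.2.3):

```python
# Sketch: estimate the intraclass correlation of eq. (8) with a random-intercept model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
J, n = 50, 25
df = pd.DataFrame({
    "class_id": np.repeat(np.arange(J), n),
    "score": np.repeat(rng.normal(scale=0.9, size=J), n)   # class effect eta_0j
             + rng.normal(size=J * n),                      # pupil noise eps_ij
})

result = smf.mixedlm("score ~ 1", df, groups=df["class_id"]).fit()  # eq. (5)
tau = result.cov_re.iloc[0, 0]        # estimated T_eta0 (between-class variance)
sigma2 = result.scale                 # estimated sigma^2 (pupil-level variance)
print("ICC:", tau / (tau + sigma2))   # should be near 0.81 / 1.81, about 0.45
```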

[16] The model can be approached as a single-equation model with correlated error terms and an $n \times n$ covariance matrix $\Sigma$. Stacking observations by class, the covariance matrix of the error term is block diagonal:

$$\Sigma = \begin{pmatrix} \Sigma_1 & \cdots & \emptyset \\ \vdots & \ddots & \vdots \\ \emptyset & \cdots & \Sigma_J \end{pmatrix}, \qquad \Sigma_j = \begin{pmatrix} T_\eta + \sigma^2 & T_\eta & \cdots & T_\eta \\ T_\eta & T_\eta + \sigma^2 & \cdots & T_\eta \\ \vdots & \vdots & \ddots & \vdots \\ T_\eta & T_\eta & \cdots & T_\eta + \sigma^2 \end{pmatrix}$$

where $\emptyset$ denotes a matrix of zeros and $\Sigma_j$ is of dimension $n_j \times n_j$.

To complete the hierarchical model for an educational system, the final level that should be added is that of the school. Expanding with a third level follows naturally from equations (2a)-(2c). Intercept and slope coefficients on the class level are allowed to vary at the school level, where only a single school regressor $W_k$ is assumed for simplicity:

Level 1 (pupil): $y_{ijk} = \beta_{0jk} + X_{ijk}\beta_{1jk} + \epsilon_{ijk}$   (9a)
Level 2 (class): $\beta_{0jk} = \gamma_{00k} + Z_{jk}\gamma_{01k} + \eta_{0jk}$   (9b)
                 $\beta_{1jk} = \gamma_{10k} + Z_{jk}\gamma_{11k} + \eta_{1jk}$   (9c)
Level 3 (school): $\gamma_{00k} = \theta_{000} + W_k\theta_{001} + \zeta_{00k}$   (9d)
                  $\gamma_{01k} = \theta_{010} + W_k\theta_{011} + \zeta_{01k}$   (9e)
                  $\gamma_{10k} = \theta_{100} + W_k\theta_{101} + \zeta_{10k}$   (9f)
                  $\gamma_{11k} = \theta_{110} + W_k\theta_{111} + \zeta_{11k}$   (9g)

The regressor $W_k$ now denotes a school characteristic, while $Z_{jk}$ denotes a class characteristic and $X_{ijk}$ a pupil characteristic. It is assumed that $\epsilon_{ijk} \sim N(0, \sigma^2)$, $\eta_{jk} \sim N(\mathbf{0}, T_\eta)$ and $\zeta_k \sim N(\mathbf{0}, T_\zeta)$.

Once again the single-equation formulation can be obtained by substitution of equations (9a)-(9g):

$$y_{ijk} = \theta_{000} + W_k\theta_{001} + \zeta_{00k} + (\theta_{010} + W_k\theta_{011} + \zeta_{01k})Z_{jk} + (\theta_{100} + W_k\theta_{101} + \zeta_{10k})X_{ijk} + (\theta_{110} + W_k\theta_{111} + \zeta_{11k})X_{ijk}Z_{jk} + \eta_{0jk} + X_{ijk}\eta_{1jk} + \epsilon_{ijk} \quad (10)$$

And rearranging gives:

$$y_{ijk} = \underbrace{\theta_{000} + W_k\theta_{001} + Z_{jk}(\theta_{010} + W_k\theta_{011})}_{\text{fixed effect } \beta_{0jk}} + \underbrace{X_{ijk}\big(\theta_{100} + W_k\theta_{101} + Z_{jk}(\theta_{110} + W_k\theta_{111})\big)}_{\text{fixed effect } \beta_{1jk}} + \underbrace{\epsilon_{ijk}}_{\text{random effect, pupil level}} + \underbrace{\eta_{0jk} + X_{ijk}\eta_{1jk}}_{\text{random effect, class level}} + \underbrace{\zeta_{00k} + Z_{jk}\zeta_{01k} + X_{ijk}\zeta_{10k} + X_{ijk}Z_{jk}\zeta_{11k}}_{\text{random effect, school level}} \quad (11)$$

Calculating variances for the random-intercept case follows analogously to the two-level case in (7a) - (7c). These are given formally in the Appendix.
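To make the data-generating process implied by (11) concrete, the following simulation sketch draws school, class and pupil level disturbances and composes the outcome. For brevity only random intercepts are simulated (the slope disturbances η_1jk, ζ_01k, ζ_10k and ζ_11k are set to zero), and all parameter values are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)
    K, J, N = 50, 4, 25                 # schools, classes per school, pupils per class
    theta_000, theta_100 = 535.0, 2.0   # fixed intercept and pupil-level slope
    T_zeta, T_eta, sigma2 = 2.0, 3.0, 10.0

    y = np.empty((K, J, N))
    for k in range(K):
        zeta_00k = rng.normal(0.0, np.sqrt(T_zeta))         # school effect
        for j in range(J):
            eta_0jk = rng.normal(0.0, np.sqrt(T_eta))       # class effect
            x = rng.normal(size=N)                          # pupil regressor X_ijk
            eps = rng.normal(0.0, np.sqrt(sigma2), size=N)  # pupil noise
            y[k, j] = theta_000 + theta_100 * x + zeta_00k + eta_0jk + eps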

3.2.2 Panel extension of the HLM

The extension to the panel setting is straightforward. Define dependent variable y_it as pupil i's score at time t. Let t = 1, . . . , T be the time index, where T is the final time point available. We could then specify the following non-hierarchic model:

    y_it = X_it β + ε_it                                             (12)

Where X_it would contain regressors on the pupil, class and school levels.17 Note that ε_it accounts for noise over time in this specification. To account for time-constant individual specific heterogeneity like ability, we can extend (12) with a pupil-specific term α_i, yielding:

    y_it = α_i + X_it β + ε_it                                       (13)

Dealing with the term α_i can be done in two ways: specifying a fixed effects model and differencing out the α_i, or specifying a random effects model, where α_i is assumed uncorrelated with the regressors X_it. The former is perhaps the less restrictive route, since a fixed effects model allows for potential correlation between X_it and the individual effect α_i; however, differencing the model in (13) restricts inference to time-variant regressors only (Mundlak, 1978). This means we lose inference on many of our variables of interest, which are generally time-constant.18

Specifying a random effects model will allow inference to be done on the time-constant regressors. For convenience, the specification in (13) is rewritten to explicitly account for the time-varying regressors X_it, their time averages X̄_i and the time-constant regressors Z_i:19

    y_it = α_i + X_it β + X̄_i δ + Z_i γ + ε_it                       (14)

The model in (14) can be extended to account for additional hierarchy by simply adding the relevant levels, as was presented in (9a) - (9g).

17 Time regressors could be added if available and of interest.

18 IV methods could be employed to recover consistent estimates for the coefficients of these time-invariant regressors (Hausman & Taylor, 1981; Amemiya & MaCurdy, 1986; Breusch et al., 1989). Finding both relevant and valid instruments is often difficult, however.

19 Following Mundlak (1978), only independence between Z_i and α_i is needed for consistency. By including the time averages X̄_i of X_it, any correlation between X_it and α_i is completely absorbed. This advantage of random effects models is surprisingly underdocumented in the literature, although applications can be found in sociological and biological analyses (Burnett & Farkas, 2009; Kaufman, 1993).


In the absence of regressors on the time level, we can simply add the following level to the initial three-level specification:

    Level 0 (time):  y_tijk = c_ijk + ξ_tijk
    Level 1 (pupil): c_ijk = X_ijk β + ε_ijk
            ⋮                                                        (15)

Where ξ_tijk ∼ N(0, T_ξ) and we only add the constant c_ijk to the equation, allowed to vary on the pupil level. T_ξ is then a scalar representation of measurement noise in testing, whilst ε_ijk represents the random effects α_i in (14). The single equation formulation and the partitioning into random and fixed effects follow analogously.
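A minimal sketch of the Mundlak device in (14): the pupil-specific time averages of the time-varying regressors are added as extra columns before fitting a random effects model. The column names (y, x, z, pupil_id) and the DataFrame df are hypothetical placeholders, not the dataset's actual variables.

    import statsmodels.formula.api as smf

    # Mundlak device: add the pupil-specific time average of the time-varying
    # regressor x so that correlation between x_it and alpha_i is absorbed.
    df["x_bar"] = df.groupby("pupil_id")["x"].transform("mean")
    re_fit = smf.mixedlm("y ~ x + x_bar + z", data=df,
                         groups="pupil_id").fit(reml=False)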

Before discussing the estimation technique for these types of models, note that the three-level model - and more generally all types of hierarchical models - can be written in the following form, used by Laird & Ware (1982):

    Y_jk = X_jk β + Z(2)_jk u(2)_jk + Z(3)_jk u(3)_k + ε_jk          (16)

In this notation, Y_jk are the stacked outcome variables of all pupils in class j in school k. Furthermore, X_jk governs the fixed effects part of the class, Z(2)_jk the random effects part on the class level and Z(3)_jk that on the school level, consisting of the relevant cross terms of pupil, class and school regressors as implied by (11). u(2)_jk and u(3)_k are vectors containing the class and school level random effects, respectively.

As levels are added, so are random effect design matrices Z(l), so rewriting (16) as a two- or four-level model is straightforward. This notation will prove useful when discussing estimation and the non-linear models that follow.
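To make the notation concrete, the toy sketch below builds the design blocks for a single class of three pupils, using arbitrary numbers, one pupil regressor and no class regressor Z_jk; in this simple setting the random effects designs happen to coincide with the fixed effects design.

    import numpy as np

    x = np.array([0.2, -1.0, 0.7])            # pupil regressor for n_jk = 3 pupils
    X_jk = np.column_stack([np.ones(3), x])   # fixed effects design (intercept, x)
    Z2_jk = X_jk.copy()                       # class level: eta_0jk + x * eta_1jk
    Z3_jk = X_jk.copy()                       # school level random intercept and slope
    # Stacking the Y_jk, X_jk, Z2_jk and Z3_jk blocks over all classes and
    # schools yields the form in (16).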

3.2.3 Estimation technique

Generally, hierarchical models are estimated using Maximum Likelihood-based (ML) methods - either Full Maximum Likelihood (FML) or Restricted Maximum Likelihood (REML) - or using a Bayesian approach, with Markov Chain Monte Carlo being the most popular technique. These techniques require assumptions on the distributions of the error terms.

In practice, a zero conditional mean assumption is all that is required for estimation results to be consistent when errors are additive. As such, generalised least squares (GLS) would also yield consistent results, require fewer assumptions on the error terms, and demand considerably less computational power. It has been shown, however, that the standard errors given by GLS can become rather inaccurate within the context of hierarchical models (Hox, 1998; Kreft, 1996). Also, it becomes less straightforward to account for correlation between varying coefficients, and the researcher relinquishes the handy feature of the likelihood-ratio test to evaluate model specifications. Most importantly, though, a zero-mean assumption will only be sufficient in a linear model, so restrictive distributional assumptions will have to be made further down the line within this research anyway. These arguments motivate the choice for ML or Bayesian approaches over GLS.

In general, ML and Bayesian approaches only have closed-form solutions in linear models for balanced samples - and assuming normality - meaning that every school and every class should have the same number of pupils (Raudenbush & Bryk, 2002). Generally, this is not the case and computational methods must be employed. As such, the computational efficiency of the problem at hand should be taken into account when choosing between ML and Bayesian approaches.

The literature is both expansive and inconclusive on a preference among the methods available. Goldstein (1986) and Longford (1987) make cases for full maximum likelihood estimation of hierarchical models, whilst Mason (1983) and Bryk & Raudenbush (1986) focus on restricted maximum likelihood techniques. The treatment of HLMs from a Bayesian perspective was pioneered by Hill (1965), and a seminal paper performing Bayesian analyses of HLMs can be found in Efron and Morris (1975). Recent textbooks on hierarchical models treat both techniques equally extensively (Gelman & Hill, 2006; Rabe-Hesketh & Skrondal, 2012; Snijders & Bosker, 2012).

An extensive discussion and comparison of the performance of the two techniques can be found in Browne & Draper (2006), who conclude that there are differences in both point estimates and standard errors between Bayesian and ML techniques. They also find that a preference for either technique depends greatly on the problem at hand, with no clear delineations. For most two-level models, both techniques are found to be consistent, and the automatic nature in which ML methods achieve consistency is preferred.20 For three-level models, there is more ambiguity. The specific non-linear case of a logistic model analysed by Browne and Draper performed poorly under the ML approach relative to the Bayesian approach. Goldstein (2011), however, proposes that with sufficient variability across the various levels both techniques should be equivalent to one another, given that the distributional assumptions are appropriate. In this research, an ML approach is taken due to its computational efficiency with respect to Bayesian analysis (Browne and Draper, 2006). Recent computational developments enable the researcher to evaluate the likelihoods to arbitrary precision, minimising bias arising from the computational method involved.

20 There are no choices to be made on prior distributions.

Given the choice for ML, one is then left with the choice between Restricted Maximum Likelihood (REML) and Full Maximum Likelihood (FML).

The difference between the two is subtle. The former constructs likelihood formulas based on linear combinations of the fixed effects, which enables the researcher to optimise the likelihood conditional on the fixed effects. Additionally, REML corrects for the degrees of freedom used in estimation. With sufficiently large datasets, the difference between the two becomes negligible. However, FML concerns actual calculation of the likelihood and thus enables the researcher to use LR tests to evaluate model specifications, like the necessity of a hierarchical model in the first place (Cameron & Trivedi, 2005). In this research, FML will be performed unless specified otherwise.
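As a brief practical illustration (a sketch, assuming the same hypothetical df, score, ses and class_id as above): most mixed model software exposes both criteria through a single flag, and only FML log-likelihoods should be fed into an LR test comparing specifications.

    import statsmodels.formula.api as smf

    fml  = smf.mixedlm("score ~ ses", data=df, groups="class_id").fit(reml=False)
    reml = smf.mixedlm("score ~ ses", data=df, groups="class_id").fit(reml=True)
    # fml.llf values are comparable across nested specifications in an LR test;
    # REML log-likelihoods are not when the fixed effects parts differ.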

For notational simplicity, consider the two-level model in the notation of (16). The likelihood for a single class j is then as follows:

    L_j(Y_j, ω) = ∏_{i=1}^{n_j} f_1(Y_j | u_j, ω) f_2(u_j | ω)       (17)

Where ω collects all parameters of interest, including the coefficients and the (co)variance parameters of the random effects, and f_1(·) and f_2(·) are multivariate normal densities. The likelihood is thus constructed out of the probability of observing Y_j given errors u_j and parameters ω, times the probability of observing u_j given parameters ω.

Taking the logarithm yields the following log-likelihood:

    L_j(Y_j, ω) = Σ_{i=1}^{n_j} ln[f_1(Y_j | u_j, ω) f_2(u_j | ω)]   (18)

The total log-likelihood is then the sum over all J classes, given by:

    L(Y, ω) = Σ_{j=1}^{J} Σ_{i=1}^{n_j} ln[f_1(Y_j | u_j, ω) f_2(u_j | ω)]   (19)

Numerical techniques will have to be used to evaluate the likelihood in (19) in the absence of a balanced sample. The most common technique, which is applied in this research, is the EM algorithm (Dempster et al., 1977). This technique treats the random effects as missing data and iteratively updates the covariance matrix of the random effects and the fixed effects in the model. The algorithm and the explicit form of the likelihood in (19) are elaborated upon in the Appendix.
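The following is a minimal sketch of this EM scheme for the simplest case, a two-level random intercept model y_ij = X_ij β + u_j + ε_ij with scalar T; it is illustrative only and omits the convergence check and the refinements discussed in the Appendix.

    import numpy as np

    def em_random_intercept(y_by_class, X_by_class, n_iter=200):
        # EM for y_ij = X_ij @ beta + u_j + eps_ij with u_j ~ N(0, T) and
        # eps_ij ~ N(0, s2); unbalanced class sizes are allowed.
        p = X_by_class[0].shape[1]
        beta, T, s2 = np.zeros(p), 1.0, 1.0
        J = len(y_by_class)
        N = sum(len(y) for y in y_by_class)
        for _ in range(n_iter):
            # E-step: posterior variance v_j and mean m_j of each class effect u_j
            vs = [1.0 / (1.0 / T + len(y) / s2) for y in y_by_class]
            ms = [v * np.sum(y - X @ beta) / s2
                  for v, (y, X) in zip(vs, zip(y_by_class, X_by_class))]
            # M-step: update the fixed effects and the variance components
            XtX = sum(X.T @ X for X in X_by_class)
            Xty = sum(X.T @ (y - m)
                      for m, (y, X) in zip(ms, zip(y_by_class, X_by_class)))
            beta = np.linalg.solve(XtX, Xty)
            T = sum(m ** 2 + v for m, v in zip(ms, vs)) / J
            s2 = sum(np.sum((y - X @ beta - m) ** 2) + len(y) * v
                     for (y, X), m, v in zip(zip(y_by_class, X_by_class), ms, vs)) / N
        return beta, T, s2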

3.2.4 Statistical inference

For the hierarchical models discussed here, statistical inference is straightforward for all estimated coefficients in the fixed effects part of the specification. For the random effects, covariance parameters follow directly from optimising the likelihood function. An LR test of the model with and without level components will give an indication of the need to add levels to the model in the first place. By estimating the model with and without a random effect, it can be evaluated whether the variance estimate for that random effect is greater than zero.21
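As footnote 21 explains, the null distribution of this test is a 50:50 mixture of χ² distributions; the corrected p-value is easily computed, as in this sketch.

    from scipy import stats

    def variance_lr_pvalue(lr, q):
        # P-value for testing the (q+1)th random effect under the mixture null
        # 0.5 * chi2(q) + 0.5 * chi2(q+1); chi2(0) is a point mass at zero.
        tail_q = stats.chi2.sf(lr, q) if q > 0 else float(lr <= 0)
        return 0.5 * tail_q + 0.5 * stats.chi2.sf(lr, q + 1)

    # Example: an LR statistic of 3.2 for the first random intercept (q = 0)
    p_value = variance_lr_pvalue(3.2, q=0)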

3.3 Pupil advice

Research focus: How do pupil, class and school characteristics relate to the selection process into secondary education in the Netherlands?

Pupils leave the primary school system with a single, concrete advice out of a possible eight alternatives. Every advice corresponds to a single secondary education track, but also grants entrance to all 'lower' tracks. These advices are composed out of two separate assessments: one by the pupil's teacher and one through participation in a CET - a centralised testing mechanism. Interest lies in the drivers behind both assessments, the drivers of a mismatch between the two and, finally, the drivers of a potential revision after a mismatch.

The two assessments differ in the nature of the observed variable. A teacher's assessment is observed in terms of the possible advices available and thus has a categorical nature. The assessment by the CET is simply a score on a standardised test. The latter can then be transformed into one of the eight available advice levels, and a mismatch follows directly from a comparison between the two. Although technically constituting a loss of information, comparability of the two models is much more straightforward when the CET score has been transformed, since the initial score does not map linearly onto the eight advice levels.
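A sketch of this transformation, assuming hypothetical cut-offs on the CET score scale; the true score-to-advice mapping is set by the test provider and is non-linear.

    import numpy as np

    # Hypothetical cut-offs partitioning the score scale into eight advice levels
    cutoffs = np.array([510, 519, 526, 532, 537, 541, 545])
    cet_score = np.array([505, 530, 548])              # example scores
    cet_advice = np.digitize(cet_score, cutoffs) + 1   # advice levels 1..8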

For these categorical data, an ordered response model is appropriate, given that there is a natural ranking of the available levels. To this effect, a Hierarchical Ordered Probit Model (HOPIT) will be estimated. Additionally, an HLM for the actual CET score will be estimated as well; comparing the two can indicate the effect of the loss of information due to the transformation of the data.

21 This LR test does not follow the standard χ²-distribution, since we are operating on the boundary of the parameter space. To see why, note that negative values of the covariance parameters are not allowed, so a standard χ²(1) test would be overly conservative. Instead, a mixture of χ²(0) and χ²(1) should be used, where χ²(0) represents a point mass at zero (Rabe-Hesketh & Skrondal, 2012). In general, the asymptotic null distribution for testing whether the (q + 1)th random effect is zero takes the form 0.5χ²(q) + 0.5χ²(q + 1).

To clarify terminology: 'teacher advice' is defined as the advice granted to a pupil by a teacher as his/her assessment of the pupil's ability. 'CET score' is defined as the score a pupil received on the central test. 'CET advice' refers to the ordinal transformation from CET score to advice level. Whenever the CET advice differs from the teacher advice, a 'mismatch' has occurred. These can be upward (CET advice > teacher advice) or downward (CET advice < teacher advice). Given a mismatch, the teacher advice can be revised up- or downwards to coincide (more closely) with the CET advice. This yields a pupil's 'final advice', which is the definitive advice with which the pupil leaves the primary school system.

In what follows, a multinomial model is discussed that can provide estimates for the drivers of teacher advice and CET advice (sub-question 1). Then, a dichotomous model is discussed that will provide estimates of the drivers of a mismatch (sub-question 2) and of a revision (sub-question 3). The dichotomous case is a simplification of the multinomial case and follows naturally. The analysis of revisions will, however, require some additional discussion of sample selectivity and bootstrapping in order for consistent estimates and correct standard errors to be obtained.

3.3.1 Hierarchical ordered response model

For a categorical dependent variable, y_i = m, where m ∈ [1, . . . , M], and given a natural ordering of these categories, an ordered response model is appropriate. Following Maddala (1983), such an ordered response model can be approached through a latent variable, y*_i, which depends on regressors X_i in a linear fashion:

    y*_i = X_i β + ε_i                                               (20)

The total domain of this latent variable can be partitioned into M sections using M + 1 cut-off points μ_m, where m ∈ [0, . . . , M]. We then obtain the following observation rule for response m ∈ [1, . . . , M]:

    y_i = 1 if μ_0 < X_i β + ε_i ≤ μ_1,
    y_i = 2 if μ_1 < X_i β + ε_i ≤ μ_2,
        ⋮
    y_i = M if μ_{M−1} < X_i β + ε_i ≤ μ_M

Where we set μ_0 = −∞, μ_M = ∞ and μ_{m+1} > μ_m for all m. By assuming ε_i in (20) to be normally distributed, we obtain the ordered probit model with the following associated probabilities:

    Pr[y_i = 1] = Φ(μ_1 − X_i β)
    Pr[y_i = 2] = Φ(μ_2 − X_i β) − Φ(μ_1 − X_i β)
        ⋮
    Pr[y_i = M] = 1 − Φ(μ_{M−1} − X_i β)                             (21)

To account for the structure of the data and extend the model to the hierarchical case, the extension is analogous to the HLM case from Section 3.2. Assuming two levels and a single regressor for notational simplicity, we obtain the following hierarchical structure of the latent variable in (20):

    Level 1 (pupil): y*_ij = β_0j + X_ij β_1j + ε_ij                 (22a)
    Level 2 (class): β_0j = γ_00 + Z_j γ_01 + η_0j                   (22b)
                     β_1j = γ_10 + Z_j γ_11 + η_1j                   (22c)

This hierarchical linear model for y* can once again be rewritten compactly into the form defined in (16):

    y*_ij = X_ij β + Z_ij u_j + ε_ij                                 (23)

By defining τ_ij = X_ij β + Z_ij u_j we obtain y*_ij = τ_ij + ε_ij and can proceed with the same ordered probit continuation as before, with the probability that observation y_ij = m defined as:

    p_ijm = Pr[y_ij = m | μ, u_j]
          = Pr[μ_{m−1} < τ_ij + ε_ij ≤ μ_m]
          = Φ(μ_m − τ_ij) − Φ(μ_{m−1} − τ_ij)                        (24)

Where it is assumed that ε_ij ∼ N(0, 1) and η_j ∼ N(0, T_η), with 0 a 2 × 1 zero vector. The full three-level specification follows naturally, since additional levels are added to (23) straightforwardly.

3.3.2 Estimation technique

Disregarding the structured nature of the data for the moment, we obtain the following likelihood, which has to be optimised over the M − 1 cut-off points μ_m and parameters β:

    L(Y, β, μ) = ∏_{i=1}^{N} f(y_i | X_i) = ∏_{i=1}^{N} ∏_{m=1}^{M} p_im^{d_im}   (25)

Where d_im is an indicator equal to one if y_i = m and zero otherwise.
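A minimal sketch of the log of the likelihood in (25) for the non-hierarchical case, using the probabilities in (21); all variable names are illustrative. Maximisation (e.g. with scipy.optimize.minimize on the negative log-likelihood) requires keeping the cut-offs increasing, for instance by parameterising them through positive increments.

    import numpy as np
    from scipy import stats

    def ordered_probit_loglik(params, y, X):
        # params = (beta_1..beta_p, mu_1 < ... < mu_{M-1}); y holds integers 1..M
        p = X.shape[1]
        beta, mu_inner = params[:p], params[p:]
        mu = np.concatenate(([-np.inf], mu_inner, [np.inf]))  # mu_0, ..., mu_M
        xb = X @ beta
        # Pr[y_i = m] = Phi(mu_m - X_i beta) - Phi(mu_{m-1} - X_i beta), as in (21)
        probs = stats.norm.cdf(mu[y] - xb) - stats.norm.cdf(mu[y - 1] - xb)
        return np.sum(np.log(probs))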
