Predicting high school student performance : a Hierarchical Early Warning Indicator System


MSc L.C.E. Huberts, University of Amsterdam

October 28, 2016

Abstract

Early Warning Indicator (EWI) systems help school teachers and management in student guidance and counseling. This paper develops a novel modeling approach to predict high school student performance for use in EWI systems. A predictive hierarchical model is estimated that incorporates six sources of information: a student's observed performance, the performance of other students, course performance, the performance of other courses, observed course characteristics and observed individual student characteristics. The model serves as a robust example of the power of hierarchical modeling. The resulting hierarchical specification is estimated using Markov Chain Monte Carlo methods on a large data set from a Dutch high school. Prediction using these estimates offers useful expectations on student performance. The procedure results in a predicted distribution of grades for each course and every student, where the identification of 'at risk' and 'excellent' student-courses is the top priority. The predictions correctly identify the at risk and excellent grades from around 70% of the time before the start of the year to 100% after 70% of the year has passed. The model outperforms simpler specifications and is most valuable for the first half of the school year. The results can be used to aid schools in efficient and effective student guidance and performance improvement in the form of a Hierarchical Early Warning Indicator System (HEWIS).

Thesis supervisor: Dr. M.J.G. Bun, University of Amsterdam

INTRODUCTION

"Early Warning Indicator Reports were invaluable to the success of our school" (high school principal, quoted in the Strategic Data Project Report, 2016). Predictive analytics has started to make the leap from the business world into educational systems worldwide. Companies like IBM offer solutions for colleges and universities to use predictive analytics in improving student performance, retention, enrollment and even campus safety. A recent report by the Harvard University Strategic Data Project emphasizes the need for actionable predictive analytics in high schools to help keep students on track toward graduation and better prepare them for college and career success.¹ The report discusses three examples of early warning indicator (EWI) systems that help school teachers and management with early identification of students with a lower probability of passing, based on logistic regressions of student grade/attendance information. The models used are quite simple, but the process is not yet automated for schools' convenience.

This study extends the existing EWI models by allowing for the use of more information and more specific prediction. The hierarchical models proposed in this paper do not directly predict the logit probability of passing a grade, but predict distributions for all individual grades, from which the pass/fail probabilities can subsequently be calculated. This allows for the integration of unobserved student and course heterogeneities, resulting in more accurate predictions. Furthermore, the hierarchical Bayesian approach allows for estimation of student grades with little or no student specific prior information, something that the EWI models of the Strategic Data Project struggle with. The proposed models further allow for updating according to incoming grades/attendance information, which facilitates automation and implementation into high school procedures.

¹ See Becker, Hall, Levinger, Sims, & Whittington (2016) for an evaluation of the introduction of early warning indicator systems in the US.

Hierarchical models, also known as multilevel, nested and mixed models, have gained a lot of traction in the past 20 years for expressing uncertainty at several levels of aggregation (Browne & Draper, 2006). These hierarchical models have become easier to implement due to increases in computing power and advances in sampling and likelihood techniques. There are several ways to estimate multilevel models, using both Bayesian and frequentist methodology. Bayesian estimation has become more feasible due to advances in Markov Chain Monte Carlo techniques, while frequentist approaches use likelihood-based techniques. The advantages and disadvantages of both approaches will be briefly discussed in the method section, and analyses will be done using the hierarchical Bayesian modeling approach.

The determinants of academic performance are currently under review in the Netherlands, as reports show that primary school teachers' advice is very biased.² This has increased the demand for research into children's academic performance and its underlying factors. Different approaches can be taken, either aggregating school data to compare schools (Jackson & Lunenburg, 2010) or looking at individual student data (Parker et al., 2004). Average test scores and dropout rates are often used as outcome variables when looking at individual student academic achievement (Nichols, 2003). The analyses in this paper are based entirely on within school data, with either test scores or weighted averages as outcome variables.

² See the report by the Dutch Inspectorate of Education (2016) for an extensive analysis of the Dutch educational system in recent years and news reports by NRC Handelsblad (2016, April 15) and De Volkskrant (2016, April 13 & 2016, April 24).

An advantage of predicting individual grades is that it allows for calculation of expected dropout rates. The last section of this paper will calculate these probabilities, which may be used as the main indicator of the EWI. High school dropout has been analyzed in several studies, providing insight into factors influencing student performance. An example of a test of theories on high school dropout is given by Battin-Pearson, Abbott, Hill, Catalano & Hawkins (2000), who use nested latent variable models to test five theories on dropout predictors. They consider poor academic performance to be the mediating effect between many different variables and high school dropout.

The main form of performance measurement in high schools worldwide is the grading of students. These grades can be numeric or alphabetic, but they represent some relative performance on a standardized scale. The grades determine whether a student passes high school, the universities the student is allowed to attend and subsequently the careers and lives that the students will lead. This places great responsibility on schools to maximize their students' performance in terms of grades. This paper attempts to establish a framework that can help schools achieve just this goal, by connecting the research into drivers of academic performance to an actionable Hierarchical Early Warning Indicator System ("HEWIS"). HEWIS has the potential to aid school staff in early intervention and offer an objective evaluation of student performance through time.

The model estimations will be considered an improvement beyond the current status quo if the predictions provide a considerable improvement on simply using the current average grades for students as predictions for their end of year average grades. The results will show that the model achieves considerable improvement, where the model predictions before the start of the school year even outperform predictions using the cumulative weighted average after 10% of grades are known. As more grades are accumulated, the added value of the model decreases, as the remaining variability in the grades decreases. An advantage of using the proposed model is that estimation using Markov Chain Monte Carlo methods results in a distribution for the individual grades, which in turn can be used to estimate the probability of passing a particular course and subsequently passing the year.
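As a minimal sketch of how a distribution of predicted grades translates into tail probabilities, the snippet below counts the share of predictive draws in each tail. The thresholds (failing below 5.5, excellent from 8.0), the function name `tail_probabilities` and the simulated draws are illustrative assumptions, not values or code taken from this paper; in practice the draws would come from the MCMC output for one student-course.

```python
import numpy as np

def tail_probabilities(draws, fail_below=5.5, excellent_from=8.0):
    """Share of predictive draws in each tail (thresholds are assumed, not from the paper)."""
    draws = np.asarray(draws, dtype=float)
    return {
        "p_at_risk": float(np.mean(draws < fail_below)),
        "p_excellent": float(np.mean(draws >= excellent_from)),
    }

# Hypothetical stand-in for MCMC predictive draws of one student-course grade.
rng = np.random.default_rng(0)
example_draws = rng.normal(loc=6.0, scale=0.8, size=4000)
example_probs = tail_probabilities(example_draws)
```

A student-course could then be flagged when, for example, `p_at_risk` exceeds a cut-off chosen by the school; that cut-off is a policy choice and not part of the model.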

The ability to identify the 'tails' of student grades (i.e. the failing/excellent students) is essential, as these students might benefit from intervention by the school's teachers and management. To identify these tails, the maximum amount of identifiable variability should be incorporated, while avoiding regression towards the mean and over-fitting.

The famous Coleman report (1966) argues that only 10% of variance in pupil achievement is actually due to school characteristics. This controversial finding has led to considerable new research, which has not yet been able to end the controversy (Rivkin, Hanushek & Kain, 2005). Implementation of a red flag system based on the proposed model could increase the positive influence of schools on student achievement by more effectively allocating guidance and counseling to students at risk of failure, and offering additional opportunities to the students that can handle them.

The paper is organized as follows. First, theory on academic performance will be discussed, followed by a discussion of hierarchical modeling and estimation techniques. The proposed model will subsequently be described in detail. In the application section, the model will be applied to a large data set provided by a Dutch high school and extensively evaluated using this data. To conclude, the results of the modeling approach will be discussed and suggestions for further research considered.

PREDICTORS OF STUDENT PERFORMANCE

This section will consider some of the educational literature on student performance predictors. Most studies focus on evaluating the added value of schools, teachers or educational programs toward some measurement of educational performance. The predictors, their effects on performance and their modeling approach in this paper are summarized in Table 1.

The 'difficult to measure' predictors of this section represent variables that were not observed in the application later in this paper, but the hierarchical modeling specification incorporates many of these 'unobserved differences' between students and students within courses.

Individual characteristics

Nichols (2003) looks at prediction indicators for students failing a specific graduation exam. He finds increased school absences for students that struggle academically and that students from lower income families have a higher probability of poor results. Furthermore, Nichols (2003) observes a significant relationship between poor achievement at the beginning of students' educational careers and later poor achievement. This suggests an important role for family income, absences and temporal effects in predicting individual high school performance.

Socioeconomic status (SES) has long been argued to significantly affect school performance, although the importance varies greatly among different analyses. White (1982) analyzes almost 200 studies concerning the relationship between SES and academic achievement, concluding that when a student is the unit of analysis and traditional measures of SES are used, there is little utility in using SES as a covariate. When schools or other aggregated groups are the unit of analysis, traditional measures of SES are usually correlated with academic achievement measures. Sirin (2005) repeated the meta-analytic review of research that White (1982) performed, using more recent data from the 1990s. The results found by Sirin (2005) show a slightly smaller magnitude of the SES-school achievement relationship, but do suggest using SES in research concerning school success. Geiser & Santelices (2007) argue that omission of socioeconomic background factors can lead to significant overestimation of the predictive power of academic variables that are strongly correlated with socioeconomic advantage.³

Table 1: Summary of predictors and modeling approach

Predictor            Effect on performance           Modeling approach
                     Student level    Class level
SES                  +                               Explanatory variable
Disabilities         -                               Explanatory variable
Language             -/+                             Explanatory variable
Non-native           -/+              -              Explanatory variable
Student effort       +                +              Student unobserved heterogeneity
Peer associations    +/-              +/-            Student/course unobserved heterogeneity
Parent involvement   +                               Student unobserved heterogeneity
School climate       +/-              +/-            Course unobserved heterogeneity
Intelligence         +                               Explanatory variable, student unobserved heterogeneity
Grades               +                               Time varying explanatory/dependent variable
Absences             -                -              Time varying explanatory variable

Learning disabilities

The most common learning disabilities in high school children are Attention-Deficit/Hyperactivity Disorder (ADHD) and dyslexia. Children with ADHD generally perform at an academically lower level than predicted by their intellectual abilities (Barry, Lyman & Klinger, 2002).

Dyslexia affects around 5.3-11.8% of school children in the USA, and dyslexic children fail to achieve school grades at a level that is commensurate with their intelligence (Karande & Kulkarni, 2005). Spelling mistakes, untidy or illegible handwriting and the inability to perform simple mathematical calculations correctly are the hallmarks of this life-long condition (Karande & Kulkarni, 2005).

³ Geiser & Santelices (2007) base this assumption on a study by Rothstein (2004), which argues the exclusion of student background characteristics from prediction models inflates the SAT's apparent validity by over 150 percent.

Although they might not be directly linked to learning, disabilities like asthma, epilepsy and autism can indirectly influence academic performance. Asthma can cause increased absenteeism, which in turn can cause lower academic performance (Karande & Kulkarni, 2005). New onset epilepsy causes children to be vulnerable when processing memory tasks, and antiepileptic drugs can affect cognition adversely (Bourgeois, 2004).

Autistic children can face a lot of problems in school as their core features impair learning (Karande & Kulkarni, 2005). Furthermore, medical problems like visual impairment, hearing impairment, malnutrition and low birth weight can cause difficulties in school (Karande & Kulkarni, 2005).

Immigration & Language

The language that a child speaks at home can influence their academic abilities both positively (Buriel et al., 1998) and negatively (Kennedy & Park, 1994). Collier (1995) finds that immigrants and language minority students need 4-12 years of second language development for the most advantaged students to reach deep academic proficiency and compete successfully with native speakers.

It has been suggested that the presence of non-native speakers in schools has a negative effect on the performance of native speakers, but this has been refuted by Geay, McNally & Telhaj (2013). In contrast, children who interpret for their immigrant parents, so-called 'language brokers', often perform better academically (Buriel et al., 1998).

Although often suggested otherwise, recent academic works show that second generation immigrants often perform better academically than their native counterparts.⁴ Ohinata & Van Ours (2013) find a raw correlation between higher immigrant presence in the classroom and a worse learning environment, but do not find evidence for negative spill-over effects on Dutch children.

Nationality and language, learning disabilities, SES and several other individual characteristics are part of the observed characteristics that are included in this study's predictive model. The following section discusses some of the variables that will remain unobserved, but are explicitly modeled by allowing for two forms of unobserved heterogeneity.

⁴ See Dustmann and Glitz (2011), who show that the share of tertiary education for the foreign-born population exceeds that of the native-born population in England, and Dustmann, Frattini and Theodoropoulos (2011), who show that second-generation ethnic minority immigrants tend to be better educated than their white native peers.

Difficult to measure predictors

The list of variables that are important to academic performance, but difficult to measure, ranges from parent involvement to peer pressure and individual intelligence. Some important variables found in the literature on academic achievement are discussed in this section.

Student effort

Student effort toward educational achievement or attainment is characterized by the level of school attachment, involvement and commitment displayed by students (Stewart, 2008). Effort has been shown to be important towards academic results (Johnson, Crosnoe & Elder, 2001; Marks, 2000; Natriello & McDill, 1986). Analyses showing a relationship between school attachment and academic achievement are given by Johnson, Crosnoe & Elder (2001) and Roscigno & Ainsworth-Darnell (1999). Extracurricular activities have been demonstrated to influence grades (Broh, 2002; Fejgin, 1994; Guest & Schneider, 2003; Marsh, 1992; McNeal, 1995). According to the literature, student effort as it relates to increased school involvement, school attachment, and school commitment should positively affect academic achievement (Stewart, 2008).

Peer associations

Associations between high school students matter a great deal to individual academic achievement and development. Peer influence can be both positive and negative, where meaningful relationships promote psychological and life skills that can positively affect academic achievement and effort (Nichols & White, 2001). Negative peer pressure on the other hand has been shown to discourage students from conforming to behaviors that raise academic achievement (Goldsmith, 2004; Ogbu 1995a, 1995b). In summary, according to the literature, peer associations can both positively and negatively influence individual academic achievement (Stewart, 2008).

Parent involvement

Parent involvement is a difficult variable to identify without further effort by the schools. It is intuitively likely to influence academic achievement, although there is some debate regarding the actual significance. Sui-Chu & Willms (1996) analyzed the effect of parent involvement on eighth-grade achievement in the U.S. using four dimensions of parent involvement: home discussion, home supervision, school communication and school participation. They find no evidence that parent involvement is correlated with SES, race or differences between schools. The most important dimension of parent involvement towards academic achievement in their analysis is home discussion. They suggest facilitating home discussion by providing concrete information to the parents about parenting styles, teaching methods and school curricula.

Fan & Chen (2001) confirm the inconsistency in the literature on parent involvement. In their meta-analysis they find that global study indicators are more correlated with parent involvement measures than more specific indicators. They state that future studies should include both SES and parental involvement and should examine the relationship between the two and academic achievement. A meta-analysis by Jeynes (2007) suggests a significant overall influence of parental involvement for secondary school children, which holds across different races.

The school disciplinary climate

The school disciplinary climate will influence its student performance in different dimensions (Stewart, 2008). "School climate is the heart and soul of a school" (Freiberg & Stein, 1999, p. 11). A school's culture is important (Dupper & Meyer-Adams, 2002), school organizational structure represented by class and school size is thought to affect student outcomes, and the 'social milieu' of the school shapes students' values, beliefs, attitudes and behaviors regarding academic achievement and attainment (Stewart, 2008). These factors are largely constant for a single school, although they might vary slightly over time. Rumberger & Palardy (2005) show that class disruptions have a negative impact on students' performance.

Intelligence

Both general cognitive ability and emotional intelligence are important for academic achievement (Rohde & Thompson, 2006; Laidra, Pullmann & Allik, 2006; Parker et al., 2006). Jencks et al. (1979) report correlations ranging from 0.40 to 0.64 between cognitive test scores and the amount of education obtained. Deary, Strand, Smith & Fernandes (2007) find a large correlation between a latent intelligence trait at age 11 and educational achievement at age 16, with correlations as high as 0.81. The authors find a gender gap in academic achievement, which was not present in the test of general intelligence.

Parent involvement, the disciplinary climate and individual intelligence are usually quite difficult to measure. This study aims to incorporate them nonetheless. Parent involvement is incorporated mostly in student unobserved heterogeneity. Limited observed information on the parents is included in the predictive model (i.e. education level and SES). Disciplinary climate and class disruptions are mostly covered by including absences that equate to dismissals from class and within unobserved course differences. Intelligence is approximated using unobserved student differences, combined with previous test results (i.e. CITO score).

Time varying variables

Grades

A high school student obtains test results throughout the year. HEWIS predicts the end of year grade, which is a weighted average of all the individual test grades of the completed year. At any time during the year, there is a current weighted average grade for each student and course. This current average is an already determined part of the end of year grade, making it the strongest possible predictor.
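The relation between the current weighted average and the end of year grade can be illustrated with a short sketch. It assumes that the test weights for the whole year are known in advance and that grades lie on a 1 to 10 scale; both assumptions, as well as the function names, are illustrative and not taken from the school's data. The sketch shows how the grades already obtained narrow the range of end of year grades that is still attainable.

```python
def current_weighted_average(grades, weights):
    """Weighted average of the test grades obtained so far."""
    return sum(g * w for g, w in zip(grades, weights)) / sum(weights)

def end_of_year_bounds(grades, weights, total_weight, lowest=1.0, highest=10.0):
    """Lowest and highest end of year grade still attainable.

    total_weight is the (assumed known) sum of all test weights in the year;
    lowest/highest describe an assumed 1-10 grade scale.
    """
    fixed_part = sum(g * w for g, w in zip(grades, weights))
    remaining = total_weight - sum(weights)
    return (
        (fixed_part + lowest * remaining) / total_weight,
        (fixed_part + highest * remaining) / total_weight,
    )

# Two tests done (weights 1 and 2) out of a total yearly weight of 6.
avg_so_far = current_weighted_average([6.0, 7.0], [1, 2])
low, high = end_of_year_bounds([6.0, 7.0], [1, 2], total_weight=6)
```

As the completed weight approaches the total weight, the two bounds converge to the current average, which is why the current average becomes an increasingly strong predictor during the year.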

Absences

Regular attendance is important for a student's high school performance (Rothman, 2001). Most students will be absent from school at least a few times per year, for reasons that vary from serious illness to ditching class. Absences will influence a student's performance, as attending class helps students understand the material and motivates their participation (Rothman, 2001). Absences can also function as a mediating effect for disabilities (Moonie, Sterling, Figgs & Castro, 2008). A school needs to track absences separately from grading, and most schools already do. Absences without a valid reason will usually be penalized by the school.

Various temporal effects

Temporal effects in student performance encompass both inter-year changes and intra-year changes. Students will change the allocation of their effort and time according to their current average grade, their average grade for other courses, seasonal effects, within school changes and external factors. Students will allocate effort to subjects that are of higher importance at the moment, or to parts of courses that are more interesting to the student. Furthermore, external factors like tutoring and various personal circumstances are likely to influence a student's grades both intra- and inter-year. Ideally, modeling will allow for student and course specific effects to vary over time.

METHOD

This section discusses the hierarchical modeling approach to predicting student performance. Peer pressure and peer conformity, environmental factors, teaching style and teaching quality call for a group-level hierarchical approach, because they all affect groups of students together. Explicitly modeling unobserved differences between these groups can capture some of these effects.

Santor, Messervey & Kusumakar (1999) look at peer pressure in adolescents, a difficult to measure factor that can influence school performance. They find that peer pressure and peer conformity are potentially great risk factors to students' performance. Scheerens (1990) discusses variables and research that are important for within school functioning of students. He argues that the key process variables for school effectiveness include stimulating environmental factors, achievement oriented policies, educational leadership, amount of instruction, learning opportunities and structured teaching. These variables differ mostly between schools, but they might also vary within a school, between different courses. Rivkin, Hanushek & Kain (2005) validate the impact of schools and teachers in influencing achievement. They find significant, powerful effects of teachers on reading and mathematics achievement, although little of the variation in teacher quality is explained by observable teacher characteristics. This, again, validates a hierarchical approach, as teachers influence the students within a specific course.

Hierarchical modeling is a generalization of regression methods and will allow for the expression of uncertainty at the various nested levels within a school. Compared with classical regression, hierarchical modeling is almost always an improvement and for predictive modeling it can be essential (Gelman & Hill, 2007). The hierarchy will allow for predictions of new groups, which is essential considering the fluid nature of the student population within a school. It furthermore allows for reasonable estimates of grades for students with small within-student sample sizes, which would be difficult using classical regression (Gelman & Hill, 2007).

Another advantage of specifying a full hierarchical model is the ability to separately estimate the effects of observed and unobserved group characteristics, which is not possible in a classical fixed effects regression model where the group-level predictors are absorbed in the effects of the group dummies.

The model I propose will contain three different levels. The lowest level consists of the end of year grades of a single student for a single course. These grades will be considered part of two other levels, the student level and the class level, which are both integrated in the model. A single class is attended by multiple students, which suggests making classes the highest level of the hierarchy. This hierarchical step is unfavorable, however, because it would restrict every student to be a part of a single class. In most high schools, students are a part of various classes for different courses, where the composition of those classes usually differs. In essence, there are multiple clusters of students for single course categories. Embedding the student level into the class level would mimic the primary school situation where all course subjects are taught within the same class. The solution is to have parallel student and class levels, where the lowest level end of year grade is simultaneously a part of a student level and a course level.

At the class level, positive peer associations and better teaching might improve the results of some classes compared to others. At the student level, there are varying levels of intelligence, effort and absences. The three levels will contain level-specific predictors, as well as unobserved level-specific random effects. The predictors capture observed within level differences, whereas the random effects within a level capture the unobserved differences within one of the three levels. Each level will have a distribution, the parameters of which represent the within group similarities. This setup results in a multilevel model that is more reasonable than classical no-pooling and complete pooling regression models (Gelman & Hill, 2007). The hierarchical model represents the middle ground between these two extremes, as it allows for low level variation without over-fitting through the use of higher level mean distributions.
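The middle ground between no pooling and complete pooling can be illustrated with the standard precision-weighted estimate of a group mean described by Gelman & Hill (2007). The sketch below is not the model of this paper: it treats the within-group variance `sigma2` and the between-group variance `tau2` as known, whereas a full hierarchical model would estimate them, and the function name and example grades are hypothetical.

```python
import numpy as np

def partial_pooling_means(groups, sigma2, tau2):
    """Precision-weighted compromise between each group mean and the grand mean.

    sigma2: assumed known within-group variance.
    tau2:   assumed known between-group variance.
    """
    all_obs = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    grand_mean = all_obs.mean()              # complete pooling estimate
    estimates = []
    for g in groups:
        g = np.asarray(g, dtype=float)
        data_precision = g.size / sigma2     # grows with the group's sample size
        prior_precision = 1.0 / tau2
        estimates.append(
            (data_precision * g.mean() + prior_precision * grand_mean)
            / (data_precision + prior_precision)
        )
    return grand_mean, np.array(estimates)

# Hypothetical grades: one student with a single grade, one with six grades.
grand, pooled = partial_pooling_means(
    [[4.0], [7.0, 7.5, 8.0, 7.5, 7.0, 8.0]], sigma2=1.0, tau2=0.5
)
```

In this example the student with a single grade of 4.0 is pulled strongly toward the grand mean of 7.0, while the student with six grades stays close to their own average of 7.5. This is the behavior that makes reasonable estimates possible for students with small within-student sample sizes.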

By embedding the individual end of year grades in both a student and course level, the model will allow for predictions that incorporate both known and unknown differences that exist in these two levels.

Consider the example of a second year student starting second year geography class. The predictions will firstly use the effects of observed characteristics of both the student and the course. Student demographics, course characteristics and all their interactions will have estimated parameters, which will be used to predict the effects of those observed characteristics on the expected second year geography grade of the student in question. Secondly, the model estimates parameters for the unobserved differences between courses and students. On the student level, the model estimates student specific unobserved differences which interact with the course characteristics. For the second year geography student, this will include a baseline of their ability, their interest in geography and the differences between this student's first and second year effort. On the course level, the model estimates course specific unobserved differences that interact with the student variables. For the second year geography student, these parameters will capture a baseline estimate of the difficulty of the geography course, and the correlation between student specific variables like gender and disabilities and the performance within that course.

One of the major advantages of this setup is that it allows for prediction of a student's grade for a course that he or she has never taken before, based on his or her observed and unobserved characteristics. The same is true for an entirely new student following existing courses: the model will be able to predict the expected grade based on the course's observed and unobserved characteristics and their interactions with observed student characteristics. In marketing terms, the model combines a form of collaborative filtering with content based filtering (Ansari, Essegaier & Kohli, 2000).

Considering the effects that have been discussed in the previous section, the framework in which expected end of year grades will be calculated is now discussed. This model is inspired by the discussed marketing literature and various studies in educational research.⁵

The discussed student and course levels will be separately motivated, after which I will describe the resulting full model with both the observed and unobserved effects of the two levels incorporated. This inter-year model will subsequently be extended to allow for intra-year predictions.

Inter-year

The inter-year model predicts the end of year average grades for all courses and students before the start of the year, thus not including any intra-year information.

⁵ The following model is inspired by literature on collaborative and content-based filtering, see for example Ansari, Essegaier & Kohli (2000) on Internet Recommendation Systems.

Student level

To estimate the expected end of year average at the beginning of a year, the grade averages are modeled allowing for individual student differences. Students generally follow a fixed set of courses through the year, but not all students have the same set of courses. This yields unbalanced data. Define $i = 1, \dots, I$ as the individual students and $j = 1, \dots, J$ as all the courses that are offered within a school. Define the set of courses taken by student $i$ as $C_i = \{j_1, j_2, \dots, j_{n_i}\}$.

Each of the courses within this set for student $i$ will receive a grade, which at the student level will be modeled by the following specification:

$$g_{ij} = x_j^{c\prime} \beta_i + e_{ij}, \qquad e_{ij} \sim N(0, \sigma^2) \tag{1}$$

where course $j$ is in student $i$'s course set $C_i$, $x_j^c$ is a vector of known course attributes for course $j$ and $\beta_i$ is a vector of parameters specific to student $i$. This vector represents the connection to the student level model, where unobserved student differences are introduced. The normality assumption of the error term should not affect the estimates of the coefficients, as this is ensured by asymptotic normality of the posterior distribution, even if the true distribution of the data is not within the parametric family under consideration (Gelman, Carlin, Stern & Rubin, 2014).

To model these unobserved student differences, the vector of parameters $\beta_i$ specific to student $i$ is specified as:

$$\beta_i = x_i^{s\prime} \gamma + \phi_i^s, \qquad \phi_i^s \sim N(0, \Phi^s) \tag{2}$$

for $i = 1, \dots, I$. In this equation $x_i^s$ contains observed student characteristics (e.g. intelligence indicators, nationality, language) and $\phi_i^s$ represents the unobserved student specific effect (e.g. parent involvement, peer association, student effort). Work by Verbeke & Lesaffre (1997) and McCulloch & Neuhaus (2011) shows that assuming normality of this random effect will deliver consistent estimation, even when the actual random effect distribution is not normal. As $\beta_i$ determines the slope of the effects of the observed course characteristics for student $i$, the unobserved student specific effects $\phi_i^s$ interact with course specific observed variables.

Collecting the two equations, the model can be written as:

$$g_{ij} = x_{ij}^{sc\prime} \gamma + x_j^{c\prime} \phi_i^s + e_{ij}, \qquad e_{ij} \sim N(0, \sigma^2), \quad \phi_i^s \sim N(0, \Phi^s) \tag{3}$$

for $i = 1, \dots, I$ and $j \in C_i$. In this third equation, $x_{ij}^{sc}$ is a vector containing all student and course attributes, as well as their interactions. The vector $\gamma$ contains parameters for the effects of the observed student and course specific characteristics and their interactions, $x_j^c$ contains observed course attributes and $\phi_i^s$ is a vector of student $i$’s unobserved characteristics. The covariance matrix $\Phi^s$ determines the amount of student specific unobserved differences.

Course level

The differences between various courses are often not easily captured in a set of fixed variables. Student performance in a specific course is determined by a variety of course characteristics which interact in complex ways. Similarity in geography courses in first grade and geography courses in the second grade is intuitively clear, but the performance for a single student might differ greatly between these two courses. This can be due to the content of the course, the teacher, popularity among peers and many other factors. Generally a lot of these course attributes are unobserved, calling for modeling of unobserved course differences. Similarly to the student level, the observed and unobserved effects of course characteristics at the course level are modeled to interact with the student characteristics at the grade level.

Every course has a set of students that take the course; this set consists of all students that have taken the course over the years. Define this set of students as $S_j = \{i_1, i_2, \dots, i_{n_j}\}$ for course $j$. Now course $j$’s grade for student $i$ is defined as $g_{ji}$, where student $i$ is in set $S_j$. The set of students $S_j$ generally differs per course, which results in an unbalanced data-set. The grade of course $j$ for student $i$ will be modeled as:

$$g_{ji} = x_i^{s\prime} \beta_j + e_{ji}, \qquad e_{ji} \sim N(0, \sigma^2) \tag{4}$$

where student $i$ is in the set of students $S_j$ for course $j$. The vector $x_i^s$ contains student $i$’s observed characteristics, $\beta_j$ is a vector containing the course specific parameters for course $j$, which is modeled as:

$$\beta_j = x_j^{c\prime} \mu + \phi_j^c, \qquad \phi_j^c \sim N(0, \Phi^c) \tag{5}$$

for courses $j = 1, \dots, J$. Vector $x_j^c$ contains all the observed course attributes, $\phi_j^c$ represents the course specific unobserved differences. Combining equations (4) and (5) gives:

$$g_{ji} = x_{ji}^{cs\prime} \mu + x_i^{s\prime} \phi_j^c + e_{ji}, \qquad e_{ji} \sim N(0, \sigma^2), \quad \phi_j^c \sim N(0, \Phi^c) \tag{6}$$

for all courses $j = 1, \dots, J$ and students $i \in S_j$. In equation (6), $x_{ji}^{cs}$ is a vector containing all observed course and student attributes, as well as their interactions. Variance matrix $\Phi^c$ determines the extent of course specific unobserved differences in expected student grades.

Combining the student and course levels

Combining the student and course levels into the model for student $i$’s grade for course $j$ allows for estimation of both observed and unobserved differences between courses and students. Combining equations (6) and (3) results in the full model for the expected average end of year grade:

$$g_{ij} = x_{ij}^{sc\prime} \beta + x_i^{s\prime} \phi_j^c + x_j^{c\prime} \phi_i^s + e_{ij}, \qquad e_{ij} \sim N(0, \sigma^2), \quad \phi_j^c \sim N(0, \Phi^c), \quad \phi_i^s \sim N(0, \Phi^s) \tag{7}$$

for $i = 1, \dots, I$ and $j \in C_i$. The vector $x_{ij}^{sc}$ contains all student and course specific observed characteristics as well as their interactions. This vector is equal to the $x_{ji}^{cs}$ vector of equation (6), thus $\gamma$ and $\mu$ are combined into $\beta$, which contains the effects of the observed characteristics and their interactions. $x_i^s$ is a vector of student specific characteristics which interacts with the course specific random effects $\phi_j^c$. The vector of course specific characteristics $x_j^c$ interacts with the student specific random effects $\phi_i^s$.
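As a minimal sketch of how the terms of equation (7) combine, the Python snippet below simulates a single grade. All dimensions and parameter values are illustrative assumptions, not estimates from this paper, and the random effects are drawn with diagonal covariance matrices purely for simplicity. The helper names `dot` and `simulate_grade` are hypothetical.

```python
import random

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def simulate_grade(x_sc, x_s, x_c, beta, phi_c, phi_s, sigma, rng):
    """One draw of g_ij = x_sc'beta + x_s'phi_c + x_c'phi_s + e_ij (equation 7)."""
    e_ij = rng.gauss(0.0, sigma)
    return dot(x_sc, beta) + dot(x_s, phi_c) + dot(x_c, phi_s) + e_ij

rng = random.Random(1)
# Hypothetical student attributes (intercept, standardised test score)
# and course attributes (intercept, school year).
x_s = [1.0, 0.5]
x_c = [1.0, 2.0]
# Observed attributes plus their interaction.
x_sc = [1.0, x_s[1], x_c[1], x_s[1] * x_c[1]]
beta = [6.5, 0.4, -0.2, 0.05]  # assumed fixed effects
# Random effects: phi_c has the dimension of x_s, phi_s the dimension of x_c.
phi_c = [rng.gauss(0.0, 0.3) for _ in x_s]  # course j effects, multiply x_s
phi_s = [rng.gauss(0.0, 0.3) for _ in x_c]  # student i effects, multiply x_c
grade = simulate_grade(x_sc, x_s, x_c, beta, phi_c, phi_s, sigma=0.5, rng=rng)
print(round(grade, 2))
```

Setting the random effects and the error variance to zero reduces the draw to the fixed part $x_{ij}^{sc\prime} \beta$, which is a convenient sanity check.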

Estimation Procedure

The full model can be estimated using both likelihood and Bayesian techniques. This paper estimates the model using Markov Chain Monte Carlo (MCMC) simulation in the Bayesian context. A discussion of Bayesian and likelihood-based techniques for multilevel models is given by Browne & Draper (2006). The Bayesian method using MCMC estimation should provide unbiased estimates of the parameters. The MCMC method applied in this article uses the Gibbs sampling procedure [6], which requires the full conditional distributions of the unknown model parameters. These and the specification of the prior distribution of the unknown parameter space are described in the appendix. Random draws are generated for parameter blocks $\{\beta, \Phi^c, \Phi^s, \sigma^2\}$ using the Gibbs sampling procedure.

Intra-year

This subsection extends the proposed inter-year model of equation (7) to allow for intra-year predictions of end of year grades. There is a very obvious connection between a student’s test results and their end of year average grade, as the end of year grade is made up of the weighted average of all their test results. Thus as a student starts to obtain test grades, predictions for end of year averages should become more and more accurate. To introduce time varying variables like current weighted average and absences, the hierarchical model of equation (7) needs to be extended.

Suppose at time $t$ student $i$ has a weighted average grade for course $j$ of $c_{ijt}$. Define vector $\psi_{it}^s$, which contains the time dependent characteristics of student $i$ at time $t$ for course $j$ (e.g. absences).

The current weighted average $c_{ijt}$ is an interesting variable, as it does not just have predictive power over the end of year average, but it defines an already fully explained part of the end of year average. Thus the predictive power of any other variable will only work towards predicting variance around the current weighted average.
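The current weighted average can be illustrated with a short Python sketch. The tuple layout `(time, grade, weight)` and the function name `current_weighted_average` are assumptions made for this example; the test data are invented.

```python
def current_weighted_average(tests, t):
    """Weighted average of the test grades obtained up to time t.

    tests: list of (time, grade, weight) tuples, with time as the fraction
    of the school year that has passed. Returns None if no test was taken yet.
    """
    done = [(grade, weight) for (when, grade, weight) in tests if when <= t]
    total_weight = sum(weight for _, weight in done)
    if total_weight == 0:
        return None
    return sum(grade * weight for grade, weight in done) / total_weight

# Invented tests for one student in one course.
tests = [(0.1, 7.0, 1), (0.3, 5.0, 2), (0.6, 8.0, 1), (0.9, 6.0, 2)]
for t in (0.2, 0.5, 1.0):
    print(t, current_weighted_average(tests, t))
```

At $t = 1$ the function returns the end of year average itself, which is why the share of the end of year grade that remains to be predicted shrinks as the year progresses.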

[6] The Gibbs sampling capabilities of the JAGS software were used in this study.

To predict the end of year average grade $g_{ijt}$ for student $i$, course $j$ at time $t$, both $c_{ijt}$ and $\psi_{it}^s$ are added to equation (7). To recognize the invariance of the $c_{ijt}$ term, it is included separately in the equation. The student specific terms from vector $\psi_{it}^s$ are included in the $x_{it}^s$ student characteristic vector. The course specific terms are included in the $x_{jt}^c$ course observed variables vector. These modifications result in the following specification:

$$g_{ijt} = c_{ijt} \chi + x_{ijt}^{sc\prime} \beta + x_{it}^{s\prime} \phi_{jt}^c + x_{jt}^{c\prime} \phi_{it}^s + e_{ijt}, \qquad e_{ijt} \sim N(0, \sigma^2), \quad \phi_{jt}^c \sim N(0, \Phi^c), \quad \phi_{it}^s \sim N(0, \Phi^s) \tag{8}$$

for $i = 1, \dots, I$, $j \in M_i$ and $t \in (0, 1)$. The value of $c_{ijt}$ represents the weighted average grade of student $i$ for course $j$ at time $t$ and $\chi$ measures the degree to which $c_{ijt}$ directly determines $g_{ijt}$. The vector $x_{ijt}^{sc}$ contains the student and course specific observed characteristics as well as their interactions at time $t$, $\beta$ contains the parameters that represent the effects of these variables, $\phi_{it}^s$ are the student specific unobserved differences which interact with the observed course characteristics $x_{jt}^c$ and $\phi_{jt}^c$ are the course specific unobserved differences which interact with the observed student characteristics $x_{it}^s$.

This extension of the proposed model is estimated using the same MCMC methods, priors and full conditional distributions discussed in the appendix, extended with a proper diffuse prior distribution for $\chi$.

RESEARCH DESIGN

The model discussed in the previous section will be tested within the Dutch High School System. The Dutch school system in general consists of six years of primary school, followed by four, five or six years of high school. There is one level of primary school; there are multiple levels of high school.

Two criteria have been used in recent years to determine the level of high school a child is allowed to go to. Firstly, there is the teacher’s advice. This is the advice given by the teacher in the final year of primary school concerning the level that the teacher sees fit for the child. This advice is based on the performance of the child in the specific primary school.

Secondly, the CITO test is a test that is developed by the CITO organization and is scientifically designed to test a child’s academic abilities. It was initiated in the Netherlands by A.D. de Groot in 1966. As of 2014 every school is required to conduct the CITO or a similar test at the end of primary school, and most schools have been doing so for longer. The test is developed to provide an objective evaluation of a child’s abilities around the age of 12 and, until 2014, was the main criterion of selection for a child’s level of high school.

The CITO test and teacher’s advice and their relative importance and objectivity have been up for discussion recently. As of 2014, the teacher’s advice has become binding towards the level of high school a child will be able to apply for.

A recent report by the Inspectorate of Education states that this change has increased the discrepancy between advice and objective ability, with children with highly educated parents receiving advice above their test scores and the reverse for children with lower educated parents. The minister of education recently stated that the CITO test needs to become more important in the advice on high school level (Volkskrant, 2016). The preliminary analyses of this study briefly reflect on these issues within the application setting of this paper.

Passing a grade

The prediction of the students’ end of year grades allows for calculation of the probability that a student will pass a certain year of high school. This will allow for the development of a credible Early Warning Indicator system for high schools. To predict the probability of a student passing a certain school-year, the details of the system need to be addressed. The factors that influence an individual student’s performance in high school have been broadly described in the previous section; this section considers the level of performance that is required to pass a grade.

In order to pass any specific year of high school, conditions set by the school have to be met. These conditions usually consist of requirements on the end of year average grades for all the student’s courses. The grades in most Dutch high schools are on a scale of 1 to 10. The end of year grades are usually rounded, and a course is failed or ’insufficient’ if the rounded grade is below 6.

The number of allowed fail marks, i.e. the total points below six, can then be restricted. A school might for example have a student repeat the current year if he or she scores more than two fail marks, which could be a single subject with end of year grade three out of ten, or one with end grade four and one course with a grade of five.

The restrictions are not limited to the number of fail marks. There can be requirements on the total average grade, and certain subtleties emerge once the students start splitting up into high school profiles, where different students do a different set of courses from their fourth year on. These profile courses can have special requirements, with usually more importance assigned to the profile courses.

The specific rules a school employs define the probability that is estimated. When for example a student is failing a profile course, this can lead to failing the year directly. If the same student would obtain the same grade for a different course, this would not necessarily mean repeating the grade. Therefore different courses have different levels of importance to the probability of success for individual students.

It makes sense to start the analysis and predictions for the second grade of high school students. There are three main reasons for this.

First, there is no student segregation in the second year. All second year students follow the same courses and the same rules apply to them, in contrast with the segregation into profiles that occurs after year three.

Second, the model can use the students’ first year’s results as input for the estimations. At the start of the second year, the available information on the student consists of the observed characteristics and their first year grades. Based on these first year grades, the model estimates unobserved student specific heterogeneity, which is used towards predicting the second year end of year grades. As the year progresses, grades, absences and tutoring information will accumulate and the model will be able to use this information towards predicting the probability of a successful second year.

Third, second grade students fail more often than students in any of the other grades. In the application of this paper the second year students had to repeat the year more often than in any of the other grades, making it an interesting year for HEWIS to predict.

The Dutch High School

A Dutch high school has kindly cooperated in this study and provided extensive anonymous data on their students. The school houses around 800 students at any point in time and all these students are of the same high school level: VWO, which is the highest level in the Netherlands. The school has requirements on the total amount of points a student is allowed to score below 6.

At the end of the first year, a student is allowed a maximum of two fail marks, which have to be compensated by ’compensation marks’, i.e. points above six for other subjects. So a student with two fives on their report card needs at least two sevens or an eight.

In the second year, a student is allowed three fail marks where two of those fail marks have to be compensated with compensation points. This results in straightforward calculation of pass probabilities.
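The second year rule can be sketched as a small Python function. This is one reading of the rule, based on the case list in the Year two subsection: up to one fail mark needs no compensation, and two or three fail marks need one or two compensation points respectively. Counting an eight as two compensation points follows the first year example above and is an assumption for the second year. The function names are hypothetical.

```python
import math

def round_grade(g):
    """Round half up, so 5.5 becomes 6 (Python's round() rounds half to even)."""
    return math.floor(g + 0.5)

def passes_year_two(grades):
    """grades: end of year averages on the 1-10 scale, one per course."""
    rounded = [round_grade(g) for g in grades]
    fail_marks = sum(6 - r for r in rounded if r < 6)    # points below six
    compensation = sum(r - 6 for r in rounded if r > 6)  # points above six
    if fail_marks <= 1:
        return True
    if fail_marks > 3:
        return False
    # Two fail marks need one compensation point, three need two.
    return compensation >= fail_marks - 1

print(passes_year_two([5.2, 4.9, 7.1, 6.0]))  # two fives, one seven
print(passes_year_two([5.2, 4.9, 6.0, 6.0]))  # two fives, no compensation
```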

The third year becomes a bit more complicated, where there is a certain number of ’minimum value’ points, and ’total points’. The minimum value points consist of the number of points below six for the fourteen ’exam courses’ (courses that will be included at the end of high school exams in year six) plus the number of fail marks.

As an example, a student with one four and one five for two exam courses has a minimum value point score of 5, as there are three fail marks and two insufficient grades for exam courses. The student is then allowed into fourth grade according to Table 2, where a student is always allowed in with four or fewer minimum value points (for example two fives), and never allowed in with more than nine. Since there are fourteen exam courses, when a student obtains six minimum value points, from for example three fives on their report card, the student needs at least two sevens to compensate and obtain 83 total points. Adding to that, the results a student has obtained for profile courses need to be on average a minimum of 6.0.
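The two worked examples above can be checked with a short Python sketch. It assumes that minimum value points are the points below six plus the number of insufficient grades, both counted over the fourteen exam courses, and that total points are the sum of the rounded exam course grades. That reading reproduces the numbers in the text but is an interpretation. The function name `third_year_points` is hypothetical.

```python
def third_year_points(exam_grades):
    """exam_grades: rounded end of year grades for the exam courses.

    Returns (minimum value points, total points).
    """
    points_below_six = sum(6 - g for g in exam_grades if g < 6)
    insufficient = sum(1 for g in exam_grades if g < 6)
    return points_below_six + insufficient, sum(exam_grades)

one_four_one_five = [4, 5] + [6] * 12
three_fives_two_sevens = [5, 5, 5, 7, 7] + [6] * 9
print(third_year_points(one_four_one_five))       # (5, 81)
print(third_year_points(three_fives_two_sevens))  # (6, 83)
```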

From year four to five and five to six, a student can have no more than three points below six and no single grade below 4.0. For the subjects Dutch, English and Mathematics, there can be only one 5.0 and the other two minimally 6.0. If a student has more than one point below six, the compensation rule applies and the total average grade needs to be at least six. The profile courses need to have an average of six.

Year six is graduation year, where a student graduates if they meet national requirements. These requirements state that the average of all exam courses is at least a 5.5 (rounded to 6.0) and the student has either:

1. All average end grades are six or higher
2. One five, all other grades six or higher
3. One four, all the other grades six or higher and the average of all grades is 6.0 or higher
4. One four and one five, or two fives and the average of all courses is 6.0 or higher.

The final grades for the exam courses consist of a ’school exam’ part and a ’central exam’ part, the school exam part being the grade that the student has accumulated on tests that are taken by the school and the central part being the centralized national exams that all students of the specific high school level take in their final year of high school.
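The four graduation conditions listed above can be sketched as a Python check. The sketch assumes rounded integer final grades and takes the unrounded exam course average as a separate argument. It covers only the conditions stated in this section, and the function name `meets_graduation_rules` is hypothetical.

```python
def meets_graduation_rules(final_grades, exam_average):
    """final_grades: rounded final grades; exam_average: unrounded exam course average."""
    if exam_average < 5.5:
        return False
    fours = final_grades.count(4)
    fives = final_grades.count(5)
    if min(final_grades) < 4:
        return False  # a grade below four matches none of the four conditions
    mean = sum(final_grades) / len(final_grades)
    if fours == 0 and fives <= 1:
        return True                      # conditions 1 and 2
    if (fours, fives) == (1, 0):
        return mean >= 6.0               # condition 3
    if (fours, fives) in ((1, 1), (0, 2)):
        return mean >= 6.0               # condition 4
    return False

print(meets_graduation_rules([5, 5, 7, 7, 6, 6], exam_average=6.1))
print(meets_graduation_rules([5, 5, 6, 6, 6, 6], exam_average=5.8))
```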

One more detail has to be addressed. In the first three years, there are a few courses that are combined into one grade that counts towards passing the grade. The grades for these courses, given that they exceed 4.0, are averaged in the end. To predict the probability of passing the grade, it is sufficient to average them constantly provided they are higher than 4.0. A similar situation exists for years four to five and five to six, where respectively two and three subjects are combined to form one average grade. For graduation, the physical education course only counts as ’sufficient’ and does not get a grade.

Table 2: Allowed into grade 4

Min. value points | Total points: ≤ 82 | 83  | 84  | 85  | 86  | ≥ 87
< 5               | yes                | yes | yes | yes | yes | yes
5                 | yes                | yes |     |     |     |
6                 | no                 | yes |     |     |     |
7                 | no                 | no  |     |     |     |
8                 | no                 | no  | no  |     |     |
9                 | no                 | no  | no  | no  |     |
> 9               | no                 | no  | no  | no  | no  | no

Year two

For brevity and readability I will limit the discussion of the probability of passing to the second grade students, but the same principle can be applied to all grades. As discussed, a student is allowed three fail marks or points below six which have to be compensated by two points above six. Define the average grade of student $i$ for subject $j$ as $g_{ij}$, then the probability that a student passes a specific subject equals:

$$P(\text{Student } i \text{ passes subject } j) = P(g_{ij} \ge 5.5)$$

thus the probability of passing all courses equals the joint probability that all course grades for a student are 5.5 or higher. The predictions of the proposed model will result in full distribution estimates for the student grades. Each student will then obtain a set of distributions for their grades which can be used to calculate pass probabilities.

For the second year, the probability that student $i$ will pass all their courses is equal to:

$$P\left(g_{ij} \ge 5.5 \ \ \forall j \in C_i\right)$$

The probability that student $i$ receives one fail point in total and still passes is equal to:

$$P\left(4.5 < g_{ik} < 5.5 \mid g_{ij} \ge 5.5 \ \ \forall j \notin \{k\}\right)$$

where the student fails course $k$. Define $H$ as the vector of compensation courses (i.e. courses with grade 7 or higher). A student will still pass the year with two fail marks, given that $H$ contains at least one course. The probabilities can be described by:

$$P\left(4.5 < g_{ik}, g_{il} < 5.5 \mid g_{ih} \ge 6.5 \ \ \forall h \in H,\ k, l \notin H,\ H \neq \emptyset \ \wedge\ g_{ij} \ge 5.5 \ \ \forall j \notin \{k, l\} \cup H\right)$$

for two courses with one fail point each and:

$$P\left(3.5 < g_{ik} < 4.5 \mid g_{ih} \ge 6.5 \ \ \forall h \in H,\ k \notin H,\ H \neq \emptyset \ \wedge\ g_{ij} \ge 5.5 \ \ \forall j \notin \{k\} \cup H\right)$$

for one course with two fail points. The last possibility is three fail points, combined with two compensation points:

$$P\left(2.5 < g_{ik} < 3.5 \mid g_{ih} \ge 6.5 \ \ \forall h \in H,\ k \notin H,\ |H| \ge 2 \ \wedge\ g_{ij} \ge 5.5 \ \ \forall j \notin \{k\} \cup H\right)$$

for one course with three fail points. The notation $|H|$ is used for the length of vector $H$, which contains the compensation point courses. Three fail points can also be achieved by three fives, or a four and a five:

$$P\left(3.5 < g_{ik} < 4.5 \ \wedge\ 4.5 < g_{il} < 5.5 \mid g_{ih} \ge 6.5 \ \ \forall h \in H,\ k, l \notin H,\ |H| \ge 2 \ \wedge\ g_{ij} \ge 5.5 \ \ \forall j \notin \{k, l\} \cup H\right)$$

for one four and a five, and

$$P\left(4.5 < g_{ik}, g_{il}, g_{io} < 5.5 \mid g_{ih} \ge 6.5 \ \ \forall h \in H,\ k, l, o \notin H,\ |H| \ge 2 \ \wedge\ g_{ij} \ge 5.5 \ \ \forall j \notin \{k, l, o\} \cup H\right)$$

for three fives.

These probabilities can be calculated directly using the MCMC estimation results. The procedure predicts the distribution of the unknown grades by generating a large set of samples for every individual grade. Thousands of these samples are then used to calculate summary statistics for the individual grades (the ’Monte Carlo’ part of MCMC). These samples can also be used to calculate the stated pass probabilities, by counting all the sample simulations where the student would have passed the year. The percentage of sample worlds where the student would have passed the year approximates the expected actual probability.
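The counting step can be sketched in Python as follows. Since the MCMC output is not reproduced here, the draws are generated from invented normal predictive distributions as a stand-in for posterior predictive samples, clipped to the 1-10 grade scale. The pass rule is the reading of the second year rule used in this section, with compensation counted in points above six. All names and numbers are assumptions.

```python
import math
import random

def passes_year_two(grades):
    """Second year rule: at most three fail marks, compensated beyond the first."""
    rounded = [math.floor(g + 0.5) for g in grades]      # round half up
    fail_marks = sum(6 - r for r in rounded if r < 6)
    compensation = sum(r - 6 for r in rounded if r > 6)
    return fail_marks <= 1 or (fail_marks <= 3 and compensation >= fail_marks - 1)

def pass_probability(draws):
    """draws: list of simulated worlds, each a list with one grade per course."""
    return sum(passes_year_two(world) for world in draws) / len(draws)

rng = random.Random(42)
# Invented (mean, sd) of the predicted grade for four courses of one student.
predictive = [(6.8, 0.5), (5.6, 0.6), (7.2, 0.4), (5.9, 0.7)]
draws = [
    [min(10.0, max(1.0, rng.gauss(mean, sd))) for mean, sd in predictive]
    for _ in range(5000)
]
print(round(pass_probability(draws), 3))
```

Because each simulated world is evaluated as a whole, the dependence between a student’s course grades is carried through automatically when the draws come from the joint posterior predictive distribution.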


APPLICATION

The proposed model will be tested using a large data-set provided by a Dutch High school. In total there are eight years of data available, comprising 36 different subjects followed by over 1700 unique students (about 51% girls) and 711,653 individual tests. The students were born in 38 different countries, speak 18 different languages and were taught by 110 different teachers. Out of the unique students, 326 had some kind of disability while at school, 162 had a non Dutch nationality and 51 students had a serious language barrier. There are 228 ’priority’ students that entered the school through the priority arrangement, which means that at the time of application the student had one or more family members who were already attending the school and thus got priority over other applicants. The number of students with parents that have attended university or higher level academics is 261 and 86% of students were residents of the large city that the school is located in during their time at the Dutch High School.

To incorporate SES in this analysis, nationwide social status data provided by the Dutch Government was used. The relative socioeconomic status of a student using a country-wide ranking of his or her postal code was added to the data set. Learning disabilities that have been confirmed by the school are included in the data-set. The most common learning disabilities in the data are Attention-Deficit/Hyperactivity Disorder (ADHD) and Dyslexia.

The data contains the score of an intelligence test that is taken around age 12, near the end of primary school in the Netherlands. This test, the CITO score, is used as a selection tool for high-level high schools. The score that is obtained on this test may carry predictive value for general intelligence of a student.

There are three main dimensions to the unbalanced data that are used in the analyses in this paper. First, there is the $i$ dimension, i.e. an individual student. Second, there is a $j$ dimension for a specific subject. Third, there is a time dimension representing the moment in time within a year that the test is taken or absence is reported.

There are two main categories of data that are considered in this study. First, there are characteristics like gender, relative age, socioeconomic status, learning disabilities and first/second home language for the student, and subject and year for courses. These characteristics are assumed not to vary during an academic year. Second, there are variables that are generated over time, which include test grades and absences.

Grades

The data used in this paper contains grades that are on a 1-10 scale. Although easy to interpret, some difficulties arise when using these grades for modeling.

First, there are clear peaks at integer grades and grades on a .5 scale. This is due to teachers grading on an integer or .5 point scale instead of using the continuous possibilities. This becomes less of a problem with average grades, as they are eventually rounded but fairly continuous during the year.

Second, the student performance is double censored, in the sense that on a 1-10 scale one can score a maximum of 10 and minimum of 1. When predicting the precise end of year grade this might become a problem, as predicting grades below 1 or above 10 should be impossible. However, both grades should have some positive probability, as some students do achieve average grades of 10 for specific courses during a year.

Preliminary Analyses

The theory section has discussed some of the predictors towards high school performance. These predictors are considered in this section, modeling observed characteristics for courses and students in simple linear regressions. This preliminary analysis will show the need for a hierarchical approach, as the resulting predictions are of little use. The insights of this section on the observed characteristics will be used towards prediction in the next section.

The effects of the observed variables on the high school performance in the Dutch High School are reported in Table 3. The first column of Table 3 shows that girls clearly outperform boys, which is consistent with the literature in different settings [7]. Furthermore, the higher the weight of a test, the lower the grades that the students achieve. The grades are also lower towards the end of an academic year, as indicated by the significant negative effect of the % year variable. As might be expected, disabled students obtain lower grades in general. Interestingly, students living in Amsterdam obtain slightly lower grades than students that live outside of Amsterdam. Remarkably, SES does seem to have a significant effect within the school. Students with very high SES rankings, i.e. students from very wealthy families, seem to achieve lower grades than their slightly poorer counterparts (Figure 1). This effect quickly diminishes; there is almost no difference between the middle and lower class SES rankings within this school. This possibly surprising effect is in line with recent news reports of growing inequality in high school advice in the Netherlands [8].

The students that were admitted on the basis of the privilege arrangement receive lower grades than non privilege students, and students with highly educated parents perform significantly better. Interestingly, students that are not born in the Netherlands actually perform better, although if they do not have Dutch as their first language this effect vanishes. There is a big difference between grades for different course subjects, with a maximum difference of about 1.8 points in expected grades and t-values ranging from 0 to 36, with 33 of the 36 subjects significant.

[7] See Rahafar, A., Maghsudloo, M., Farhangnia, S., Vollmer, C., & Randler, C. (2016), Deary, Strand, Smith & Fernandes (2007) and Abott, Hill, Catalano & Hawkins (2000) for examples of gender gap findings in academic achievement.

[8] Multiple newspapers reported the growing inequality between the teachers’ advice for young students and their actual IQ, see for example Kuiper (2016). The reporting is based on a study by the Dutch Inspectorate of Education (2016) on the state of education in the Netherlands. They find that children from families with high income and education levels receive top level high school advice a lot more often than their lower income peers, even if those children have the same IQ scores.

Figure 1: The effect of SES ranking on student grades

Table 3: OLS regressions of individual grades

Dependent variable: Grades

Variable            | (1) Baseline             | (2) + Teachers           | (3) + Absence cumulative | (4) + Absence recent
CITO score          | 0.087*** (0.0010)        | 0.085*** (0.0010)        | 0.084*** (0.0010)        | 0.086*** (0.0010)
log(SES)            | -0.000032*** (0.0000038) | -0.000029*** (0.0000039) | -0.000019*** (0.0000038) | -0.000023*** (0.0000038)
SES                 | 0.023*** (0.0027)        | 0.020*** (0.0028)        | 0.0084** (0.0027)        | 0.013*** (0.0027)
Female              | 0.44*** (0.0046)         | 0.46*** (0.0047)         | 0.39*** (0.0046)         | 0.40*** (0.0046)
% year              | -0.17*** (0.0079)        | -0.18*** (0.0082)        | 0.30*** (0.0090)         | -0.20*** (0.0079)
Weight              | -0.049*** (0.0012)       | -0.049*** (0.0012)       | -0.050*** (0.0011)       | -0.050*** (0.0011)
Disability          | -0.15*** (0.0066)        | -0.14*** (0.0070)        | -0.091*** (0.0066)       | -0.11*** (0.0065)
Born in Amsterdam   | -0.060*** (0.0080)       | -0.064*** (0.0083)       | -0.058*** (0.0079)       | -0.061*** (0.0079)
Privilege           | -0.063*** (0.0059)       | -0.069*** (0.0061)       | -0.014* (0.0058)         | -0.034*** (0.0058)
HighEd parents      | 0.20*** (0.0067)         | 0.20*** (0.0069)         | 0.14*** (0.0067)         | 0.16*** (0.0067)
Not born in NL      | 0.091*** (0.0090)        | 0.085*** (0.0093)        | 0.094*** (0.0088)        | 0.088*** (0.0089)
NL not 1st language | -0.12*** (0.015)         | -0.12*** (0.015)         | -0.16*** (0.015)         | -0.14*** (0.015)
Year 2              | -0.24*** (0.0071)        | -0.21*** (0.0075)        | -0.16*** (0.0071)        | -0.19*** (0.0071)
Year 3              | -0.47*** (0.0072)        | -0.42*** (0.0077)        | -0.32*** (0.0073)        | -0.38*** (0.0073)
Year 4              | -0.52*** (0.0083)        | -0.49*** (0.0090)        | -0.38*** (0.0085)        | -0.43*** (0.0084)
Year 5              | -0.53*** (0.0091)        | -0.50*** (0.0099)        | -0.35*** (0.0097)        | -0.41*** (0.0094)
Courses             | 91% Significant          | 77% Significant          | 86% Significant          | 86% Significant
Teachers            | -                        | 72% Significant          | -                        | -
Behavior cum        |                          |                          | -0.048*** (0.0012)       |
Cutting class cum   |                          |                          | -0.021*** (0.00065)      |
Tardiness cum       |                          |                          | -0.037*** (0.00079)      |
Sickness cum        |                          |                          | -0.0064*** (0.00012)     |
Medical cum         |                          |                          | 0.0028*** (0.00068)      |
Home cum            |                          |                          | 0.0036*** (0.00028)      |
Various cum         |                          |                          | 0.0041*** (0.00032)      |
Transport cum       |                          |                          | -0.021 (0.011)           |
Behavior recent     |                          |                          |                          | -0.16*** (0.0045)
Ditching recent     |                          |                          |                          | -0.099*** (0.0025)
Tardiness recent    |                          |                          |                          | -0.15*** (0.0029)
Sickness recent     |                          |                          |                          | -0.017*** (0.00041)
Medical recent      |                          |                          |                          | 0.0026 (0.0020)
Home recent         |                          |                          |                          | 0.0083*** (0.00073)
Various recent      |                          |                          |                          | 0.0032*** (0.00083)
Transport recent    |                          |                          |                          | -0.081** (0.025)
Constant            | -40.9*** (0.56)          | -38.8*** (0.58)          | -38.9*** (0.55)          | -39.6*** (0.55)
N                   | 666095                   | 609341                   | 666095                   | 666095
R2                  | 0.0582                   | 0.0735                   | 0.0802                   | 0.0733

Robust standard errors in parentheses

Figure 2: The $R^2$ values across time

The tests the students take are graded by the various teachers in the school. The dataset contains the names of these teachers, thus allowing for analysis of differences in teachers’ grading. As considered by Rivkin, Hanushek & Kain (2005), teachers can have a very large impact on student performance. There are significant differences in the teachers’ grading, even conditional on the course subjects, as demonstrated by the second column in Table 3. Multicollinearity is introduced by including both courses and teachers, as most teachers are directly linked to one or two courses in the school. Nevertheless, differences between teachers and courses remain highly significant for 77% of the courses and 72% of teachers.

Temporal effects

Considering the temporal effects on educational performance is less straightforward. The main two time dependent factors considered in this analysis are the continuous test grades accumulated and the absences recorded by the school’s staff. There is an obvious connection between the test grades and the average end of year grade. The cumulative weighted average is constantly calculated throughout the year. Considering the average grade of a student for a particular course after 10% of the year has passed, simple regression shows that combined with the control variables, it explains about 41.3% [9] of the variance in the actual end of year grades. A lot of uncertainty remains, but this uncertainty decreases over time. As students gain more and more grades, the variation between the current cumulative weighted average and their end of year grades decreases.

Considering the same regression after 20% of the year has passed, the variables and current average grade explain 56% of the variance in end of year grade. After 40% of the year, this number has gone up to 76% and at 80% of the year, 94% of variance has been captured by the grades that have been accumulated combined with the fixed student variables. Figure 2 shows the $R^2$ progression as time increases throughout the year.

The effect of absences throughout the year on the educational performance is investigated in a similar manner. Absences in the school are recorded into eight main categories, which are named according to the categories in Table 4. First, considering the effect of the total amount of accumulated absences at any particular point in time on the test scores that the student receives at that time, column 3 of Table 3 shows a simple regression of test scores on absences and their respective categories. The largest effects are recorded by the behavior, ditching class and tardiness categories. All three of these categories have a negative effect on test scores and are highly significant. Sickness has a smaller, but highly significant negative impact on student test scores. The transport category has a strong negative effect, but is not significant. The fourth column of Table 3 shows the results of the same regression using only recent absences instead of cumulative absences. The effects of behavior, ditching class, tardiness and sickness have increased negative impact, as can be expected, and transport has a lower significance as well as the other categories, which remain at less significant and smaller levels.

Table 4: List of Absence categories

Categories          | Reasons included
Behavioral absences | Sent from class, temporarily suspended
Cutting class       | Cutting class, missed tests, unauthorised absences
Tardiness           | Late, overslept
Sickness            | Sick, school doctor, went home sick
Medical             | Medical reason, Specialist, Dentist, General practitioner
Home situation      | Domestic circumstances, prohibited holiday leave, special events, family circumstances, authorised/unauthorised leave
Transport           | Transport, vehicle breakdowns
Other               | Other, unknown, accepted, study advice, unclear, no reason, test, missed test

These results are upheld by fixed and random effects panel data specifications, as can be seen in Table 5. The estimations are corrected for heterogeneity between students and courses, as the panel dimension used is a combination of the student, course and year. In these panel data model specifications, it becomes clear that behavioral, ditching, tardiness, sickness, medical and home related absences have a negative impact on the students’ grades. These effects make intuitive sense: missing classes for any reason should not affect grades positively. The results for transport and various absences are therefore quite surprising, as the first two specifications in Table 5 suggest a positive relationship. However, once control variables are introduced in the third specification of Table 5, transport loses some of its significance and magnitude.
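The idea behind a fixed effects specification can be sketched with the within transformation: absences and grades are demeaned inside each panel unit before the slope is estimated, so that level differences between units cannot drive the result. The example below is a minimal sketch on a hand-made, noise-free panel with a single regressor. The panel labels, the numbers and the helper names `ols_slope` and `within_slope` are hypothetical, and the sketch does not reproduce the estimation behind Table 5.

```python
def ols_slope(x, y):
    """Slope of a simple regression of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def within_slope(groups):
    """Fixed effects (within) slope: demean x and y inside each group first."""
    x_dm, y_dm = [], []
    for x, y in groups.values():
        mx, my = sum(x) / len(x), sum(y) / len(y)
        x_dm += [a - mx for a in x]
        y_dm += [b - my for b in y]
    return sum(a * b for a, b in zip(x_dm, y_dm)) / sum(a * a for a in x_dm)

# Hypothetical panel: (cumulative absences, cumulative average grade) per unit.
# Both units lose 0.05 grade points per absence, but start at different levels.
panel = {
    "student A, course 1": ([0, 2, 4], [7.50, 7.40, 7.30]),
    "student B, course 1": ([1, 3, 8], [5.95, 5.85, 5.60]),
}

pooled_x = [a for x, _ in panel.values() for a in x]
pooled_y = [b for _, y in panel.values() for b in y]

pooled = ols_slope(pooled_x, pooled_y)   # mixes level differences into the slope
fixed = within_slope(panel)              # uses within-unit variation only

print(f"pooled OLS slope:    {pooled:.4f}")
print(f"fixed effects slope: {fixed:.4f}")
```

In this constructed example the pooled slope is more negative than the within slope, because the unit with more absences also has a lower grade level.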

The simple OLS regression models of Table 3 can be used to predict student grades. Why this is not sufficient becomes clear when considering Table 6. This table reports the predicted grades and the actual end of year grades for a ’simple’ model specification, in which all observed variables for courses and students, and all interactions between these variables, are incorporated in a linear regression model. These estimations do not deliver adequate pooling for useful grade prediction: Table 6 shows that the model predicts almost exclusively (86%) 7’s. This gravitation towards the mean calls for a different modeling approach that adequately pools the student and course information to make useful predictions. This is where the hierarchical model shows its strength. The next section discusses the predictive power of the proposed hierarchical model and shows that it delivers much more valuable predictions of individual student grades.


Table 5: Absences fixed effects, random effects regressions
Dependent variable: cumulative average grade

| | (1) Fixed effects | (2) Random effects | (3) RE + control |
|---|---|---|---|
| Behavior | -0.012∗∗∗ (0.00064) | -0.013∗∗∗ (0.00063) | -0.013∗∗∗ (0.0015) |
| Ditching | -0.0026∗∗∗ (0.00027) | -0.0042∗∗∗ (0.00027) | -0.0036∗∗∗ (0.00070) |
| Tardiness | -0.0055∗∗∗ (0.00043) | -0.0080∗∗∗ (0.00042) | -0.0073∗∗∗ (0.0011) |
| Sickness | -0.0012∗∗∗ (0.000065) | -0.0014∗∗∗ (0.000063) | -0.0014∗∗∗ (0.00014) |
| Medical | -0.0055∗∗∗ (0.00037) | -0.0038∗∗∗ (0.00037) | -0.0036∗∗∗ (0.00083) |
| Home | -0.00079∗∗∗ (0.00017) | -0.00027 (0.00016) | -0.00014 (0.00027) |
| Various | 0.0012∗∗∗ (0.00016) | 0.0014∗∗∗ (0.00016) | 0.0017∗∗∗ (0.00032) |
| Transport | 0.043∗∗∗ (0.0068) | 0.037∗∗∗ (0.0067) | 0.037∗∗ (0.013) |
| Control var. | NO | NO | YES |
| Constant | 7.04∗∗∗ (0.0012) | 7.08∗∗∗ (0.0050) | 6.87∗∗∗ (0.086) |
| N | 666341 | 666341 | 666341 |

Robust standard errors in parentheses
∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001

Table 6: Predictions year 2014/2015 no heterogeneity

| Predicted \ Actual | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Total |
|---|---|---|---|---|---|---|---|---|
| 6 | 0 (0.00) | 1 (0.07) | 3 (0.22) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 0 (0.00) | 4 (0.30) |
| 7 | 13 (0.97) | 51 (3.81) | 285 (21.27) | 365 (27.24) | 309 (23.06) | 118 (8.81) | 7 (0.52) | 1148 (85.67) |
| 8 | 0 (0.00) | 2 (0.15) | 21 (1.57) | 48 (3.58) | 78 (5.82) | 35 (2.61) | 4 (0.30) | 188 (14.03) |
| Total | 13 (0.97) | 54 (4.03) | 309 (23.06) | 413 (30.82) | 387 (28.88) | 153 (11.42) | 11 (0.82) | 1340 (100.00) |

Count (percentage)
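The gravitation towards the mean in Table 6 can be quantified directly from its counts. The sketch below copies the counts from the table and computes three summary figures: the share of predictions equal to 7, the share of predictions that match the actual grade exactly, and the number of actual grades of 5 or lower that were also predicted at 5 or lower. Using a grade of 5 or lower as the cut-off for the low end is an assumption made for this illustration only.

```python
# Counts copied from Table 6: keys are predicted grades, lists follow actual grades.
actual_grades = [4, 5, 6, 7, 8, 9, 10]
counts = {
    6: [0, 1, 3, 0, 0, 0, 0],
    7: [13, 51, 285, 365, 309, 118, 7],
    8: [0, 2, 21, 48, 78, 35, 4],
}

total = sum(sum(row) for row in counts.values())

# Share of all predictions that equal 7.
share_predicted_7 = sum(counts[7]) / total

# Predictions that match the actual grade exactly (the diagonal of the table).
exact_hits = sum(row[actual_grades.index(pred)] for pred, row in counts.items())
exact_share = exact_hits / total

# Actual grades of 5 or lower, and how many of those were predicted at 5 or lower.
low_columns = [i for i, g in enumerate(actual_grades) if g <= 5]
low_actual = sum(row[i] for row in counts.values() for i in low_columns)
low_predicted = sum(
    row[i] for pred, row in counts.items() if pred <= 5 for i in low_columns
)

print(f"predicted 7:   {share_predicted_7:.1%}")
print(f"exact matches: {exact_share:.1%}")
print(f"actual <= 5:   {low_actual}, of which predicted <= 5: {low_predicted}")
```

Only 446 of the 1340 predictions (about 33%) match the actual grade, and none of the 67 actual grades of 4 or 5 is predicted below 6, because the simple model never predicts a grade lower than 6.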


PREDICTION

This section applies the proposed hierarchical Bayesian modeling specification to the Dutch high school dataset. The first subsection tackles inter-year predictions: predicting end of year grades at the very beginning of the year (i.e. before the students have accumulated any grades). The second subsection discusses the intra-year predictions, which predict end of year grades at different points in time throughout the year.

Inter-year

The proposed inter-year hierarchical model of equation (7) is tested using the data for the first three grades from all but the last available year. This allows for prediction of the first, second and third grades’ end of year course averages for the final year (2014/2015) in the data. The predictions for the first grade end of year averages in 2014/2015 use the parameter estimates based on first graders in previous years. The predictions of the second and third grade averages use the estimates for previous second and third graders, but they can furthermore use the student specific observed and unobserved heterogeneity in the prediction.

To clarify the model in terms of the high school application in this section, equation (7) is translated into the variables that were available and used in the model. The modeling approach for student i, course j of equation (7) in this setting can be summarized as:

$$
g_{ij} = \beta_1 \text{Year}_j + \beta_2 \text{Subject}_j + \beta_3 \text{Demographics}_i + \beta_4 \text{Interactions}_{ij} + \text{Student heterogeneity}_{ij} + \text{Course heterogeneity}_{ij} + e_{ij} \tag{9}
$$

for $i = 1$ to $I$ and $j \in M_i$. $\text{Year}_j$ and $\text{Subject}_j$ represent the course specific variables $x^c_j$, $\text{Demographics}_i$ contains the student observed variables $x^s_i$, and the $\text{Interactions}_{ij}$ term represents the interactions between the course and student specific observed variables. These three groups of terms represent $x^{sc}_{ij}\beta$ in equation (7). The $\text{Student heterogeneity}_{ij}$ and $\text{Course heterogeneity}_{ij}$ terms contain the unobserved student and course differences, interacted with the course and student variables respectively. These heterogeneities represent the $x^{c\prime}_j \phi^s_i$ and $x^{s\prime}_i \phi^c_j$ terms of equation (7). The error term $e_{ij}$ is the same error term as in equation (7), with $e_{ij} \sim N(0, \sigma^2)$.
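How the pieces of this specification add up for a single student-course pair can be sketched numerically. The example below assumes the additive form described above: a common part, a student heterogeneity term multiplied by the course variables, a course heterogeneity term multiplied by the student variables, and a normal error. All vectors, coefficient values and the inclusion of an intercept in each variable vector are made up for illustration; the function name `expected_grade` is hypothetical.

```python
import random

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def expected_grade(x_c, x_s, x_sc, beta, phi_s, phi_c):
    """Mean of g_ij: common part plus the two heterogeneity terms.

    x_c   -- course variables of course j
    x_s   -- observed student variables of student i
    x_sc  -- stacked observed variables entering the common part
    beta  -- common coefficients
    phi_s -- student i's own deviations, multiplied by the course variables
    phi_c -- course j's own deviations, multiplied by the student variables
    """
    return dot(x_sc, beta) + dot(x_c, phi_s) + dot(x_s, phi_c)

# All numbers below are invented for illustration.
x_c = [1.0, 1.0]          # intercept, course is taught in the second year
x_s = [1.0, 0.0]          # intercept, hypothetical demographic indicator
x_sc = [1.0, 1.0, 0.0]    # intercept, year indicator, demographic indicator
beta = [6.8, -0.2, 0.1]
phi_s = [0.5, -0.1]       # student scores higher overall, slightly less in year two
phi_c = [-0.3, 0.2]       # course is graded lower overall

mean_ij = expected_grade(x_c, x_s, x_sc, beta, phi_s, phi_c)

random.seed(1)
sigma = 0.5               # assumed error standard deviation
g_ij = mean_ij + random.gauss(0.0, sigma)   # one draw with e_ij ~ N(0, sigma^2)

print(f"expected grade: {mean_ij:.2f}, simulated grade: {g_ij:.2f}")
```

The student and course deviations shift the expected grade away from the common part, which is what allows predictions to move away from the overall mean.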

$\text{Year}_j$ contains indicator variables for the different years that the courses fall in. The same subjects are taught across different years, and these variables capture the differences between, for example, a history class in first grade and one in third grade. The effect of the $\beta_1 \text{Year}_j$ term is modeled as:

$$
\beta_1 \text{Year}_j = \sum_{h} \beta_h \text{Year}_{jh} \tag{10}
$$

for course $j$ and $h \in H$, with $H$ different years considered. $\text{Year}_{jh} = 1$ indicates that course $j$ falls in year $h$.
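The indicator coding of the year effect can be sketched in a few lines. Because exactly one indicator equals one for each course, the sum over years simply selects the coefficient of the year the course falls in. The coefficient values in `beta_year` and the function name `year_indicators` are hypothetical; the three years correspond to the first three grades used in this section.

```python
def year_indicators(year, years):
    """Return the 0/1 vector of year indicators for one course."""
    return [1 if year == h else 0 for h in years]

years = [1, 2, 3]                          # the years considered
beta_year = {1: 0.0, 2: -0.15, 3: -0.30}   # hypothetical coefficients, one per year

course_year = 3                            # e.g. a history class in third grade
indicators = year_indicators(course_year, years)

# Sum over years of coefficient times indicator.
year_effect = sum(beta_year[h] * d for h, d in zip(years, indicators))

print(indicators, year_effect)
```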

$\text{Subject}_j$ contains indicator variables for the different subjects that courses belong to (i.e.