
Identifying demographic variables that affect student drop-out rates, using logit and duration models


Academic year: 2021


Master thesis, specialisation Marketing Intelligence University of Groningen, Faculty of Economics and Business

Words: 9116

Floris Hegger s1377078

Faculty of Economics and Business Master thesis

June 25th, 2017

Address: Kamerlingh Onnesstraat 112, 9727 HP Groningen, the Netherlands

Phone: +31647004388

Email: florishegger@gmail.com


Management summary


Preface

I have worked at Media & Entertainment Management since November 2006. The education grew rapidly in terms of student applications, but the number of dropouts has always been relatively high. This has always bothered me, especially since I could not pinpoint the reason why this was the case. One of the compulsory courses in the Marketing Intelligence track sparked my interest in this topic yet again, which led to the application of two types of models that we learnt during that course: a logit model and a duration model. Applying both of these models to existing data of the education I work for seemed like a challenging way to find out whether possible dropouts can be identified at an early stage. I hope that you will enjoy reading this thesis as much as I enjoyed writing it. Special thanks go to my girlfriend, who has always been there for me, especially at times when I struggled; her support was unconditional and I am forever grateful. I also thank my employer, who made it possible for me to start this education in the first place and allowed me to be flexible in taking time off so I could attend classes and finish this thesis in time; my parents, who have always believed in me and were confident that I would finish this Master’s program; and my supervisor Keyvan Dehmamy, who was ever helpful in pointing me in the right direction, assisting me in doing the right analysis, doing the analysis right, and displaying the results accordingly. Without you, I would not have been able to do this, so this is a big thank you to all of you!


Table of contents

1 Introduction
1.1 Students in the Netherlands
1.2 Literature
1.3 Hypotheses
1.4 Conceptual model
1.5 Structure of the thesis
2 Methods and data
2.1 Data
2.2 Logit model
2.3 Duration model
2.4 Assumptions
3 Results
3.1 Sample description
3.2 Logit model
3.3 Cox Proportional-hazards model
4 Conclusion
5 Recommendations and managerial implications
6 Limitations and future research
7 References


List of tables

Table 1 – Overview of existing literature on age, gender and previous education
Table 2 – List of explanatory variables for each data set
Table 3 – Demographic characteristics of the five different cohorts and the total sample
Table 4 – Logit models, all students: odds ratios and p-values
Table 5 – Logit models, all students: model comparison
Table 6 – Logit models, all students: model fit
Table 7 – Logit models, HAVO & VWO students: odds ratios and p-values
Table 8 – Logit models, HAVO & VWO students: model comparison
Table 9 – Logit models, HAVO & VWO students: model fit
Table 10 – Hazard models, all students: hazard ratios and p-values
Table 11 – Hazard models, all students: model comparison
Table 12 – Hazard models, all students: model fit
Table 13 – Hazard models, HAVO & VWO students: hazard ratios and p-values
Table 14 – Hazard models, HAVO & VWO students: model comparison
Table 15 – Hazard models, HAVO & VWO students: model fit
Table 16 – Overview of (in)significant parameters in both the logit and duration model
Table 17 – Overview of rejection or acceptance of hypotheses
Table 18 – High school attendance in urbanised cities

List of figures

Figure 1 – Overview of M&EM enrolments, dropouts, and dropout percentage
Figure 2 – Conceptual model
Figure 3 – Logit model, all students: TDL & Gini coefficient
Figure 4 – Logit model, HAVO & VWO students: TDL & Gini coefficient
Figure 5 – Survival function of five different nationality groups


1 Introduction

Every academic year, universities are faced with dropouts: students who leave the academic institution prematurely, that is, without having obtained an official degree. This situation can be viewed from two angles: the student’s perspective and the academic institution’s perspective. Students may use departure, withdrawal, (academic) failure or non-completion as synonyms to describe that they are discontinuing their studies at that particular education. Similarly, institutions often refer to the student attrition rate. Students who switch from one study to another are technically referrals, but the education itself often does not know what happens to a student after (s)he has deregistered. Therefore, these students will also be considered dropouts for the remainder of this paper.

Dropouts can be divided into voluntary and involuntary dropouts. Students who voluntarily abandon their studies do so because, for example, they are no longer motivated or lack the necessary study skills. On the other hand, students may be forced to quit their studies because they did not meet the university’s standards. In the Netherlands, students have to obtain a certain number of credits (ECTS) in their first year in order to be allowed into the second year. Failing to do so will result in the student being deregistered. Students who discontinue their studies due to financial problems are treated as voluntary dropouts here, since their departure was not imposed by the university.

1.1 Students in the Netherlands

The number of students starting a study in the Netherlands has been growing for the past 10 years: from 562,728 in 2005 to 703,743 in 2015, an increase of 25.1% (CBS statline, 2017). Both Dutch and foreign students show an increase in applications; in fact, the year 2016 showed the highest number of international students studying in the Netherlands: over 112,000 (Nuffic, 2017). However, not all of these students graduate, for various reasons: lack of motivation, difficulties with the educational system, personal circumstances, and insufficient study support, to name the most important (Wartenberg & van den Broek, 2008). Since most students quit their studies in their first year (Vereniging Hogescholen, 2016), it is most beneficial to identify these students early in the first year of their studies, so that programs targeting these students can be developed in order to minimise the number of dropouts.

The Dutch system of higher education is divided into two major branches: universities and universities of applied sciences [hereafter named HBO], a division that is uncommon in most other countries. More than 60% of the students are registered at a university of applied sciences, which is also the focus of this thesis. The dropout rate across all universities of applied sciences in the Netherlands has been relatively constant over the past years at 15-16% (Vereniging Hogescholen, 2016), but not every education is that consistent. For this thesis, data of one education in the Netherlands will be analysed: Media & Entertainment Management [M&EM] from Stenden University of Applied Sciences. Stenden is an HBO institution situated in Leeuwarden that has around 2500 to 3000 applications every year and offers over 20 different educations (such as Hotel Management, Media & Entertainment Management, Leisure Management, and Tourism Management). It mainly attracts students from the northern part of the Netherlands (i.e. Friesland, Groningen, and Drenthe), but international students also make up a fair share of its enrolments. The share of international students choosing to study M&EM has grown over the past years to nearly 50% in 2016. However, the dropout rate of M&EM is relatively high, especially compared to the Dutch average. Figure 1 shows the overview of the dropout rate at M&EM, which is above 20% every year and sometimes even over 30%. The dropout rate fluctuates quite a bit over time, but is consistently higher than the average (15-16%) in the Netherlands.

Figure 1 – Overview of M&EM enrolments, dropouts, and dropout percentage


1.2 Literature

Universities have been around for centuries, and over the years researchers have tried to identify variables that predict academic performance. To name a few, previous education (McKenzie & Schweitzer, 2001; Sladek et al., 2016), study skills (Cone & Owens, 1991), university satisfaction (Wince & Bordon, 1995), and class attendance (Burrus & Roberts, 2012; Calderon et al., 2009; Franklin & Trouard, 2014) have all been found to affect academic performance. In addition to these (social and cognitive) variables, demographic variables have also been found to affect academic performance (Amuda, Bulus & Joseph, 2016; Burrus & Roberts, 2012; Calderon et al., 2009; Franklin & Trouard, 2014; Naderi et al., 2009). However, there is no clear conclusion about how they affect performance. Age, for example, has been found to be not significant (Amuda, Bulus & Joseph, 2016; Ebenuwa-Okoh, 2010) or to have a negative effect: older students are more likely to drop out (Burrus & Roberts, 2012; Calderon et al., 2009; Franklin & Trouard, 2014). The same holds for gender: some sources (Naderi et al., 2009; Ebenuwa-Okoh, 2010) indicate that gender is not a significant predictor of academic performance, while others claim that male students are more likely to drop out early (Burrus & Roberts, 2012; Franklin & Trouard, 2014).

Table 1 summarises the conclusions of several papers for three variables: age, gender, and previous education. The latter should be interpreted as ‘the average grade on the previous education’, in the United States also known as ‘grade point average’ [GPA]. There is no universal system for expressing the average on the previous education, and since this paper focuses mainly on students in the Netherlands, the average that is common in the Netherlands was used (based on a scale from 1 (very bad) to 10 (very good), where 5.5 is the so-called pass mark). Moreover, grades from foreign students were unfortunately not available to the researcher, which made the decision to use only Dutch grades obvious.

Most of the studies mentioned in Table 1 use the term academic performance, which is not necessarily equivalent to attrition rate. Students dropping out of the study could be a result of bad academic performance, but it may also be the result of other factors, such as a bad financial situation or health-related problems. A student might have good grades, but could be forced to quit the studies because his or her financial situation no longer allows continued study.


| Study | Age | Gender | Previous education |
|---|---|---|---|
| McKenzie & Schweitzer (2001) | | | Lower grades more likely to drop out |
| Amuda, Bulus & Joseph (2016) | Not significant | | |
| Burrus & Roberts (2012) | Older more likely to drop out | Male students more likely to drop out | |
| Calderon et al. (2009) | Older more likely to drop out | | |
| Franklin & Trouard (2014) | Older more likely to drop out | Male students more likely to drop out | |
| Naderi et al. (2009) | | Not significant | |
| Sladek et al. (2016) | | | Lower grades more likely to drop out |
| Ebenuwa-Okoh (2010) | Not significant | Not significant | |

Table 1 – Overview of existing literature on age, gender and previous education

1.3 Hypotheses

Even though it is not clear whether age and gender have an effect on academic performance, they were included in the hypotheses based on the outcomes of the table above. Grades from the previous education have a clearer direction as the found sources are unanimous in the effect. The following hypotheses have been formulated:

H1A: Older students are more likely to drop out of the studies during the first year compared to younger students.

H1B: Older students are more likely to drop out of the studies sooner during the first year compared to younger students.

H2A: Male students are more likely to drop out of the studies during the first year compared to female students.

H2B: Male students are more likely to drop out of the studies sooner during the first year compared to female students.

H3A: Dutch students who have a lower average from their previous education (HAVO/VWO) are more likely to drop out of the studies during the first year compared to Dutch students with a higher average.


1.4 Conceptual model

1.5 Structure of the thesis

The next chapter explains why the logit and hazard models are used for analysing the data and how the data was obtained. Logit models are typically used for binary dependent variables, whereas hazard models are used when the dependent variable is a duration variable. Chapter three will then show the outcomes of the analysis of both models, in which a distinction has been made between all students and Dutch (HAVO & VWO) students only. The main reason for this was the absence of grades of the previous education from international students. The fourth chapter will present the conclusion in which the objective will be discussed, while the fifth chapter presents recommendations and managerial implications. The last chapter will then discuss the limitations of the chosen approach as well as give indications for further research.

[Figure 2 – Conceptual model: gender, age, and previous education as explanatory variables for dropping out of the studies (Y/N and when)]


2 Methods and data

In order to analyse whether it is possible to predict whether a student quits the studies during the first year or not, a binary choice model must be applied. Binary choice models are used when the dependent variable is binary (Leeflang, Wieringa, Bijmolt & Pauwels, 2015); in the case of this thesis: does a student quit (Y) or not (N)? Since this is not the only question to be answered here, another model is applied too: the duration model (also named ‘hazard model’). This model is applied when the dependent variable is time dependent; in the case of this thesis: when does a student quit the studies? Since variables may have different explanatory power (e.g. gender may be a good predictor of whether a student quits but not of when a student quits), both models are estimated and interpreted. This also means that the hypotheses formulated in the introduction will be tested for both models.

2.1 Data

The data for this thesis was obtained from Stenden Hogeschool Leeuwarden, in particular for the education ‘Media & Entertainment Management’ [M&EM]. This education was founded in September 2002 and had a slow but steady increase in enrolments, until 2009 (see Figure 1, page 8). After that, the number of enrolments dropped a little bit, but remained quite steady at roughly 200-250 new students each academic year. In total, the dataset contains 1331 unique students, divided over 5 different academic years. The dataset was adjusted so that for each row there is a binary variable indicating whether the student quit the studies in year 1, and a second column stating the month the student deregistered. For example, if a student started studying in September 2014 and deregistered in February 2015, the indicated month will be 6 in this case. Since many students do not quit during the first year, there are many right-censored observations. Since M&EM has a binding study advice at the end of year one – at least 51 credits need to be obtained, otherwise students won’t be allowed to start the 2nd year – the dropout will be particularly high at the end of each academic year. Very few students quit the studies after their first year, so it was not beneficial to predict dropouts after the foundation year.

M&EM is an education that can be followed in either Dutch or English. All international students follow the English stream, whereas most (but not all) Dutch students follow the Dutch stream. For the Dutch students whose prior education was either HAVO or VWO, grades from their previous education were available. The grades from all subjects were averaged, so that a single grade could be used as explanatory variable. Unfortunately, these data were not available for the other Dutch students (e.g. those coming from vocational education) or for the international students. The dataset was therefore split into two: one containing only Dutch students whose prior education was HAVO or VWO (N=608) and one containing all students (N=1331). Furthermore, for the dataset containing only Dutch students, data was added about the location (city and province) of their previous education. CBS data was used to link the city and province to the school, based on the ZIP-code that was available. Finally, the city of the high school was linked to the degree of urbanisation (according to CBS data). For example, the city of Groningen is very urbanised, so high schools from this town were labelled as urbanised, whereas a high school from the town of Haren (a small town about 7 kilometres from Groningen) was labelled as non-urbanised.


| Explanatory variable | Abbreviation | Value labels | All students (Dutch + International) | Only Dutch students (HAVO + VWO) |
|---|---|---|---|---|
| Age | age | Continuous; ranging from 16 to 29 | x | x |
| Gender | gender | 0 = Male; 1 = Female | x | x |
| Starting month | start | 0 = February; 1 = September | x | x |
| Stream | stream | 0 = International; 1 = Dutch | x | x |
| Application | apply | Continuous; ranging from 0 to 365 | x | x |
| Nationality | nat | 0 = German; 1 = Dutch; 2 = Other west-European; 3 = East-European; 4 = Other | x | |
| Previous education | previous | 0 = HAVO; 1 = VWO; 2 = MBO; 3 = Foreign diploma; 4 = Other | x | |
| Average previous education | average | Continuous; ranging from 1 to 10 | | x |
| Profile previous education | profile | 0 = E+M; 1 = C+M; 2 = E+M & C+M; 4 = N+G and/or N+T | | x |
| Province previous education | province | 0 = Friesland; 1 = Groningen; 2 = Drenthe; 3 = Overijssel; 4 = Other | | x |
| Urbanisation | urbanisation | 0 = Non-urbanised; 1 = Urbanised | | x |

Table 2 – List of explanatory variables for each data set

Since all Dutch students have the Dutch nationality, this variable was not included in the analysis for the HAVO & VWO students. There are 12 provinces in the Netherlands, but due to a lack of […] education, program previous education and province) were used as dummy variables, so that the analysis could be run properly. For nationality, ‘German’ was used as benchmark category, for previous education ‘HAVO’, for program previous education ‘E+M’, and ‘Groningen’ was used as the benchmark province. In Dutch high schools, there are four types of profiles students can choose from: E+M (economics & society), C+M (culture & society), N+T (nature & science), and N+G (nature & health). Some students choose two profiles rather than one. The nature profiles are mostly chosen by students who wish to study physics, chemistry or something healthcare related, so they are not often chosen by students who wish to study M&EM. Therefore, four categories remained.
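Dummy coding with a benchmark category, as described above, can be illustrated as follows. This is a minimal sketch: the helper `dummy_code` and the generated column names are hypothetical and are not taken from the thesis; only the category labels and the choice of ‘German’ as benchmark come from the text.

```python
def dummy_code(value, categories, benchmark, prefix):
    """Return one 0/1 indicator per non-benchmark category.

    The benchmark category gets no column of its own: an observation in
    the benchmark category has zeros on every indicator.
    """
    if benchmark not in categories or value not in categories:
        raise ValueError(f"unknown category: {value!r} / {benchmark!r}")
    return {f"{prefix}_{c}": int(c == value)
            for c in categories if c != benchmark}


NATIONALITIES = ["German", "Dutch", "Other west-European",
                 "East-European", "Other"]

# A Dutch student scores 1 on the 'Dutch' indicator only.
print(dummy_code("Dutch", NATIONALITIES, "German", "nat"))
# A German student (benchmark) scores 0 on all four indicators.
print(dummy_code("German", NATIONALITIES, "German", "nat"))
```

With five categories this yields four indicators, so each estimated parameter is interpreted relative to the benchmark group.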

2.2 Logit model

As the dependent variable in this model is binary (i.e. the student quits or continues studying), a logit or probit model must be used, according to the following format (Leeflang et al., 2015):

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, \ldots, N$$

where $Y_i$ is the dependent variable (either 0 or 1), $\beta_0$ is an unknown intercept, $\beta_1$ is an unknown slope parameter, $x_i$ is an independent variable, $\varepsilon_i$ is the (unobserved) value of the disturbance term, and $N$ is the number of observations.

The described literature from the previous chapter gives some indication of possible explanatory factors, but there are other variables (as outlined in Table 2) that could explain the dropout rates of students. To understand which variables predict students’ dropout, all variables from Table 2 will be entered into the model (1), after which the variables from the theory (gender, age, average from previous education) plus the significant variables will be kept in model (2), concluding with a model (3) that only contains significant variables. Since some variables are not applicable to the entire dataset, the analysis will be done twice (once for the entire dataset, and once for the Dutch students only), resulting in the following two formulas:

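All students:

$$
\begin{aligned}
\text{Quit}_i = {} & \beta_0 + \beta_1 \text{age}_i + \beta_2 \text{gender}_i + \beta_3 \text{start}_i + \beta_4 \text{stream}_i + \beta_5 \text{apply}_i + \beta_6 \text{nat}_{\text{Dutch},i} \\
& + \beta_7 \text{nat}_{\text{west},i} + \beta_8 \text{nat}_{\text{east},i} + \beta_9 \text{nat}_{\text{other},i} + \beta_{10} \text{previous}_{\text{VWO},i} + \beta_{11} \text{previous}_{\text{MBO},i} \\
& + \beta_{12} \text{previous}_{\text{foreign},i} + \beta_{13} \text{previous}_{\text{other},i} + \varepsilon_i, \quad i = 1, \ldots, N
\end{aligned}
$$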

Dutch (HAVO & VWO) only:

$$
\begin{aligned}
\text{Quit}_i = {} & \beta_0 + \beta_1 \text{age}_i + \beta_2 \text{gender}_i + \beta_3 \text{start}_i + \beta_4 \text{stream}_i + \beta_5 \text{apply}_i + \beta_6 \text{average}_i + \beta_7 \text{profile}_{\text{C+M},i} \\
& + \beta_8 \text{profile}_{\text{E+M \& C+M},i} + \beta_9 \text{profile}_{\text{N+T/N+G},i} + \beta_{10} \text{province}_{\text{Groningen},i} + \beta_{11} \text{province}_{\text{Drenthe},i} \\
& + \beta_{12} \text{province}_{\text{Overijssel},i} + \beta_{13} \text{province}_{\text{other},i} + \beta_{14} \text{urbanisation}_i + \varepsilon_i, \quad i = 1, \ldots, N
\end{aligned}
$$

To calculate the predicted probability that a student quits, the following formula has to be applied:

$$\pi_i = \frac{e^{X_i'\beta}}{1 + e^{X_i'\beta}}$$

where $\pi_i$ is the predicted probability of student i, $X_i$ is a vector of observations of the independent variables for student i, and $\beta$ is a vector of parameters (Leeflang et al., 2015). As mentioned above, both logit and probit models can be used for an analysis such as this. However, parameters in the probit model are more difficult to interpret (Leeflang et al., 2015), which is the main reason why the logit model is preferred over the probit model.
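The predicted probability formula can be made concrete with a short Python sketch. The coefficients used below are made-up numbers chosen purely for illustration, not estimates from this thesis, and the function name is hypothetical. The sketch also shows why logit parameters are comparatively easy to interpret: a one-unit increase in a variable multiplies the odds of quitting by the exponent of its parameter.

```python
import math


def predicted_probability(x, beta, intercept):
    """Logit probability pi = e^z / (1 + e^z), with z = intercept + x'beta."""
    z = intercept + sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))  # algebraically equal to e^z / (1 + e^z)


def odds(p):
    return p / (1.0 - p)


# Hypothetical parameters: intercept, then age and gender (illustration only).
intercept, beta = -2.0, [0.06, -0.16]

p_19 = predicted_probability([19, 1], beta, intercept)
p_20 = predicted_probability([20, 1], beta, intercept)

# One extra year of age multiplies the odds by exp(beta_age).
print(round(odds(p_20) / odds(p_19), 4), round(math.exp(0.06), 4))
```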

2.3 Duration model

A second model that will be used in this analysis is the Cox Proportional-hazards model (a type of duration model), which typically answers questions such as ‘when does a customer adopt our product?’ or ‘when does a customer churn?’ In the upcoming analysis the event ‘when will the student quit the studies?’ will be analysed. Since a student can quit the studies after the first 12 months of the studies, many observations will be right-censored (Leeflang et al., 2015). The base formula for a duration model is (Leeflang et al., 2015):

$$h_i(t) = h_0(t)\, e^{\beta_1 x_i}, \quad i = 1, \ldots, N$$

where $h_i(t)$ is the probability that student i quits at time t given that (s)he has not quit yet; $h_0(t)$ is the baseline hazard (the probability that a student quits at a certain time t); $x_i$ is an independent variable; and $\beta_1$ is an unknown parameter for the independent variable.
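A small Python sketch may help to see what the proportional-hazards form implies. The parameter value below is hypothetical and only serves the illustration; the function names are not from the thesis. Because the baseline hazard is shared by all students, it cancels when two students are compared, so the hazard ratio depends only on the parameters and the difference in covariates.

```python
import math


def relative_hazard(x, beta):
    """exp(x'beta): the factor by which the baseline hazard h0(t) is scaled."""
    return math.exp(sum(b * v for b, v in zip(beta, x)))


def hazard_ratio(x_a, x_b, beta):
    """Ratio h_a(t) / h_b(t); the baseline hazard h0(t) cancels out."""
    return relative_hazard(x_a, beta) / relative_hazard(x_b, beta)


# Hypothetical parameter for a single 0/1 covariate (illustration only).
beta = [math.log(2.0)]

# A student with x = 1 has twice the hazard of a student with x = 0,
# at every point in time t.
print(hazard_ratio([1], [0], beta))
```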

Since deregistration of students only takes place at the end of each month, the time variable in this case is discrete (taking values ranging from 1 to 12). Again, two models will be proposed, one for all students and one for Dutch (HAVO & VWO) only.

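All students:

$$
\begin{aligned}
h_i(t) = h_0(t) \exp\big( & \beta_1 \text{age}_i + \beta_2 \text{gender}_i + \beta_3 \text{start}_i + \beta_4 \text{stream}_i + \beta_5 \text{apply}_i + \beta_6 \text{nat}_{\text{Dutch},i} + \beta_7 \text{nat}_{\text{west},i} \\
& + \beta_8 \text{nat}_{\text{east},i} + \beta_9 \text{nat}_{\text{other},i} + \beta_{10} \text{previous}_{\text{VWO},i} + \beta_{11} \text{previous}_{\text{MBO},i} \\
& + \beta_{12} \text{previous}_{\text{foreign},i} + \beta_{13} \text{previous}_{\text{other},i} \big), \quad i = 1, \ldots, N
\end{aligned}
$$

Dutch (HAVO & VWO) only:

$$
\begin{aligned}
h_i(t) = h_0(t) \exp\big( & \beta_1 \text{age}_i + \beta_2 \text{gender}_i + \beta_3 \text{start}_i + \beta_4 \text{stream}_i + \beta_5 \text{apply}_i + \beta_6 \text{average}_i + \beta_7 \text{profile}_{\text{C+M},i} \\
& + \beta_8 \text{profile}_{\text{E+M \& C+M},i} + \beta_9 \text{profile}_{\text{N+T/N+G},i} + \beta_{10} \text{province}_{\text{Groningen},i} + \beta_{11} \text{province}_{\text{Drenthe},i} \\
& + \beta_{12} \text{province}_{\text{Overijssel},i} + \beta_{13} \text{province}_{\text{other},i} + \beta_{14} \text{urbanisation}_i \big), \quad i = 1, \ldots, N
\end{aligned}
$$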

The explanatory variables in this model do not change over time, so they are considered constant over time. Obviously, age does change over time, but in the case of this model, we were only […]


2.4 Assumptions

In contrast to most linear regression models, binary logistic regression models do not have many restrictions. Linear regression models have assumptions such as ‘data are normally distributed’, ‘equal variances are assumed’, and ‘error term is normally distributed’, to name a few (Malhotra, 2010). The analysis cannot be done without properly checking some aspects, however.

The first prerequisite for applying logistic regression is that the dependent variable is binary (Leeflang et al., 2015). This is true for this analysis, as the dependent variable (does the student quit?) can only take two values: yes or no.

A second assumption is that there are no outliers in the continuous explanatory variables (Tabachnick & Fidell, 2012). This can be checked by standardising these variables and removing observations with a z-score below -3.29 or above 3.29. For age, 14 outliers were found (27-29 years of age). It was decided to remove these students from the database, as starting to study at this age is rather rare and would not reflect the average student of M&EM well. The resulting datasets consist of 1331 students (all) and 608 students (only Dutch students). Four outliers were also found for the average from the previous education, but they were left in the analysis, as their z-scores were only slightly above the threshold and their averages (7.7 and 7.8) are possible for high school students in the Netherlands. Removing them would result in a loss of data.

A third assumption is that there should not be any multicollinearity between the explanatory variables. This can be checked by making a correlation matrix of all variables and looking for correlations below -0.9 or above 0.9 (Tabachnick & Fidell, 2012). There was a high correlation between nationality and previous education (0.93): foreign students always have a ‘foreign degree’. It was decided to remove the previous degree from the equation, as Dutch students always have either HAVO, VWO or MBO (these are impossible for foreign students). In other words, these variables are so highly correlated that one of them is enough. This means that the final formulas are also slightly adjusted (they can be found at the end of this section).

Finally, Agresti and Kateri (2011) argue that logistic regression models need larger sample sizes than OLS models. They advise at least 10 respondents per independent variable, but ideally a model has 30 respondents per independent variable. The models introduced in this paper have 14 variables (N = 608) and 9 variables (N = 1331), that is, roughly 43 and 148 observations per independent variable, and thereby meet the requirement of at least 30 observations per independent variable. In conclusion, all assumptions are met after some adjustments, so that the analyses can now be run properly.
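Two of the checks above, the z-score rule for outliers and the number of observations per independent variable, can be sketched in Python. The age values in the example are invented for illustration and are not the thesis data; the function names are hypothetical, and the use of the sample standard deviation is an assumption of this sketch.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.29):
    """Return the values whose standardised score lies beyond +/- threshold."""
    mu, sd = mean(values), stdev(values)  # sample standard deviation (assumption)
    return [v for v in values if abs((v - mu) / sd) > threshold]


def observations_per_variable(n_observations, n_variables):
    return n_observations / n_variables


# Invented ages: most students are 19 or 20, one student is 29.
ages = [19] * 30 + [20] * 30 + [29]
print(zscore_outliers(ages))  # only the 29-year-old is flagged

# Sample-size check with the figures reported in the text.
print(round(observations_per_variable(608, 14), 1))   # 43.4
print(round(observations_per_variable(1331, 9), 1))   # 147.9
```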

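Adjusted logit formula for all students:

$$
\begin{aligned}
\text{Quit}_i = {} & \beta_0 + \beta_1 \text{age}_i + \beta_2 \text{gender}_i + \beta_3 \text{start}_i + \beta_4 \text{stream}_i + \beta_5 \text{apply}_i + \beta_6 \text{nat}_{\text{Dutch},i} \\
& + \beta_7 \text{nat}_{\text{west},i} + \beta_8 \text{nat}_{\text{east},i} + \beta_9 \text{nat}_{\text{other},i} + \varepsilon_i, \quad i = 1, \ldots, N
\end{aligned}
$$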

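Adjusted duration formula for all students:

$$
\begin{aligned}
h_i(t) = h_0(t) \exp\big( & \beta_1 \text{age}_i + \beta_2 \text{gender}_i + \beta_3 \text{start}_i + \beta_4 \text{stream}_i + \beta_5 \text{apply}_i + \beta_6 \text{nat}_{\text{Dutch},i} \\
& + \beta_7 \text{nat}_{\text{west},i} + \beta_8 \text{nat}_{\text{east},i} + \beta_9 \text{nat}_{\text{other},i} \big), \quad i = 1, \ldots, N
\end{aligned}
$$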


3 Results

3.1 Sample description

The sample consists of 1331 students, divided over 5 different years (2011, 2012, 2013, 2014, and 2015). Students from M&EM are more often female (57.0%) than male (43.0%), while the average age at the start of the studies is 19.6 (SD = 1.9). The education can be followed in both Dutch (54.9%) and English (45.1%). All foreign students choose the English stream, while just over a fifth (21.4%) of the Dutch students choose the English variant. Germany supplies the most non-Dutch students (17.7%), followed by Bulgaria (2.6%) and Romania (1.6%) (see Table 3). The latter two are grouped together with the other Eastern European countries. Both German (-49.2%) and Dutch (-44.2%) students show a decline in both absolute numbers and percentages over the five years, but this is mostly due to the large drop in enrolments in the last cohort (-34.0% compared to 2014).

| | 2011 (N = 295) | 2012 (N = 274) | 2013 (N = 294) | 2014 (N = 282) | 2015 (N = 186) | Total (N = 1331) |
|---|---|---|---|---|---|---|
| Age (mean, SD) | 19.64 (2.10) | 19.64 (1.99) | 19.51 (1.75) | 19.61 (1.92) | 19.44 (1.91) | 19.58 (1.94) |
| Gender (n, % female) | 164 (55.6%) | 156 (56.9%) | 162 (55.1%) | 159 (56.4%) | 118 (63.4%) | 759 (57.0%) |
| Start (n, % February) | 53 (18.0%) | 53 (19.3%) | 48 (16.3%) | 69 (24.5%) | 30 (16.1%) | 253 (19.0%) |
| Stream (n, % Dutch) | 182 (61.7%) | 164 (59.9%) | 136 (46.3%) | 155 (55.0%) | 94 (50.5%) | 731 (54.9%) |
| Dropout (n, %) | 80 (27.1%) | 76 (27.7%) | 75 (25.5%) | 104 (36.9%) | 41 (22.0%) | 376 (28.2%) |
| Nationality (n, %) | | | | | | |
| Dutch | 217 (73.6%) | 194 (70.8%) | 195 (66.3%) | 198 (70.2%) | 121 (65.1%) | 925 (69.5%) |
| German | 59 (20.0%) | 52 (19.0%) | 49 (16.7%) | 45 (16.0%) | 30 (16.1%) | 235 (17.7%) |
| Other west-European | 4 (1.4%) | 2 (0.7%) | 5 (1.7%) | 5 (1.8%) | 4 (2.2%) | 20 (1.5%) |
| East-European | 12 (4.1%) | 21 (7.7%) | 33 (11.2%) | 29 (10.3%) | 24 (12.9%) | 119 (8.9%) |
| Other | 3 (1.0%) | 5 (1.8%) | 12 (4.1%) | 5 (1.8%) | 7 (3.8%) | 32 (2.4%) |

Table 3 – Demographic characteristics of the five different cohorts and the total sample

Students can start with the education in either September (81.0%) or February (19.0%). The


3.2 Logit model

The analysis that follows will first present a model (1) with all possible explanatory variables (both from theory and the dataset itself). After that, a model (2) will be presented including the variables from the theory (gender, age, average previous education; highlighted in red in the tables) as well as all significant variables from the dataset. The final model (3) will only include significant variables. The backward Wald technique was used to determine which variables would be included in the final model, using an alpha of 10%. Finally, the model fit will be discussed using the hit rate, the log likelihood, and the Cox & Snell R², Nagelkerke R², and McFadden R². The hit rate will be calculated using the threshold value (Wooldridge, 2012) of the sample: 28.2% for the N = 1331 dataset, and 27.1% for the N = 608 dataset. Each model will be compared to the constant-only model based on the -2 log likelihood, and the models will be compared to each other as well. The comparison is done with a chi-square test: the chi-square value is the difference in -2 log likelihood between two models, and the degrees of freedom equal the difference in the number of parameters. Based on this, a p-value is calculated which indicates which model is preferred. If the p-value is significant, the model with more parameters is better; if the p-value is insignificant, the model with fewer parameters is preferred. As explained in the previous chapter, the analysis will be done twice: first for all students, and additionally for only the Dutch students whose prior education was either HAVO or VWO.
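The hit rate with a sample-based threshold can be illustrated as follows. This is a sketch under assumptions: the predicted probabilities are invented, the function names are hypothetical, and the naive model is assumed here to be the one that always predicts the most frequent outcome; the thesis itself does not spell out these details.

```python
def hit_rate(probabilities, outcomes, threshold):
    """Share of students classified correctly when a predicted probability
    at or above the threshold is classified as 'quits' (1)."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    hits = sum(int(pred == y) for pred, y in zip(predictions, outcomes))
    return hits / len(outcomes)


def naive_hit_rate(outcomes):
    """Assumed naive benchmark: always predict the most frequent outcome."""
    share_quit = sum(outcomes) / len(outcomes)
    return max(share_quit, 1.0 - share_quit)


# Invented example: four students, threshold equal to the sample dropout rate.
probs = [0.10, 0.40, 0.30, 0.20]
quits = [0, 1, 0, 0]
print(hit_rate(probs, quits, threshold=0.282))  # 0.75
print(naive_hit_rate(quits))                    # 0.75
```

In this invented example the model does not beat the naive benchmark, which is the same pattern a hit rate comparison is meant to reveal.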

3.2.1 Logit model – all students


| | Model 1: odds ratio | Model 1: Wald (p-value) | Model 2: odds ratio | Model 2: Wald (p-value) | Model 3: odds ratio | Model 3: Wald (p-value) |
|---|---|---|---|---|---|---|
| Constant | 0.092 | 7.190 (0.007)*** | 0.087 | 7.807 (0.005)*** | 0.311 | 24.466 (0.000)*** |
| Age | 1.063 | 2.473 (0.116) | 1.065 | 2.641 (0.104) | | |
| Gender | 0.853 | 1.161 (0.281) | 0.854 | 1.146 (0.284) | | |
| Starting month | 0.949 | 0.073 (0.786) | | | | |
| Stream (Dutch vs English) | 1.054 | 0.067 (0.795) | | | | |
| Days prior to start of studies | 0.998 | 2.999 (0.083)* | 0.998 | 3.581 (0.058)* | 0.997 | 5.984 (0.014)** |
| Nationality | | 12.404 (0.015)** | | 13.518 (0.009)*** | | 11.534 (0.021)** |
| Nationality – Dutch | 1.974 | 6.027 (0.014)** | 2.037 | 10.224 (0.001)*** | 1.871 | 8.387 (0.004)*** |
| Nationality – other west-European | 1.995 | 1.177 (0.278) | 1.977 | 1.148 (0.284) | 2.012 | 1.213 (0.271) |
| Nationality – East-European | 1.853 | 11.683 (0.001)*** | 2.816 | 11.64 (0.001)*** | 2.577 | 10.024 (0.002)*** |
| Nationality – Other | 1.974 | 1.529 (0.216) | 1.864 | 1.567 (0.211) | 1.772 | 1.326 (0.250) |

* = significant at 0.10 level, ** = significant at 0.05 level, *** = significant at 0.01 level

Table 4 – Logit models, all students: odds ratios and p-values

Model choice

The table below shows the chi-square values when the models are compared with one another. The first row shows the comparison of each model to the null model, that is the model with only the constant. These tests are all significant, meaning that each model performs better than the null model. However, when comparing model 2 with model 1 and subsequently model 3 with model 2, the tests show that there is no significant difference. The interpretation of this is that the model with fewer parameters is then preferred, as adding more parameters does not significantly improve the model. Therefore, model 3 is the preferred model based on the chi-square statistics.

| Compared to | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Null model | χ² = 24.716, df = 9, p = 0.003 | χ² = 24.581, df = 7, p = 0.001 | χ² = 19.668, df = 5, p = 0.001 |
| Model 1 | | χ² = 0.135, df = 2, p = 0.935 | |
| Model 2 | | | χ² = 4.913, df = 2, p = 0.086 |

Table 5 – Logit models, all students: model comparison
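The comparisons in Table 5 are likelihood ratio tests: the chi-square value is the difference in -2 log likelihood between two nested models, and the degrees of freedom equal the difference in the number of parameters. The sketch below reproduces two of the reported p-values from the -2 log likelihoods in this chapter. It is an illustration only: the function names are hypothetical, and the chi-square tail probability is computed with the standard closed-form expressions for integer degrees of freedom so that no external library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over k < df/2 of (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum of x^(k-1/2) / (2k-1)!!
    term, total = math.sqrt(x), 0.0
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= x / (2 * k - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def likelihood_ratio_test(neg2ll_small, neg2ll_large, df):
    """Compare a smaller model with a larger nested model."""
    chi_square = neg2ll_small - neg2ll_large
    return chi_square, chi2_sf(chi_square, df)


# Model 3 versus model 2 (in sample -2 log likelihoods, 2 extra parameters).
chi_sq_23, p_23 = likelihood_ratio_test(1188.888, 1183.975, df=2)
# Null model versus model 1 (9 parameters).
chi_sq_01, p_01 = likelihood_ratio_test(1208.556, 1183.840, df=9)
print(round(chi_sq_23, 3), round(p_23, 3))
print(round(chi_sq_01, 3), round(p_01, 3))
```

A significant p-value favours the larger model; otherwise the smaller model is kept.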

Model fit and performance

The predictive power of this model is not very good, as the table below shows. For the out of sample set, the hit rate of the predictive model equals that of the naive model, meaning that our model predicts no better than the naive model. The in sample hit rate is only marginally better in the first two models and equal to the naive model in the third model. In Table 4, some parameters showed a significant influence, so the model has descriptive value, but its predictions are weak. In addition, the calculated R² values are low, which is another indication that the model underperforms (see Table 6).

| Measure | Model 1: In sample | Model 1: Out of sample | Model 2: In sample | Model 2: Out of sample | Model 3: In sample | Model 3: Out of sample |
|---|---|---|---|---|---|---|
| -2 log likelihood | 1183.840 | 370.023 | 1183.975 | 370.361 | 1188.888 | 370.916 |
| Hit rate – naive | 70.8% | 74.5% | 70.8% | 74.5% | 70.8% | 74.5% |
| Hit rate | 70.9% | 74.5% | 70.9% | 74.5% | 70.8% | 74.5% |
| Cox & Snell R² | 0.024 | 0.013 | 0.024 | 0.012 | 0.019 | 0.011 |
| Nagelkerke R² | 0.035 | 0.019 | 0.035 | 0.018 | 0.028 | 0.015 |
| McFadden R² | 0.020 | 0.012 | 0.020 | 0.011 | 0.016 | 0.009 |

In sample null model: -2 log likelihood = 1208.556
Out of sample null model: -2 log likelihood = 374.401
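The three pseudo R² measures in the table can be derived from the -2 log likelihoods of the null model and the fitted model. The sketch below uses the standard textbook definitions and reproduces the in sample values of model 1. It assumes an in sample size of N = 1001, the value reported for the in sample set of the hazard models later in this chapter; the function name is hypothetical.

```python
import math


def pseudo_r2(neg2ll_null, neg2ll_model, n):
    """Return McFadden, Cox & Snell and Nagelkerke R-squared values."""
    lr = neg2ll_null - neg2ll_model                  # likelihood ratio statistic
    mcfadden = 1 - neg2ll_model / neg2ll_null
    cox_snell = 1 - math.exp(-lr / n)
    max_cox_snell = 1 - math.exp(-neg2ll_null / n)   # upper bound of Cox & Snell
    nagelkerke = cox_snell / max_cox_snell
    return mcfadden, cox_snell, nagelkerke


# Model 1, in sample; N = 1001 is an assumption (see the lead-in).
mcfadden, cox_snell, nagelkerke = pseudo_r2(1208.556, 1183.840, n=1001)
print(round(mcfadden, 3), round(cox_snell, 3), round(nagelkerke, 3))
```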

Another way of looking at the predictive power of a model is through the Top Decile Lift [TDL] and the Gini coefficient. The TDL is a measure that is often used in logit, probit and classification tree models (Blattberg, Kim & Neslin, 2008) and is defined as “… the fraction of churners in the top-decile divided by the fraction of churners in the whole set” (Blattberg, Kim & Neslin, 2008: 263). The strength of this measure is that it identifies those customers that have a high probability to churn, or in the light of this thesis: the TDL depicts those students that are most likely to quit the studies in year one (i.e. the ‘students at risk’). A second measure that will be used to describe the predictive power is the Gini coefficient, which takes into account the overall performance of the model. It is used to compare the quality of a model-based selection with a random selection of students. The Gini coefficient is “… calculated by dividing the area between the cumulative lift curve and the 45-degree line by the area under the 45-degree line” (Blattberg, Kim & Neslin, 2008: 319). Both the TDL and the Gini coefficient should be large to indicate that the model has good predictive power. Note, however, that the TDL is a ratio that can exceed 1, whereas the Gini coefficient of a model that does at least as well as a random selection ranges between 0 and 1.

Figure 3 – Logit model, all students: TDL & Gini coefficient
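The two measures quoted above can be computed from observed outcomes and predicted scores as sketched below. This is a minimal illustration, not the procedure used for Figure 3: it assumes that the "cumulative lift curve" is the curve of the cumulative share of dropouts captured when students are ranked from highest to lowest predicted probability, and it approximates the area under that curve with the trapezoid rule. The function names and the toy data are hypothetical.

```python
def rank_by_score(y_true, scores):
    """Return observed outcomes ordered from highest to lowest predicted score."""
    pairs = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return [y for _, y in pairs]


def top_decile_lift(y_true, scores):
    """Dropout rate in the top 10% of scores divided by the overall dropout rate."""
    ranked = rank_by_score(y_true, scores)
    n_top = max(1, len(ranked) // 10)
    top_rate = sum(ranked[:n_top]) / n_top
    overall_rate = sum(ranked) / len(ranked)
    return top_rate / overall_rate


def gini_coefficient(y_true, scores):
    """Area between the cumulative curve and the 45-degree line, divided by 0.5."""
    ranked = rank_by_score(y_true, scores)
    n, total = len(ranked), sum(ranked)
    area, previous, captured = 0.0, 0.0, 0
    for y in ranked:
        captured += y
        current = captured / total
        area += (previous + current) / 2 / n  # trapezoid of width 1/n
        previous = current
    return (area - 0.5) / 0.5


# Toy data (hypothetical): 20 students, the 4 dropouts have the highest scores.
y_true = [1, 1, 1, 1] + [0] * 16
scores = [1.0 - i / 20 for i in range(20)]

tdl = top_decile_lift(y_true, scores)
gini = gini_coefficient(y_true, scores)
print(tdl, gini)
```

With this perfect ranking the top decile contains only dropouts, so the TDL equals one divided by the overall dropout rate.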


3.2.2 Logit model – Dutch HAVO & VWO students only

The table below is interpreted directly after Table 7.

| Variable | Model 1: Odds ratio | Model 1: Wald (p-value) | Model 2: Odds ratio | Model 2: Wald (p-value) | Model 3: Odds ratio | Model 3: Wald (p-value) |
|---|---|---|---|---|---|---|
| Constant | 5043.512 | 8.562 (0.003)*** | 1449.103 | 7.107 (0.008)*** | 417.007 | 8.170 (0.004)*** |
| Age | 0.887 | 1.869 (0.172) | 0.941 | 0.569 (0.450) | | |
| Gender | 0.748 | 1.334 (0.248) | 0.777 | 1.186 (0.276) | | |
| Starting month | 0.608 | 2.207 (0.137) | 0.527 | 4.303 (0.038)** | 0.546 | 3.931 (0.047)** |
| Stream (Dutch vs English) | 0.737 | 1.209 (0.271) | | | | |
| Days prior to start of studies | 0.997 | 2.080 (0.149) | | | | |
| Average previous education | 0.359 | 8.765 (0.003)*** | 0.362 | 9.512 (0.002)*** | 0.358 | 9.878 (0.002)*** |
| Profile³ | | 4.951 (0.175) | | | | |
| Profile – C+M | 1.202 | 0.433 (0.511) | | | | |
| Profile – E+M & C+M | 1.522 | 1.163 (0.281) | | | | |
| Profile – N+G and/or N+T | 1.961 | 4.525 (0.033)** | | | | |
| Province⁴ | | 1.512 (0.824) | | | | |
| Province – Groningen | 0.861 | 0.13 (0.719) | | | | |
| Province – Drenthe | 1.394 | 0.714 (0.398) | | | | |
| Province – Overijssel | 1.251 | 0.404 (0.525) | | | | |
| Province – Other | 1.142 | 0.156 (0.693) | | | | |
| Urbanisation | 2.240 | 4.247 (0.039)** | 1.719 | 3.455 (0.063)* | 1.612 | 2.834 (0.092)* |

* = significant at 0.10 level, ** = significant at 0.05 level, *** = significant at 0.01 level

Table 7 – Logit models, HAVO & VWO students: odds ratios and p-values

Looking at the Dutch HAVO & VWO students, it can be concluded that three variables are significant: starting month, average from previous education, and urbanisation. Dutch students starting in September have lower odds of dropping out of the studies during the first year than students starting in February. The average from their HAVO or VWO high school is also significant: the higher a student’s average, the less likely the student is to drop out. Finally, urbanisation is a significant variable: students who attended a high school in a big city are more likely to quit the studies than students who attended a high school in a less urbanised city.

Model choice

All models perform better than the ‘constant-only model’ (see Table 8). However, comparing model 2 with model 1 does not show a significant chi-square value, which means that model 2 (with fewer parameters) is preferred. The same conclusion can be drawn when comparing models 2 and 3: model 3 has fewer parameters and is therefore preferred over model 2. Therefore, based on these statistics, model 3 with three significant parameters is preferred.

| Compared to | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Null model | χ² = 30.877, df = 14, p = 0.006 | χ² = 23.055, df = 6, p = 0.001 | χ² = 20.268, df = 3, p = 0.000 |
| Model 1 | | χ² = 7.822, df = 8, p = 0.451 | |
| Model 2 | | | χ² = 2.787, df = 3, p = 0.426 |

Table 8 – Logit models, HAVO & VWO students: model comparison

Model fit and performance

Again, the hit rates of the predictive model are not very good. Out of sample, only the first model outperforms the naive model; for the other two models the hit rate equals that of the naive model. In sample, the last model shows a slightly higher hit rate, but all in all this model is not much better than the naive model. The reported R² values are also very low, which is another indication that these models do not have much predictive power (see table below).

| Measure | Model 1: In sample | Model 1: Out of sample | Model 2: In sample | Model 2: Out of sample | Model 3: In sample | Model 3: Out of sample |
|---|---|---|---|---|---|---|
| -2 log likelihood | 500.885 | 161.625 | 510.066 | 176.451 | 511.494 | 177.541 |
| Hit rate – naive | 72.6% | 73.5% | 72.6% | 73.5% | 72.6% | 73.5% |
| Hit rate | 72.8% | 76.1% | 72.4% | 73.5% | 73.1% | 73.5% |
| Cox & Snell R² | 0.066 | 0.107 | 0.047 | 0.017 | 0.044 | 0.010 |
| Nagelkerke R² | 0.095 | 0.156 | 0.068 | 0.025 | 0.063 | 0.015 |
| McFadden R² | 0.058 | 0.098 | 0.041 | 0.015 | 0.038 | 0.009 |

In sample null model: -2 log likelihood = 531.762
Out of sample null model: -2 log likelihood = 179.096

As the figure below shows, the predictive power of this model is very poor. The Gini coefficient is nearly 0, and the TDL is modest at only 1.51. The curves make this even more visible: the model sometimes even performs worse than a random selection. It can therefore be concluded that the model with the predictors ‘starting month’, ‘average from previous education’, and ‘urbanisation’ is not good at predicting whether a student quits or not.

Figure 4 – Logit model, HAVO & VWO students: TDL & Gini coefficient


3.3 Cox Proportional-hazards model

In the next paragraph, the procedure from 3.2 will be replicated. That is, three models for all students will be shown first: (1) a model that includes all variables, (2) a model with the variables from the theory (highlighted in red) plus all significant variables, and (3) a model that only includes the significant variables. The second table will do the same, but only for the Dutch HAVO & VWO students. Again, an alpha of 10% was used to determine whether variables would be included in the final model using the Wald backward technique.

3.3.1 Hazard model – all students

| Variable | Model 1: Hazard ratio | Model 1: Wald (p-value) | Model 2: Hazard ratio | Model 2: Wald (p-value) | Model 3: Hazard ratio | Model 3: Wald (p-value) |
|---|---|---|---|---|---|---|
| Age | 1.034 | 1.134 (0.287) | 1.037 | 1.356 (0.244) | | |
| Gender | 0.902 | 0.710 (0.400) | 0.905 | 0.67 (0.413) | | |
| Starting month | 0.882 | 0.616 (0.432) | | | | |
| Stream (Dutch vs English) | 1.022 | 0.017 (0.897) | | | | |
| Days prior to start of studies | 0.998 | 2.601 (0.107) | 0.998 | 3.737 (0.053)* | 0.998 | 5.694 (0.017)** |
| Nationality⁵ | 0.000 | 11.529 (0.021)** | | 11.68 (0.020)** | | 10.515 (0.033)** |
| Nationality – Dutch | 1.803 | 6.021 (0.014)** | 1.784 | 8.642 (0.003)*** | 1.699 | 7.540 (0.006)*** |
| Nationality – other west-European | 1.704 | 1.004 (0.316) | 1.668 | 0.927 (0.336) | 1.670 | 0.932 (0.334) |
| Nationality – East-European | 2.341 | 11.021 (0.001)*** | 2.299 | 10.64 (0.001)*** | 2.189 | 9.653 (0.002)*** |
| Nationality – Other | 1.646 | 1.399 (0.237) | 1.632 | 1.359 (0.244) | 1.586 | 1.207 (0.272) |

* = significant at 0.10 level, ** = significant at 0.05 level, *** = significant at 0.01 level

Table 10 – Hazard models, all students: hazard ratios and p-values

The same variables are significant as in the logit model from paragraph 3.2.1 and the parameters are broadly comparable too. Due to the nature of the duration model, however, the interpretation of the results is slightly different. The application date is significant in this model, with a hazard ratio of 0.998 per day prior to the start of the studies. This means that, at any given time, a student has a roughly 0.2% higher hazard of dropping out than a student who applied 1 day earlier. For example, a student who applied on the 1st of July (62 days prior to the start) applied 122 days later than a student who applied on the 1st of March (184 days prior to the start). Adding up 0.2% per day makes the July applicant approximately 24.4% more likely to quit; compounding the rounded hazard ratio (1/0.998^122) gives approximately 27.7%. Because the hazards are proportional, this relative difference holds in any given month.
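The arithmetic behind the application-date example can be checked with a few lines. The sketch below contrasts the additive reading (0.2% per day times 122 days) with the multiplicative reading that follows from the proportional-hazards form, in which the hazard ratio is raised to the power of the difference in days. It uses the rounded hazard ratio of 0.998 from Table 10, so the compounded figure is only approximate; the variable names are illustrative.

```python
hazard_ratio_per_day = 0.998   # per additional day prior to the start (Table 10)
days_march = 184               # applied on the 1st of March
days_july = 62                 # applied on the 1st of July
extra_days = days_march - days_july

# Additive approximation: 0.2% per day, summed over the difference in days.
linear_pct = extra_days * (1 - hazard_ratio_per_day) * 100

# Multiplicative (proportional hazards): hazard of the July applicant
# relative to the March applicant.
relative_hazard = hazard_ratio_per_day ** (days_july - days_march)
compound_pct = (relative_hazard - 1) * 100

print(extra_days, round(linear_pct, 1), round(compound_pct, 1))
```

The gap between the two figures grows with the number of days, so the additive shortcut is best reserved for small differences in application date.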

Nationality is also a significant parameter, with Dutch and east-European students having a higher hazard ratio compared to German students. This means that Dutch and east-European students have a higher probability to quit at any given month compared to German students, which can also be seen in the graph below.

Figure 5 – Survival function of five different nationality groups

Model choice


| Compared to | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Null model | χ² = 21.800, df = 9, p = 0.009 | χ² = 21.188, df = 7, p = 0.004 | χ² = 18.556, df = 5, p = 0.002 |
| Model 1 | | χ² = 0.612, df = 2, p = 0.736 | |
| Model 2 | | | χ² = 2.632, df = 2, p = 0.268 |

Table 11 – Hazard models, all students: model comparison

Model fit and performance

AIC and BIC are types of information criteria that can be used to assess how well a model fits the data. The lower these numbers, the better. They are especially suited if models are not nested (Leeflang et al., 2015), but may be used in addition to the calculated chi-square values as well. Based on the information criteria, it can be concluded that model 3 is preferred over the other 2 models. In other words, the model with only the significant parameters is to be used.

| Measure | Model 1: In sample | Model 1: Out of sample | Model 2: In sample | Model 2: Out of sample | Model 3: In sample | Model 3: Out of sample |
|---|---|---|---|---|---|---|
| -2 log likelihood | 3943.606 | 954.669 | 3944.218 | 954.735 | 3946.850 | 955.464 |
| Number of parameters | 9 | 9 | 7 | 7 | 5 | 5 |
| N | 1001 | 330 | 1001 | 330 | 1001 | 330 |
| AIC | 3961.606 | 972.669 | 3958.218 | 968.735 | 3956.85 | 965.464 |
| BIC | 4005.785 | 1006.861 | 3992.579 | 995.329 | 3981.394 | 984.459 |

In sample null model: -2 log likelihood = 3965.406
Out of sample null model: -2 log likelihood = 958.476
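The AIC and BIC values in the table are consistent with the usual definitions, AIC = -2 log likelihood + 2k and BIC = -2 log likelihood + k ln(N), where k is the number of parameters. The sketch below recomputes the in sample values for models 1 and 3 as a check. The formulas are inferred from the reported numbers rather than stated in the text, and the function names are hypothetical.

```python
import math


def aic(neg2ll, k):
    """Akaike information criterion from -2 log likelihood and parameter count."""
    return neg2ll + 2 * k


def bic(neg2ll, k, n):
    """Bayesian information criterion; n is the number of observations."""
    return neg2ll + k * math.log(n)


# In sample values from the table above (N = 1001).
aic_model1 = aic(3943.606, 9)
bic_model1 = bic(3943.606, 9, 1001)
aic_model3 = aic(3946.850, 5)
bic_model3 = bic(3946.850, 5, 1001)

print(round(aic_model1, 3), round(bic_model1, 3))
print(round(aic_model3, 3), round(bic_model3, 3))
```

Model 3 has the lower value on both criteria, in line with the preference stated above; BIC penalises the extra parameters more heavily than AIC because ln(1001) is larger than 2.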


3.3.2 Hazard model – Dutch HAVO & VWO students only

The table below is interpreted directly after Table 13.

| Variable | Model 1: Hazard ratio | Model 1: Wald (p-value) | Model 2: Hazard ratio | Model 2: Wald (p-value) | Model 3: Hazard ratio | Model 3: Wald (p-value) |
|---|---|---|---|---|---|---|
| Age | 0.884 | 2.893 (0.089)* | 0.930 | 1.176 (0.278) | | |
| Gender | 0.801 | 1.202 (0.273) | 0.817 | 1.114 (0.291) | | |
| Starting month | 0.572 | 4.440 (0.035)** | 0.501 | 8.285 (0.004)*** | 0.529 | 7.293 (0.007)*** |
| Stream (Dutch vs English) | 0.818 | 0.730 (0.393) | | | | |
| Days prior to start of studies | 0.998 | 1.743 (0.187) | | | | |
| Average previous education | 0.460 | 7.103 (0.008)*** | 0.462 | 7.536 (0.006)*** | 0.460 | 7.791 (0.005)*** |
| Profile⁶ | | 4.689 (0.196) | | | | |
| Profile – C+M | 1.128 | 0.268 (0.605) | | | | |
| Profile – E+M & C+M | 1.414 | 1.224 (0.269) | | | | |
| Profile – N+G and/or N+T | 1.679 | 4.145 (0.042)** | | | | |
| Province⁷ | | 3.070 (0.546) | | | | |
| Province – Groningen | 0.699 | 0.871 (0.351) | | | | |
| Province – Drenthe | 1.320 | 0.704 (0.402) | | | | |
| Province – Overijssel | 1.205 | 0.391 (0.532) | | | | |
| Province – Other | 1.158 | 0.273 (0.601) | | | | |
| Urbanisation | 2.408 | 6.312 (0.012)** | 1.567 | 3.890 (0.049)** | 1.464 | 2.972 (0.085)* |

* = significant at 0.10 level, ** = significant at 0.05 level, *** = significant at 0.01 level

Table 13 – Hazard models, HAVO & VWO students: hazard ratios and p-values

For this model, the same variables are significant as for the logit model for the Dutch students. The interpretation of these data is as follows: the hazard ratio for starting month is 0.529, meaning that a student starting in September is 47.1% less likely to quit at any given t compared to a student starting in February. In addition, students are 54% less likely to quit for every additional point on their average from the previous education. Both variables are highly significant (p<0.01). Finally, urbanisation shows that students who attended a high school in an urbanised city are 46.4% more likely to quit at any given t compared to students who did not attend a high school in an urbanised city (a list of all urbanised cities and the number of students from these cities can be found in the appendix, Table 18). The graphs below visualise this.

Model choice

As with the previous models, the chi-square values indicate that model 3 is preferred over the other two models. The chi-square tests show no significant differences between models 1 and 2 or between models 2 and 3, leading to the conclusion that the model with the fewest parameters (model 3) is the preferred model here.

| Compared to | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Null model | χ² = 31.916, df = 14, p = 0.004 | χ² = 21.815, df = 5, p = 0.001 | χ² = 19.982, df = 3, p = 0.000 |
| Model 1 | | χ² = 10.101, df = 9, p = 0.342 | |
| Model 2 | | | χ² = 1.833, df = 2, p = 0.400 |

Model fit and performance

The differences in AIC and BIC scores are not large, but the table below also indicates that model 3 is preferred over the other two models. Therefore, the model with only the three significant parameters should be used and interpreted.

| Measure | Model 1: In sample | Model 1: Out of sample | Model 2: In sample | Model 2: Out of sample | Model 3: In sample | Model 3: Out of sample |
|---|---|---|---|---|---|---|
| -2 log likelihood | 1457.338 | 389.224 | 1467.439 | 403.253 | 1469.272 | 404.096 |
| Number of parameters | 14 | 14 | 5 | 5 | 3 | 3 |
| N | 453 | 155 | 453 | 155 | 453 | 155 |
| AIC | 1485.338 | 417.224 | 1477.439 | 413.253 | 1475.272 | 410.096 |
| BIC | 1542.960 | 459.832 | 1498.018 | 428.470 | 1487.620 | 419.226 |

In sample null model: -2 log likelihood = 1489.255
Out of sample null model: -2 log likelihood = 405.152


4 Conclusion

The aim for this thesis was to develop a model that not only describes which students are more likely to quit but also when they quit. These questions were answered using a logit model (for the which question) and a duration model (for the when question). The analysis showed that not many variables are significant parameters in determining which students are more likely to quit when. The analysis was done twice: once for all students and once for only Dutch students whose prior education was HAVO or VWO. It can be concluded that both the logit model and the duration model have similar variables that are significant. However, when comparing the two datasets, we observe differences in significant parameters (see table below).

| Variable | Logit model: All students | Logit model: HAVO & VWO students | Duration model: All students | Duration model: HAVO & VWO students |
|---|---|---|---|---|
| Age (continuous) | no effect | no effect | no effect | no effect |
| Gender (0=male) | no effect | no effect | no effect | no effect |
| Average previous education (continuous) | | negative effect | | negative effect |
| Starting month (0=February) | no effect | negative effect | no effect | negative effect |
| Days prior to start of studies (continuous) | negative effect | no effect | negative effect | no effect |
| Nationality (categorical) | Dutch & east-European positive effect | | Dutch & east-European positive effect | |
| Urbanisation (0=not urbanised) | | positive effect | | positive effect |

Table 16 – Overview of (in)significant parameters in both the logit and duration model

In the introduction, several hypotheses were formulated, which can now be assessed based on the analysis of the results in the previous chapter. Since gender and age were not found to be significant in any of the preferred models, hypotheses 1A, 1B, 2A, and 2B can all be rejected. This is in line with prior research by Amuda, Bulus and Joseph (2016), who also state that age is not a significant variable when evaluating academic performance. Naderi et al. (2009) conclude the same with respect to gender, so the results from this analysis are in line with their results. However, the results from this analysis do contradict the general notion about these two variables, namely that male students are more likely to quit than females and that older students are more likely to quit than younger students (as discussed in the first chapter).

Hypotheses 3A and 3B can be confirmed, since the average from the previous education does matter with respect to dropping out of the studies. Students with a higher average are less likely to drop out compared to students with a lower average. In addition, students with a higher average are also less likely to drop out sooner, so they seem to be more persistent in continuing the studies.

| Hypothesis | Description | Support |
|---|---|---|
| H1A | Older students are more likely to drop out of the studies during the first year compared to younger students. | No |
| H1B | Older students are more likely to drop out of the studies sooner during the first year compared to younger students. | No |
| H2A | Male students are more likely to drop out of the studies during the first year compared to female students. | No |
| H2B | Male students are more likely to drop out of the studies sooner during the first year compared to female students. | No |
| H3A | Dutch students who have a lower average from their previous education (HAVO/VWO) are more likely to drop out of the studies during the first year compared to Dutch students with a higher average. | Yes |
| H3B | Dutch students who have a lower average from their previous education (HAVO/VWO) are more likely to drop out of the studies sooner during the first year compared to Dutch students with a higher average. | Yes |


5 Recommendations and managerial implications

The analyses in the previous chapters showed that the descriptive model has some significant parameters that explain whether a student quits or not. Dutch ‘at risk’ students could be labelled as ‘February starters, who attended high school in an urbanised city, and who did not perform very well at their high school’. On the other hand, all ‘at risk’ students could be described as Dutch or east-European students who applied for the studies at a late moment (i.e. up to 3 months prior to the start of the studies). The education M&EM could benefit from this information by following these students more closely in the first few months of their studies, e.g. by offering better guidance in the form of meetings with a student counsellor or by offering additional classes in case they are struggling with a particular course.

One of the results from this paper was that students who apply for the education late have a higher chance of dropping out in their first year. Students who apply late could be invited to a hearing before they actually start. Another option is that M&EM looks into disabling the possibility of enrolling after a certain month altogether. This may, however, not be easy, as these dates are mainly determined by the Dutch government.

The Inspectie van het Onderwijs (2017) analyses the Dutch high schools yearly and reports its findings. One of its findings was that high schools in urbanised areas (e.g. Groningen, Leeuwarden) have more dispersion in their classes in terms of level difference between students. This could mean that underperforming students deserve and receive more attention than students that are doing better. This means that these students, on average, receive less supervision from their tutors compared to students at schools that have less dispersion. Another finding of the Inspectie van het Onderwijs (2017) was that the province of Groningen has relatively many ‘poor performing’ high schools. The city of Groningen has 28 high schools out of 67 in the province in total, so nearly half of the high schools are situated in the city itself. The analysis of this paper showed that students who attended a high school in an urbanised city have a higher chance of dropping out. These two facts (more


6 Limitations and future research

There are several other predictors of academic performance, which were not incorporated in this analysis. The main reason for this is that these variables (e.g. motivation, study skills, university satisfaction) are unknown at the start of the studies, and it is most beneficial for a university to identify possible quitters right at the start of a new academic year. Questionnaires could be developed based on these variables, so that these predictors can also be included in the model with the aim of optimising the power of the model. Other variables such as class attendance and grades obtained in the first few courses could be observed and entered into the model too. These should be aimed especially at the students ‘at risk’ identified in the current analysis.

In the Netherlands, it is quite common to visit a so-called open day before choosing and starting your studies. During this open day, students can visit the university, ask questions, visit lectures, talk to students and lecturers and ‘breathe student life’. It could be argued that having visited an open day could influence a student’s decision to apply for a certain university (or city). Having seen the city and university prior to starting may result in a more motivated student and in turn in a lower chance of quitting the studies. Therefore, it is recommended to include this variable too in future analyses.


7 References

Agresti, A., & Kateri, M. (2011). Categorical data analysis. Berlin: Springer.

Amuda, B. G., Bulus, A. K., & Joseph, H. P. (2016). Marital Status and Age as Predictors of Academic Performance of Students of Colleges of Education in North-East Nigeria. American Journal of Educational Research, 4(12), pp. 896-902.

Araque, F., Roldán, C., & Salguero, A. (2009). Factors influencing university dropout rates. Computers & Education, 53(3), pp. 563-574.

Blattberg, R. C., Kim, B. D., & Neslin, S. A. (2008). Database marketing: Analyzing and managing customers. New York: Springer Science & Business Media.

Burrus, J., & Roberts, R. D. (2012). Dropping out of high school: Prevalence, risk factors, and remediation strategies. R & D Connections, 18, pp. 1-9.

Calderon, J. M., Robles, R. R., Reyes, J. C., Matos, T. D., Negrón, J. L., & Cruz, M. A. (2009). Predictors of school dropout among adolescents in Puerto Rico. Puerto Rico health sciences journal, 28(4), pp. 307-312.

Cone, A. L., & Owens, S. K. (1991). Academic and locus of control enhancement in a freshman study skills and college adjustment course. Psychological Reports, 68(3 suppl), pp. 1211-1217.

Ebenuwa-Okoh, E. E. (2010). Influence of age, financial status, and gender on academic performance among undergraduates. Journal of Psychology, 1(2), pp. 99-103.

Franklin, B. J., & Trouard, S. B. (2014). An analysis of dropout predictors within a state high school graduation panel. Schooling, 5, pp. 1-8.

Inspectie van het Onderwijs (2017). Onderwijsverslag – De Staat van het Onderwijs.

Leeflang, P.S.H., Wieringa, J.E., Bijmolt, T.H.A., & Pauwels, K.H. (2015). Modeling Markets. Dordrecht: Springer.

Malhotra, N. K. (2010). Marketing research: An applied approach. Amsterdam: Pearson Education.

Marquez-Vera, C., Romero, C., & Ventura, S. (2011). Predicting school failure using data mining. Proceedings of the 4th International Conference on Educational Data Mining, pp. 271-276.

McKenzie, K., & Schweitzer, R. (2001). Who succeeds at university? Factors predicting academic performance in first year Australian university students. Higher education research & development, 20(1), pp. 21-33.

Naderi, H., Abdullah, R., Aizan, H. T., Sharir, J., & Kumar, V. (2009). Creativity, age and gender as predictors of academic achievement among undergraduate students. Journal of American Science, 5(5), pp. 101-112.

Sladek, R.M., Bond, M.J., Frost, L.K., & Prior, K.N. (2016). Predicting success in medical school: a longitudinal study of common Australian student selection tools. BMC Medical Education, 16(1), pp. 1-7.

Tabachnick, B. G., & Fidell, L. S. (2012). Using multivariate statistics. Amsterdam: Pearson Education.

Vereniging van Hogescholen (2016). Feiten en cijfers: afgestudeerden en uitvallers in het hoger beroepsonderwijs.

Wince, M. H., & Borden, V. M. (1995). When Does Student Satisfaction Matter? AIR 1995 Annual Forum Paper.


8 Appendices

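| High school city | Number of students |
|---|---|
| Amstelveen | 2 |
| Amsterdam | 36 |
| Apeldoorn | 3 |
| Breda | 3 |
| Bussum | 2 |
| Deventer | 1 |
| Enschede | 3 |
| Gouda | 1 |
| Groningen | 132 |
| Haarlem | 7 |
| Helmond | 1 |
| Hilversum | 9 |
| Leeuwarden | 299 |
| Leiden | 2 |
| Maastricht | 2 |
| Nijmegen | 2 |
| Rotterdam | 1 |
| 's-Gravenhage⁸ | 10 |
| Tilburg | 2 |
| Utrecht | 11 |
| Vlissingen | 1 |
| Zwolle | 11 |
| Total | 541 |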

Table 18 – High school attendance in urbanised cities


Students’ dropout rates

Descriptive and predictive, using logit and duration models


Reason & relevance

M&EM

High dropout rates (20-30%) compared to national average (15-16%)

Costly for both student and university

Management and marketing can both target


Method

Data obtained from 5 academic years

10 variables (mostly demographic) included

2 models used: logit (whether), duration (when)


Results: logit

All students: application day, nationality

R squares very low: between 1% and 3%

Hit rate same as naive model

TDL: 1.43, Gini: 0.08 → low predictive power

Dutch students: starting month, average previous education, ‘location’ high school

R squares low: 4%-6% (in sample), 1%-2% (out of sample)

Hit rate only marginally better than naive model

Results: duration

All students: application day, nationality

1 day sooner → 0.2% less likely

Dutch (70%) & East-European (119%) more likely compared to German

Dutch students: starting month, average previous education, ‘location’ high school

September → 47% less likely

Conclusions & managerial implications

Descriptive part shows significant variables for both models and both data sets.

Predictive power of logit model is low.

Follow ‘at risk’ students carefully at the start of their studies.

Gather additional variables to increase the power

Future research

Apply other models, such as machine learning techniques

Include other variables that are known to affect academic performance (motivation, study skills)
