New rules, new tools

Niessen, Anna Susanna Maria


Citation for published version (APA):

Niessen, A. S. M. (2018). New rules, new tools: Predicting academic achievement in college admissions. Rijksuniversiteit Groningen.



studying for or taking the tests, or by obtaining low scores on the tests. In future studies, a measure of enrollment intentions may be included before the admission tests are administered. In addition, promoting self-selection through curriculum sampling may also be useful in procedures aimed at placement decisions or at advising on student-program fit. Future research could investigate whether the predictive validity of curriculum-sampling tests generalizes to such lower-stakes procedures.

A possible explanation for the lower noncognitive saturation of the curriculum-sampling tests as compared to first year GPA is that the curriculum samples were not representative enough of the first-year courses, which may require more prolonged effort. This is supported by the finding that the grade on the first course in the program was a much better predictor of short-term and long-term academic performance than the scores on the curriculum-sampling tests. Although the literature-based curriculum-sampling test mimicked that course, the course required studying an entire book, whereas the curriculum-sampling test only required studying two chapters. A more comprehensive curriculum sample may be more saturated with constructs related to effortful behavior.

3.8.2 Conclusions

From existing research we know that high school GPA is a good predictor of future academic performance in higher education and is arguably the most efficient measure to use for admission decisions. In the present study we showed that the literature-based curriculum-sampling tests mostly showed similar or slightly higher predictive validity than high school GPA for first-year academic outcomes, and similar or slightly lower predictive validity for third-year academic outcomes. In addition, the curriculum-sampling test showed incremental validity over high school GPA. One caveat is that such tests should be of acceptable psychometric quality, and extra resources are needed to construct and administer them. However, in the format adopted in this study, constructing and administering the tests took relatively little time and effort, and the reliability was sufficient. A final advantage is that curriculum sampling is perceived as a favorable admission method, whereas applicants disliked the use of high school GPA in admission procedures (Niessen, Meijer, & Tendeiro, 2017a). So, in cases where using high school GPA is not feasible, or when self-selection or applicant perceptions are of major interest, curriculum-sampling tests may be preferred over, or used in addition to, high school GPA.

New Rules, New Tools:

Predicting academic achievement

in college admissions

Susan Niessen


Chapter 4

Curriculum sampling and differential prediction by gender

This chapter was based on the manuscript:

Niessen, A. S. M., Meijer, R. R., & Tendeiro, J. N. (2017). Gender-based differential prediction by curriculum samples for college admissions.


Abstract

A longstanding concern about admissions to higher education is the underprediction of female academic performance by traditional admission test scores and, to a lesser extent, by high school GPA. One explanation is that predictors that are related to both gender and academic performance are omitted from the prediction model. Noncognitive characteristics are often mentioned as potentially important omitted variables. Therefore, it is often advised to take them into account when making admission decisions. However, the self-report format that is used to assess such variables is not suitable for high-stakes admission contexts. An alternative approach is using representative performance samples. We examined differential prediction of academic performance by gender based on curriculum samples that represented later criterion behavior to varying degrees, using both frequentist and Bayesian analyses. Our results showed that differential prediction was not completely eliminated, but that the effects for slope and intercept differences were small, and that the effect sizes were smaller for more comprehensive curriculum samples. We conclude that curriculum sampling may offer a practically feasible solution for reducing differential prediction by gender in high-stakes operational selection settings, without having to rely on easily fakable self-report questionnaires to measure noncognitive traits and skills.

4.1 Introduction

Having a college degree determines to a large extent an individual’s employment opportunities and is “more indicative of income, of attitudes, and of political behavior than (…) region, race, age, religion, sex and class” (Lemann, 1999, p. 6). It is thus extremely important for individuals and society that access to higher education is fair and is not biased against, for example, gender, ethnicity, or socio-economic status. In this context, bias is often defined as differential prediction; that is, a systematic difference in criterion performance between subgroups at any given test score (Guion, 1989). Differential prediction is usually studied using moderated multiple regression models, examining differences in intercepts and differences in slopes (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014, p. 66; Cleary, 1968), although substantial slope differences are rarely found (e.g., Aguinis, Culpepper, & Pierce, 2010).
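For reference, the Cleary-type moderated multiple regression model underlying such differential prediction analyses can be written as follows (a standard formulation; the notation here is illustrative and not taken verbatim from this chapter):

```latex
% Moderated multiple regression (Cleary) model for differential prediction:
% Y = criterion (e.g., first year GPA), X = predictor score,
% G = dummy-coded group membership (e.g., gender), \varepsilon = residual.
Y = \beta_0 + \beta_1 X + \beta_2 G + \beta_3 (X \cdot G) + \varepsilon
% Intercept differences correspond to \beta_2 \neq 0, slope differences to
% \beta_3 \neq 0; no differential prediction implies \beta_2 = \beta_3 = 0.
```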

Given the major interests that are at stake, it is not surprising that differential prediction by gender is a well-researched area in pre-employment testing and in college admission testing. An often-reported finding is that cognitive test scores and traditional admission test scores show underprediction for women and overprediction for men; that is, women obtain better academic results than predicted by their admission test scores, whereas men obtain lower academic results than predicted (e.g., Fischer, Schult, & Hell, 2013; Keiser, Sackett, Kuncel, & Brothen, 2016; Mattern & Patterson, 2013). Although some authors have “blamed the tests”, others suggested that differential prediction by gender may be caused by the omission of valid variables from the prediction model that are related to both the criterion variable (e.g., college GPA) and gender (Jencks, 1998; Sackett, Laczo, & Lippe, 2003). For example, Keiser et al. (2016) found that adding conscientiousness scores and course-taking patterns to prediction models containing standardized admission test scores reduced female underprediction. Note that conscientiousness was an omitted variable in the predictor, and that the course-taking pattern was an omitted variable in the criterion. Consequently, Keiser et al. (2016) and other researchers (e.g., Goldstein, Zedeck, & Goldstein, 2002) recommended the inclusion of such omitted predictors in selection decisions, and called for further research on effective ways to assess such, mostly noncognitive, characteristics.

4.1.1 Assessment in High-stakes Contexts

There is a considerable number of studies that aim to design and evaluate admission instruments that measure noncognitive attributes (e.g., Kappe & van der Flier, 2012; Kyllonen, Lipnevic, Burrus, & Roberts, 2014; Kyllonen, Walter, & Kaufman, 2005; Oswald, Schmitt, Kim, Ramsay, & Gillespie, 2004; Shultz & Zedeck, 2012), but these studies were mostly based on self-report instruments administered in low-stakes contexts.


In high-stakes contexts, faking behavior or impression management poses a serious problem to the use of self-report instruments (Brown, 2016; Niessen, Meijer, & Tendeiro, 2017b; Peterson, Griffith, Isaacson, O'Connell, & Mangos, 2011). As Niessen and Meijer (2017) recently discussed, there are almost no studies that have investigated the effectiveness of noncognitive predictors, such as personality, in high-stakes educational contexts, and the generalizability of the validity of scores obtained on such instruments in low-stakes conditions to high-stakes admission conditions is limited (see chapter 5: Niessen et al., 2017b; Peterson et al., 2011). So, the finding that adding noncognitive variables to prediction models reduced differential prediction offers important insights, but provides no practical solution for reducing differential prediction in high-stakes contexts. Thus, alternative methods to capture noncognitive characteristics are needed.

In this study, we investigated differential prediction of academic performance by gender using a curriculum-sampling approach. In this approach, we do not measure cognitive or noncognitive traits or skills as “signs” to predict future behavior, but we use samples of relevant performance, without distinguishing between different psychological constructs (Wernimont & Campbell, 1968). Furthermore, we studied differential prediction with both a frequentist and a Bayesian (Kruschke, Aguinis, & Joo, 2012) moderated multiple regression approach. Bayesian analyses are becoming increasingly popular, but there are still very few applications within the organizational and educational domains. As we explain below, the Bayesian approach provides some very useful tools to study differential prediction and allows us to answer questions that we cannot answer using a frequentist approach.

4.1.2 Curriculum Sampling

One alternative approach to measuring distinct cognitive and noncognitive traits and skills in admission procedures is using representative samples of relevant performance. Such representative performance samples should tap into different cognitive and noncognitive traits and skills that are related to criterion performance as well (e.g., Callinan & Robertson, 2000; Hough, Oswald, & Ployhart, 2001; Lievens & De Soete, 2012).

High school GPA, which is often used for admission to (European) higher education, can be defined as such a multifaceted performance sample that measures cognitive abilities and the ability to ‘get it done’ (Bowen, Chingos, & McPherson, 2009, p. 123). Correspondingly, high school GPA is a good predictor of academic performance and has shown less differential prediction than standardized test scores (Mattern, Patterson, Shaw, Kobrin, & Barbuti, 2008; Zwick, 2017; Zwick & Himmelfarb, 2011). However, there are practical disadvantages to using high school GPA in admission procedures, such as negative applicant reactions (Niessen, Meijer, & Tendeiro, 2017a) and differing grading practices across high schools and countries, leading to comparability problems (Zwick, 2017, p. 57). Alternatively, at European universities, applicants are increasingly selected on the basis of curriculum sampling or curriculum-sampling tests (de Visser et al., 2017; Lievens & Coetsier, 2002; Niessen, Meijer, & Tendeiro, 2016; Vihavainen, Luukkainen, & Kurhila, 2013). These curriculum-sampling tests are analogous to the well-known work-sample tests used in personnel selection (e.g., Callinan & Robertson, 2000), both being based on the model of behavioral consistency (Wernimont & Campbell, 1968). The basic idea is simple: instead of relying on cognitive skills scores (e.g., SAT/ACT scores) in combination with noncognitive trait scores (e.g., conscientiousness questionnaire scores), applicants have to perform tasks that are similar to the tasks in their future study program. For undergraduate admission this usually consists of studying domain-specific material and taking an exam, but this approach can also be used to assess practical skills (Vihavainen et al., 2013) or communication skills and ethical reasoning skills (Reiter, Eva, Rosenfeld, & Norman, 2007). The advantages of a performance-sampling approach are: (1) high predictive validity (Niessen et al., 2016; Schmidt & Hunter, 1998), because predictor and criterion measures are matched in content (Sackett, Walmsley, Koch, Beatty, & Kuncel, 2016); (2) high face validity and positive applicant perceptions (e.g., Anderson, Salgado, & Hülsheger, 2010; Niessen et al., 2017a; see chapter 6); and (3) no dependence on easily fakable and, therefore, less valid self-report measures (see chapter 5: Niessen et al., 2017b).

So, in performance sampling, cognitive and noncognitive skills are not measured in isolation, but within representative tasks that are hypothesized to require a mixture of cognitive and noncognitive skills (Callinan & Robertson, 2000; Hough et al., 2001). Several authors suggested that using samples of relevant behavior or performance may reduce adverse impact (Callinan & Robertson, 2000; Hough et al., 2001; Ployhart & Holtz, 2008) and may lead to reduced or no differential prediction (e.g., Aramburu-Zabala Higuera, 2001; Robertson & Kandola, 1982). In the context of personnel selection, performance-sampling approaches showed lower mean score differences by gender than traditional cognitive tests (Dean, Roth, & Bobko, 2008; Lievens & De Soete, 2012; Roth, Buster, & Barnes-Farrell, 2010; Schmitt & Mills, 2001). Note, however, that most of these studies did not take the differences in reliability of the measures into account.


To our knowledge, no studies have investigated differential prediction for performance samples or curriculum samples.

4.1.3 Aim of the Present Study

In the present study, we investigated differential prediction by gender using curriculum samples as predictors of academic performance. We hypothesized that strong similarity between the predictor and the criterion measures should minimize or eliminate differential prediction. As an example, consider the hypothetical case in which students’ results obtained in the first year would be used as an admission criterion for the rest of an undergraduate program. Then, predictor and criterion measures are extremely similar in content and one would expect no or only trivial differential prediction. To investigate this hypothesis, we studied three different curriculum sample-criterion combinations with an increasing degree of sample comprehensiveness: (1) curriculum-sampling admission test scores to predict first year GPA, (2) first exam scores (first course grades) in the program to predict first year GPA, and (3) first year GPA to predict third year GPA. The latter two combinations are not always practically feasible, but they serve to explore our hypothesis that differential prediction is reduced as the representativeness of the curriculum sample increases.

To study these expectations, we used both a frequentist and a Bayesian (e.g., Kruschke, Aguinis, & Joo, 2012) step-down regression approach (Lautenschlager & Mendoza, 1986). A Bayesian approach is particularly suitable in this study because, contrary to the frequentist approach, it allows us to examine the evidence in favor of the null hypothesis of no differential prediction. This is, for example, interesting when we study slope differences (Aguinis et al., 2010; Fischer et al., 2013; Keiser et al., 2016; Mattern & Patterson, 2013). Contrary to the interpretations in some studies (e.g., Hough et al., 2001), the absence of statistically significant slope differences in frequentist analyses does not imply that such differences are nonexistent. Using a Bayesian approach, we can quantify how much the data support the null hypothesis of no slope differences. So, the aim of this paper was twofold: First, we investigated whether a curriculum-sampling approach could minimize differential prediction by gender in a high-stakes context. Second, we used a Bayesian approach to differential prediction analyses to illustrate how this technique can contribute to the interpretation of differential prediction results and, thus, to sound differential prediction research.
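In Bayes factor terms (the notation below is illustrative; the formal definition and the interpretation benchmarks used follow in Section 4.2.4), this amounts to evaluating the ratio of the marginal likelihoods of the data under the competing hypotheses:

```latex
% Bayes factor comparing differential prediction (H1) with no differential
% prediction (H0); BF01 values above 1 quantify support for the null.
\mathrm{BF}_{10} = \frac{p(\mathrm{data} \mid H_1)}{p(\mathrm{data} \mid H_0)},
\qquad \mathrm{BF}_{01} = \frac{1}{\mathrm{BF}_{10}}
```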

4.2 Method

4.2.1 Participants

The samples included all applicants to an undergraduate psychology program at a Dutch university. The data consisted of applicants who applied to the program in 2013, 2014, or 2015, and who subsequently enrolled in the program and participated in at least one course. All participants completed a curriculum-sampling test in the admission procedure. The admission committee did not reject any applicants because the number of applicants who did not withdraw their application did not exceed the number of available places. However, this was not known beforehand, and the procedure was thus perceived as high-stakes. The students followed the study program either in English or in Dutch, with similar content. The majority of the students who followed the English program were international students, mostly from Germany. All data were obtained through the university administration. This study was approved by and conducted in accordance with the rules of the Ethical Committee Psychology of the university.

Sample 1

Sample 1 consisted of the 638 applicants who applied to the program and enrolled in 2013. Seventy percent were female and the mean age was 20 years (SD = 2.0). The Dutch program was followed by 43% of the students. The nationalities of the applicants were 46.9% Dutch, 40.6% German, 9.4% other European, and 3.1% non-European.

Sample 2

Sample 2 consisted of the 635 applicants who applied to the program and enrolled in 2014. Sixty-six percent were female and the mean age was 20 years (SD = 1.7). The Dutch program was followed by 42% of the students. The nationalities were 44.7% Dutch, 45.8% German, 7.2% other European, and 2.2% non-European.

Sample 3

Sample 3 consisted of the 531 applicants who applied to the program and enrolled in 2015. Seventy percent were female and the mean age was 20 years (SD = 2.0). The Dutch program was followed by 38% of the students. The nationalities were 43.3% Dutch, 45.8% German, 9.0% other European, and 1.9% non-European.

4.2.2 Measures

Curriculum-sampling test

The curriculum-sampling test was designed to mimic the first course in the program: Introduction to Psychology. The applicants had to study two chapters of the book used in this course. On the selection day, which took place at the university, they took a multiple-choice exam about the material.


A course instructor constructed the exam. Each year, the exams consisted of different items (40 items in 2013 and 2014, 39 items in 2015); the estimated reliability of the tests was α = .81 in 2013, α = .82 in 2014, and α = .76 in 2015. International students could request to take the exams online7. For more details about the admission procedure, see Niessen et al. (2016).

First grade

The first grade obtained in the program qualifies as the result of a more comprehensive curriculum sample than the admission test. This first grade was the grade obtained on the first exam of the course Introduction to Psychology (not including resit scores); this course covered content similar to, but more comprehensive than, the curriculum-sampling test. During the first half of the first semester, students attended non-mandatory lectures, studied a book, and took a multiple-choice exam about the material. The exam was graded on a scale ranging from 1 to 10. In each cohort, some students did not participate in this exam, leading to missing values (2%, 3%, and 4%, respectively). The missing values were handled by listwise deletion in all analyses8. The exact sample sizes for all variables in each sample are shown in Table 4.1 (column 5).

First year GPA

First year GPA (FYGPA) was the mean grade obtained after one academic year; there were 10 course grades when a student completed all courses. Grades were given on a scale from 1 to 10, with a 6 or higher representing a pass. For most courses, students had to study literature on psychological or methodological topics, supplemented with non-compulsory lectures, and were assessed through a final multiple-choice exam. For analyses including both the first grade and FYGPA, the grade on the first course was excluded from FYGPA to avoid inflation of the validity coefficients.
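As a small illustration of this exclusion rule, the hypothetical R sketch below computes FYGPA with and without the first course grade; the data frame grades and its column names are placeholders, not the authors' actual data structure.

```r
# Hypothetical sketch: FYGPA with and without the first course grade.
# 'grades' is assumed to be a data frame with one column per first-year
# course grade, where "course1" is the Introduction to Psychology grade.
fygpa          <- rowMeans(grades, na.rm = TRUE)
fygpa_excl_1st <- rowMeans(grades[, setdiff(names(grades), "course1")],
                           na.rm = TRUE)
```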

Third year GPA

Third year GPA (TYGPA) was available for 493 participants from the first sample; the other students had dropped out of the program. To avoid artificially high correlations between FYGPA and TYGPA, TYGPA was defined as the mean grade obtained in the second and third academic years. The number of courses completed by each student varied, but students were expected to complete the undergraduate program within three years. The courses in the second year were mostly the same for all students, whereas the third year consisted mostly of elective courses in subdisciplines of psychology.

7 The applicants who were present at the test day scored slightly higher than the online test-takers in each cohort, so it seems unlikely that cheating was a major issue for the online test-takers.

8 Although there are more refined ways to handle these missing values, we used this method to be consistent across all analyses. Using more refined methods with the BayesFactor R package is not straightforward.

4.2.3 Frequentist and Bayesian Approach

There were several reasons to supplement the classical frequentist analyses with a Bayesian approach (e.g., Gelman et al., 2014; Kruschke, Aguinis, & Joo, 2012). First, the classical step-down regression analysis (Lautenschlager & Mendoza, 1986) used to study differential prediction has some shortcomings (Aguinis et al., 2010; Berry, 2015; Meade & Fetzer, 2009). Tests for slope differences tend to be underpowered, even in large samples, and tests for intercept differences tend to have inflated Type I error rates (Aguinis et al., 2010). There have been suggestions to overcome these problems (Aguinis et al., 2010; Berry, 2015; Mattern & Patterson, 2013; Meade & Fetzer, 2009), but most are difficult to implement, especially when slope differences are also present (Berry, 2015). A Bayesian approach does not solve all of these problems, but it allows inconclusive results to be distinguished from evidence in favor of the null hypothesis of no differential prediction. Second, the Bayesian approach provides comprehensive tools for parameter estimation and hypothesis testing (e.g., Gelman et al., 2014; Kruschke, Aguinis, & Joo, 2012). Through Bayesian statistics, probabilities for model parameters can be computed after observing the data; thus, p(theory|data) can be computed. Under the classical frequentist framework, a researcher usually computes the probability of observing the data at hand, or more extreme data, given that the model under consideration holds, that is, p(data|theory). Most researchers are, however, interested in assessing the plausibility of research hypotheses given the observed data. In that case, Bayesian statistics typically provides direct answers; under the frequentist approach, we cannot compute p(theory|data) because theories have no stochastic properties, only data do. A third reason for using a Bayesian approach is that it does not suffer from issues such as dependence on unobserved data, subjective data-collection stopping rules (i.e., continuing data collection until a certain result is achieved), multiple testing, and the inability to quantify support for the null hypothesis (Gelman et al., 2014; Wagenmakers, 2007).

In the present study, the Bayesian approach thus has the advantage that, when we use different types of curriculum samples as predictors, we can investigate whether the data are more in agreement with the hypothesis that no differential prediction occurs (i.e., the null hypothesis). In addition, contrary to the well-known confidence intervals used in the frequentist approach, credible intervals (CIs) based on Bayesian analyses can be interpreted as the most probable values of a parameter given the data (e.g., Kruschke & Liddell, 2017).


We, therefore, decided to use Bayesian techniques in our analyses and compared the frequentist results with the Bayesian results.

4.2.4 Analyses

Means, standard deviations, corresponding effect sizes for differences between means, and zero-order correlations between relevant variables were calculated for all samples using a frequentist and a Bayesian approach. For each predictor-criterion combination, we conducted several step-down hierarchical regression analyses (Lautenschlager & Mendoza, 1986), which is a commonly used and recommended approach to differential prediction analysis (Aguinis et al., 2010). This procedure starts with an omnibus test that compares a simple regression model that only includes the main continuous predictor (the curriculum-sampling test score, first course grade, or FYGPA) with a regression model that includes the continuous predictor, gender, and a predictor-gender interaction term. If the result of the omnibus test is statistically significant (i.e., indicative of differential prediction), subsequent sequential tests of slope differences and intercept differences are conducted. Slope differences are determined by testing a regression model including the first-order continuous predictor and gender against a full regression model that also includes an interaction term. When slope differences are detected, intercept differences are assessed by testing a regression model that includes the continuous predictor and the predictor-gender interaction term against a full model that also includes the gender main effect. When slope differences are not detected, intercept differences are assessed by testing a model with the continuous predictor and gender against a model including only the continuous predictor. To reduce multicollinearity, the independent variables were centered around their means before the analyses were conducted. Because we examined two predictor-criterion combinations in three samples and another combination in one sample (FYGPA-TYGPA), seven step-down regression analyses were conducted with both approaches.
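As an illustration of this step-down procedure, the sketch below shows how the model comparisons could be carried out in R with lm() and anova(); the data frame d and the column names (fygpa, test, gender) are hypothetical placeholders, and the analyses reported in this chapter were themselves run in SPSS (frequentist) and with the BayesFactor and rjags R packages (Bayesian).

```r
# Hypothetical illustration of the step-down differential prediction analysis
# (Lautenschlager & Mendoza, 1986) for one predictor-criterion combination.
# Assumed data frame 'd' with columns: fygpa (criterion), test (predictor),
# gender (factor with levels "male"/"female").
d$test_c <- d$test - mean(d$test, na.rm = TRUE)   # center the predictor

m_pred    <- lm(fygpa ~ test_c, data = d)                  # predictor only
m_add     <- lm(fygpa ~ test_c + gender, data = d)         # + gender main effect
m_full    <- lm(fygpa ~ test_c * gender, data = d)         # + interaction term
m_no_main <- lm(fygpa ~ test_c + test_c:gender, data = d)  # interaction, no gender main effect

# Step 1: omnibus test (predictor-only model vs. full model)
anova(m_pred, m_full)

# Step 2: slope differences (additive model vs. full model)
anova(m_add, m_full)

# Step 3a: intercept differences when slope differences were detected
anova(m_no_main, m_full)

# Step 3b: intercept differences when no slope differences were detected
anova(m_pred, m_add)
```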

Frequentist analyses

For the frequentist analyses, an alpha level of .05 was chosen as the significance level for testing the increment in explained variance at each step. In addition, when slope differences were detected, the nature of these differences was studied by applying the Johnson-Neyman technique (Johnson & Neyman, 1936) using the omnibus groups regions of significance (OGRS) macro for SPSS (Hayes & Montoya, 2017). This analysis shows regions of significance, that is, the range of values of the predictor variable for which men and women differ on the criterion variable. This technique provides insight into whether differential prediction occurred within the operational score range. The frequentist analyses were conducted in SPSS version 24.0.

Bayesian analyses

For the Bayesian analyses, the Bayes factor (Kass & Raftery, 1995) was used as a measure of evidence in favor of differential prediction at each step of the regression analyses (Lautenschlager & Mendoza, 1986). The Bayes factor shows the weight of evidence in the data for competing hypotheses, or the degree to which one hypothesis predicts the observed data better than the other. For example, a Bayes factor of H1 against H0 of 3 (denoted BF10 = 3) means that the empirical data are 3 times more likely to occur under H1 than under H0; BF = 1 means that the empirical data are equally likely under both hypotheses (e.g., Gelman et al., 2014; Kass & Raftery, 1995). To interpret the Bayes factors, we used the benchmarks proposed by Kass and Raftery (1995, p. 777)9. The Bayesian analyses were conducted using the R package BayesFactor (Morey & Rouder, 2015) to compute the Bayes factors, and JAGS, version 4.2.0 (Plummer, 2016a), accessed from R with the package rjags, version 4.6 (Plummer, 2016b), for model estimation.
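To illustrate how such Bayes factors can be obtained, the sketch below uses the BayesFactor package's lmBF() function with its default priors for the same step-down comparisons; the data frame and variable names are hypothetical placeholders mirroring the frequentist sketch above.

```r
# Hypothetical illustration: Bayes factors for the step-down comparisons
# using the BayesFactor package (default priors).
library(BayesFactor)

d$gender <- factor(d$gender)                      # gender as a factor
d$test_c <- d$test - mean(d$test, na.rm = TRUE)   # centered predictor
d_cc <- d[complete.cases(d[, c("fygpa", "test_c", "gender")]), ]  # listwise deletion

bf_pred <- lmBF(fygpa ~ test_c,                           data = d_cc)
bf_add  <- lmBF(fygpa ~ test_c + gender,                  data = d_cc)
bf_full <- lmBF(fygpa ~ test_c + gender + test_c:gender,  data = d_cc)

# Ratios of Bayes factors quantify the evidence for one model over another:
bf_full / bf_pred   # omnibus: differential prediction vs. predictor only
bf_full / bf_add    # slope differences (interaction term)
bf_add  / bf_pred   # intercept differences (gender main effect)
```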

Bayesian analysis starts by specifying a prior distribution for the parameters. After data collection, a posterior distribution that combines information from the data and the prior is computed. Posterior distributions often cannot be calculated analytically, so the posterior distribution is approximated using Markov chain Monte Carlo (MCMC) sampling (for details, see Kruschke et al., 2012). The default priors provided in the BayesFactor R package were used to compute the Bayes factors. For model estimation, we used broad priors: a normal prior on the standardized regression coefficients with a mean of zero and a standard deviation of 100, and a uniform prior on the residual variance ranging from zero to ten. The standardized regression coefficients were transformed back to the original scale. We used 1,000 iterations to tune the samplers and 1,000 burn-in iterations before running four MCMC chains of 10,000 iterations each. Convergence of the MCMC chains (Gelman-Rubin convergence diagnostic) and the effective sample sizes were inspected, and no problems were detected.
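A minimal sketch of this estimation set-up, using rjags with the stated priors and MCMC settings, is given below; the JAGS model, variable names, and data list are illustrative assumptions rather than the authors' actual estimation code.

```r
# Hypothetical sketch of the JAGS estimation described above.
library(rjags)  # loads coda as well

model_string <- "
model {
  for (i in 1:n) {
    y[i] ~ dnorm(b0 + b1 * x[i] + b2 * g[i] + b3 * x[i] * g[i], tau)
  }
  # Broad priors: normal(mean 0, sd 100) on the (standardized) coefficients,
  # uniform(0, 10) on the residual variance.
  b0 ~ dnorm(0, 1 / (100^2))
  b1 ~ dnorm(0, 1 / (100^2))
  b2 ~ dnorm(0, 1 / (100^2))
  b3 ~ dnorm(0, 1 / (100^2))
  sigma2 ~ dunif(0, 10)
  tau <- 1 / sigma2
}"

jags_data <- list(y = scale(d_cc$fygpa)[, 1],                 # standardized criterion
                  x = scale(d_cc$test)[, 1],                  # standardized predictor
                  g = as.numeric(d_cc$gender == "female"),    # dummy-coded gender
                  n = nrow(d_cc))

jm <- jags.model(textConnection(model_string), data = jags_data,
                 n.chains = 4, n.adapt = 1000)  # 1,000 adaptation (tuning) iterations
update(jm, n.iter = 1000)                       # 1,000 burn-in iterations
samples <- coda.samples(jm, variable.names = c("b0", "b1", "b2", "b3", "sigma2"),
                        n.iter = 10000)

gelman.diag(samples)      # Gelman-Rubin convergence diagnostic
effectiveSize(samples)    # effective sample sizes
summary(samples)          # posterior means and credible intervals
```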

9 BF = 1-3 or 0.33-1: anecdotal evidence; BF = 3-20 or 0.05-0.33: positive evidence; BF = 20-150 or 0.007-0.05: strong evidence; BF > 150 or < 0.007: very strong evidence. Numbers larger than 1 indicate evidence in favor of the alternative hypothesis and numbers smaller than 1 indicate evidence in favor of the null hypothesis.

(12)

515949-L-bw-niessen 515949-L-bw-niessen 515949-L-bw-niessen 515949-L-bw-niessen Processed on: 5-1-2018 Processed on: 5-1-2018 Processed on: 5-1-2018

Processed on: 5-1-2018 PDF page: 75PDF page: 75PDF page: 75PDF page: 75

given the data (e.g., Kruschke & Liddell, 2017). We, therefore, decided to use Bayesian techniques in our analyses and we compared the frequentist results with the Bayesian results.

4.2.4 Analyses

Means, standard deviations, corresponding effect sizes for differences between means, and zero-order correlations between relevant variables were calculated for all samples using a frequentist and a Bayesian approach. For each predictor-criterion combination we conducted several step-down hierarchical regression analyses (Lautenschlager & Mendoza, 1986), which is a commonly used and recommended approach to differential prediction analysis (Aguinis et al., 2010). This procedure starts with an omnibus test that is used to compare a simple regression model that only includes the main continuous predictor (the curriculum-sampling test score, first course grade, or FYGPA) with a regression model that includes the continuous predictor, gender, and a predictor-gender interaction term. If the result of the omnibus test is statistically significant (i.e., indicative of differential prediction), subsequent sequential tests of slope differences and intercept differences are conducted. Slope differences are determined through testing a regression model including the first-order

continuous predictor and gender against a full regression model also including an interaction term. When slope differences are detected, intercept differences are assessed through testing a regression model that includes the continuous predictor and the predictor-gender interaction term against a full model also including the gender main effect. When slope differences are not detected, intercept differences are assessed by testing a model with the continuous predictor and gender against a model including only the continuous predictor. To reduce multicollinearity, the independent variables were centered around their means before analyses were conducted. Because we examined two predictor-criterion combinations in three samples and another combination in one sample (FYGPA-TYGPA), seven step-down regression analyses were conducted with both approaches.

Frequentist analyses

For the frequentist analyses, an alpha level of .05 was chosen as the significance level for testing the increment in explained variance at each step. In addition when slope differences were detected, the nature of these differences was studied by applying the Johnson-Neyman technique (Johnson & Neyman, 1936) using the omnibus groups regions of significance (OGRS) macro for SPSS (Hayes & Montoya, 2017). This analysis shows regions of significance, that is, the range of values of the predictor variable that show differences in the criterion variable between men and women. This technique provides insight into whether differential prediction

occurred within the operational score range. The frequentist analyses were conducted in SPSS version 24.0.

Bayesian analyses

For the Bayesian analyses, the Bayes factor (Kass & Raftery, 1995) was used as a measure of evidence in favor of differential prediction at each step in the regression analyses (Lautenschlager & Mendoza, 1986). The Bayes factor shows the weight of evidence in the data for competing hypotheses, or the degree to which one hypothesis predicts the observed data better than the other. For example, a Bayes factor of H1 against H0 of 3 (denoted BF10 = 3) means that the empirical data are 3 times more likely to occur under H1 than under H0; BF10 = 1 means that the empirical data are equally likely under both hypotheses (e.g., Gelman et al., 2014; Kass & Raftery, 1995). To interpret the Bayes factors, we used the benchmarks proposed by Kass and Raftery (1995, p. 777)9. The Bayesian analyses were conducted using the R package BayesFactor (Morey & Rouder, 2015) to compute the Bayes factors, and using JAGS, version 4.2.0 (Plummer, 2016a) in R, with the package rjags, version 4.6 (Plummer, 2016b) for model estimation.
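A minimal R sketch of how such Bayes factors can be obtained with the default priors of the BayesFactor package is shown below; it mirrors the three step-down comparisons and reuses the illustrative variable names introduced earlier, with gender treated as a factor. It is an illustration rather than the exact script used for the reported analyses.

# Bayes factors for the step-down comparisons, using the default priors of the
# BayesFactor package. lmBF() requires complete cases and treats factors as
# categorical fixed effects, so gender is converted to a factor here.

library(BayesFactor)

dat_cc <- na.omit(dat[, c("fygpa", "score_c", "gender")])
dat_cc$gender <- factor(dat_cc$gender, levels = c(0, 1), labels = c("men", "women"))

bf_pred <- lmBF(fygpa ~ score_c, data = dat_cc)
bf_main <- lmBF(fygpa ~ score_c + gender, data = dat_cc)
bf_full <- lmBF(fygpa ~ score_c + gender + score_c:gender, data = dat_cc)

bf_full / bf_pred   # omnibus: evidence for any differential prediction
bf_full / bf_main   # slope differences (interaction term)
bf_main / bf_pred   # intercept differences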

Bayesian analysis starts by specifying a prior distribution for the parameters. After data collection, a posterior distribution combining information from the data and the prior is computed. Posterior distributions cannot be calculated directly, so the posterior distribution is approximated based on Markov Chain Monte Carlo (MCMC) sampling (for details, see Kruschke et al., 2012). The default priors provided in the BayesFactor R package were used to compute the Bayes factors. For model estimation, we used broad priors: a normal prior on the standardized regression coefficients with a mean of zero and a standard deviation of 100, and a uniform prior on the residual variance ranging from zero to ten. The standardized regression coefficients were transformed back to the original scale. We used 1,000 iterations to tune the samplers and 1,000 burn-in iterations before running four MCMC chains of 10,000 iterations each. Convergence of the MCMC iterations (Gelman-Rubin's convergence diagnostic) and effective sample size were inspected and no problems were detected.
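The estimation setup can be sketched with rjags as follows, again for a single predictor-criterion combination with the illustrative variable names used above; standardizing the variables beforehand and omitting the back-transformation of the coefficients are simplifications of the procedure described in the text.

# Model estimation with JAGS via rjags, using the broad priors described above:
# normal(0, sd = 100) on the standardized coefficients (JAGS parameterizes the
# normal with precision, here 1/100^2) and uniform(0, 10) on the residual variance.

library(rjags)

model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu[i], tau)
    mu[i] <- b0 + b1 * x[i] + b2 * g[i] + b3 * x[i] * g[i]
  }
  b0 ~ dnorm(0, 1.0E-4)
  b1 ~ dnorm(0, 1.0E-4)
  b2 ~ dnorm(0, 1.0E-4)
  b3 ~ dnorm(0, 1.0E-4)
  sigma2 ~ dunif(0, 10)
  tau <- 1 / sigma2
}"

jags_data <- list(y = as.numeric(scale(dat$fygpa)),
                  x = as.numeric(scale(dat$score)),
                  g = dat$gender,
                  N = nrow(dat))

fit <- jags.model(textConnection(model_string), data = jags_data,
                  n.chains = 4, n.adapt = 1000)   # 1,000 tuning iterations
update(fit, n.iter = 1000)                        # 1,000 burn-in iterations
samples <- coda.samples(fit, variable.names = c("b0", "b1", "b2", "b3", "sigma2"),
                        n.iter = 10000)

gelman.diag(samples)     # Gelman-Rubin convergence diagnostic
effectiveSize(samples)   # effective sample size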

9 BF = 1-3 or 0.33-1: anecdotal evidence, BF = 3-20 or 0.05-0.33: positive evidence, BF = 20-150 or 0.007-0.05: strong evidence, BF > 150 or < 0.007: very strong evidence. Numbers larger than 1 indicate evidence in favor of the alternative hypothesis and numbers smaller than 1 indicate evidence in favor of the null hypothesis.


4.3 Results

Table 4.1 shows descriptive statistics for the curriculum-sampling test, the first course grade, FYGPA, and TYGPA in each cohort for men and women, and effect sizes for the difference in scores between men and women based on frequentist and Bayesian analyses. All differences were small and most were not statistically significant. Inspection of the Bayes factors showed positive evidence (Kass & Raftery, 1995) that men performed better than women on the curriculum-sampling test in 2015 (BF10 = 4.67), but all credible values for the effect size of the difference were small (95% CI [.08, .45]). There was anecdotal evidence that women performed better than men in the first course in 2013 (BF10 = 1.12) and strong evidence that women performed better than men in the first year in 2013 (BF10 = 153.10), both with small credible effect sizes (95% CI [-.39, -.02] and [-.50, -.16], respectively).

Table 4.2 shows the zero-order correlations for each predictor-criterion combination in each cohort. There was perfect correspondence between the frequentist and the Bayesian analyses, so they are not depicted and discussed separately. The curriculum-sampling tests showed moderate to large correlations with FYGPA, the first grade showed large correlations with FYGPA, and FYGPA showed large correlations with TYGPA. These results were consistent across cohorts.

Table 4.1
Means, standard deviations, and gender differences.

Variable       Sample   Overall: M (SD), n     Men: M (SD)    Women: M (SD)   d [95% CI]           σ [95% CrI]          BF10
Cur. sample    2013     29.73 (5.16), 638      29.32 (5.50)   29.91 (5.00)    -.11 [-.28, .06]     -.11 [-.27, .06]       0.22
               2014     29.90 (5.45), 635      30.01 (5.58)   29.85 (5.38)     .03 [-.14, .19]      .03 [-.13, .19]       0.10
               2015     29.19 (4.73), 531      30.07 (3.95)   28.82 (4.98)     .27* [.08, .45]      .26 [.08, .44]        4.67
First grade    2013      6.63 (1.37), 625       6.45 (1.49)    6.71 (1.31)    -.19* [-.36, -.02]   -.19 [-.36, -.02]      1.12
               2014      6.82 (1.53), 614       6.88 (1.55)    6.79 (1.53)     .06 [-.11, .23]      .06 [-.11, .22]       0.12
               2015      6.34 (1.42), 511       6.42 (1.55)    6.31 (1.36)     .08 [-.11, .27]      .07 [-.11, .26]       0.15
FYGPA          2013      6.63 (1.30), 638       6.33 (1.46)    6.76 (1.20)    -.34* [-.50, -.17]   -.33 [-.50, -.16]    153.10
               2014      6.45 (1.35), 635       6.33 (1.37)    6.50 (1.34)    -.13 [-.29, .04]     -.12 [-.29, .04]       0.28
               2015      6.64 (1.25), 531       6.60 (1.36)    6.66 (1.20)    -.05 [-.23, .14]     -.04 [-.23, .14]       0.12
TYGPAa         2013      6.94 (1.00), 493       6.81 (1.13)    6.99 (0.94)    -.18 [-.38, .02]     -.17 [-.36, .02]       0.53

Note. a First year results were excluded from third year results. d = Cohen's d corrected for differences in sample size (also referred to as Hedges' g); 95% confidence intervals are between brackets. σ is the estimated effect size based on the Bayesian analysis; 95% credible intervals are between brackets. BF10 shows the Bayes factor for the evidence in favor of the alternative hypothesis relative to the null hypothesis. Men were coded 0, women were coded 1. * p < .05.

Table 4.2
Zero-order correlations between all variables per sample.

Predictor-criterion combination    2013 sample       2014 sample       2015 sample
Cur. sample - FYGPA                .49 [.43, .55]    .45 [.39, .51]    .44 [.37, .51]
First grade - FYGPAa               .75 [.72, .78]    .70 [.66, .74]    .68 [.63, .72]
FYGPA - TYGPAb                     .78 [.74, .81]    -                 -

Note. a For this correlation, the first grade was excluded from the FYGPA calculation. b For this correlation, first year results were excluded from third year results. The same results were obtained using the frequentist and Bayesian analyses, so they are not depicted separately. 95% credible intervals are between brackets. All correlations were statistically significant with p < .01.
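The d column in Table 4.1 is Cohen's d with a small-sample correction (Hedges' g). As an illustration, it can be computed from group summary statistics as in the sketch below; because the group sizes for men and women are not reported separately in the table, the n values used here are placeholders and the result only approximates the 2015 curriculum-sample row.

# Hedges' g (Cohen's d corrected for sample size) from group means and SDs.
hedges_g <- function(m1, s1, n1, m2, s2, n2) {
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))  # pooled SD
  d  <- (m1 - m2) / sp
  d * (1 - 3 / (4 * (n1 + n2) - 9))                                # small-sample correction
}

# Men vs. women, curriculum-sampling test 2015 (group n's are placeholders).
hedges_g(30.07, 3.95, 150, 28.82, 4.98, 381)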

