
Group differences in intelligence:

Spearman’s hypothesis or Spearman’s law?

Michael van den Hoek (6057675)

Master thesis

University of Amsterdam

Work and Organizational Psychology Supervisor: Dr. Jan te Nijenhuis

Second Reviewer: Dr. Annelies van Vianen March 2015

Acknowledgement

I would like to thank Dr. Jan te Nijenhuis for overseeing my thesis progress. Secondly, I would like to thank the people who supplied invaluable datasets for this project, namely: Elijah Armstrong, Prateek Yadav, Dr. Conor Dolan, Dr. Alsedig Abdalgadr Al-Shahomee, Dr. Andrei Grigoriev, Dr. Jüri Allik, Dr. Nermin Djapo, Dr. Thomas Coyle, Dr. Heiner Rindermann, Dr. Leandro Almeida, Dr. Gena Lemos, and Dr. Richard Lynn. Thirdly, I would like to thank Dr. Henk van der Flier for answering some fundamental methodological questions. Fourthly, I would like to thank Dr. Alex Beaujean for his feedback on parts of this project. Lastly, I want to thank my girlfriend, Shao Yeung, for her patience with me during this project.

Special notes and acknowledgements

It is important to note that my thesis is one of many in an ongoing meta-analytical project. Most of these theses used psychometric meta-analytical methods and the method of correlated vectors to explore various research topics, especially intelligence and the general factor of personality in relation to other variables. Since much of my work builds upon the findings of my predecessors, I would like to thank and acknowledge them as well: Birthe Jongeneel-Grimen, Rosina van Bloois, Lise-Lotte Geutjes, Jan Smit, Joep Dragt, Dennys Franssen, Jasper Repko, Denise Willigers, Evelien van Meerveld, Daniel Metzen, and Esteban van der Boor. Having built on some of their ideas and methods, I have tried my utmost to give due credit to my predecessors whenever I made use of their work or ideas in my thesis. This was done by adding a reference directly, or by adding one at the end of any paragraph that borrows even slightly from the ideas of my predecessors. Furthermore, the general reporting of the methods and results in this paper closely follows the format of my predecessors, since this is the common way to report analyses and outcomes of this nature. The formats and wording for most tables, the sections on the searching and screening of studies, calculating d for groups, and reporting of the results from the meta-analysis are mostly taken and adapted from te Nijenhuis and Dragt (2010), te Nijenhuis and Franssen (2010), and te Nijenhuis and Metzen (2012). Similar to te Nijenhuis and Metzen (2012), whenever I took something verbatim from, adapted something from, or closely based something on the work or methodology of my predecessors, this is noted with a footnote that includes a reference to the source.

Articles Resulting From This Master Thesis

1. te Nijenhuis, J., van den Hoek, M., & Armstrong, E. (in press). Spearman’s hypothesis and Amerindians: A meta-analysis. Intelligence.

2. te Nijenhuis, J., Al-Shahomee, A. A., van den Hoek, M., Allik, J., Grigoriev, A., & Dragt, J. (in press). Spearman’s hypothesis tested comparing Libyan secondary school children with various other groups of secondary school children on the items of the Standard Progressive Matrices.

3. te Nijenhuis, J., Al-Shahomee, A. A., van den Hoek, M., Grigoriev, A., & Repko, J. (in press). Spearman’s hypothesis tested comparing Libyan adults with various other groups of adults on the items of the Standard Progressive Matrices. Intelligence.

4. te Nijenhuis, J., van den Hoek, M., Beaujean, A. A., & Spanoudis, G. (in preparation). Spearman’s hypothesis tested comparing Libyan primary school children with various other groups on the items of the Standard Progressive Matrices.

Table of contents

Acknowledgement
Articles Resulting From This Master Thesis
Table of Contents
Abstract
Introduction
  Intelligence testing and group differences
  Factors of intelligence and g
  Spearman’s hypothesis and the “Jensen effect”
  Spearman’s hypothesis tested on different ethnic groups
  Purpose of the study
General Method
  Methods of meta-analysis and correcting for artifacts
  Correcting for unequal group sizes in a data point
  Method of Correlated Vectors
  Testing Spearman’s hypothesis
  Criteria for confirmation of Spearman’s hypothesis
  Choice of SD used in calculating the d difference scores
  Selecting a g loading for calculating r (d x g)
  Selecting comparison groups
  Language bias in tests
  Basic inclusion criteria
  Descriptions of tests used in studies
Study 1: Spearman’s Hypothesis tested on Twin Data using a large number of subtests
  Method
  Participants
  Instruments
  Results
  Conclusion
Study 2: Spearman’s hypothesis tested on working-age Black adults: A meta-analysis
  Method
  Searching and screening studies
  Instruments
  Calculating d for Black and White adults
  Results
  Conclusion
Study 3: Spearman’s hypothesis tested on Amerindians: A meta-analysis
  Method
  Searching and screening studies
  Calculating d for Amerindians
  Results
  Conclusion
Study 4: Spearman’s hypothesis tested on Hispanics: A meta-analysis
  Method
  Searching and screening studies
  Calculating d for Hispanics
  Corrections for language bias
  Results
  Conclusion
Study 5: Spearman’s hypothesis tested on non-traditional cognitive batteries: A meta-analysis
  Method
  Searching and screening studies
  Specific inclusion criteria
  Instruments and samples
  Calculating d between groups
  Corrections for language bias
  Results
Study 6: Non-confirmation of Spearman’s hypothesis for Black and White prisoners and Northeast Asians?: Two meta-analyses
  Black and White prisoners
  Method
  Searching and screening studies
  Calculating d for Black and White prisoners
  Instruments
  Results
  Conclusion
  The intelligence of Northeast Asians
  Method
  Searching and screening studies
  Specific criteria for inclusion
  Calculating d for Northeast Asians
  Results
  Conclusion
Study 7: Spearman’s hypothesis tested between regions in Portugal, Spain, Italy, the U.S., and India: A meta-analysis
  Method
  Searching and screening studies
  Instruments
  Calculating d for regions
  Determining Nharmonic for PISA and NAS data
  Results
  Conclusion
Study 8: Spearman’s hypothesis tested between countries
  PISA
  TIMSS
  PIRLS
  Instruments
  Score corrections for specific countries
  Corrections for school attendance and age
  Calculating d for PISA, TIMSS, and PIRLS
  g loadings for PISA, TIMSS, and PIRLS data
  Results
  Conclusion
Study 9: Spearman’s hypothesis tested on Libyans using the Standard Progressive Matrices: A meta-analysis
  Method
  Searching and screening for studies
  Calculating d scores for the Standard Progressive Matrices
  Calculating the g loadings for the Standard Progressive Matrices
  Matching different samples for comparison
  Description of Libyan data
  Description of comparison groups
  Results
  Conclusion
General Discussion
  Limitations of the Studies
  Practical Implications
  Conclusion

Abstract

Spearman’s hypothesis posits that group differences are mostly explained by differences in general intelligence, or g. In this study we tested Spearman’s hypothesis on the intelligence differences between several groups, as well as on the differences between regions and countries, to see if these differences are due to differences in g.

Using the method of correlated vectors we tested Spearman’s hypothesis on Black and White twins, Black and White adults, Black and White prisoners, Amerindians, Hispanics, Northeast Asians, as well as several different ethnic groups using non-verbal and culture-reduced tests. Furthermore, we also tested Spearman’s hypothesis between regions in the U.S., Portugal, Spain, Italy, and India, between Libya and several other countries, and between many countries using achievement test data. When using the method of correlated vectors, a strong correlation between the group differences in subtest scores and the g loadings of the subtests is considered a confirmation of Spearman’s hypothesis.

We found strong confirmation of Spearman’s hypothesis for Black and White twins (r = .69, Nharmonic = 837), Black and White adults (r = .57, k = 15, Nharmonic = 142,857), and Hispanics (r = .73, k = 16, Nharmonic = 57,806), after correcting for language bias in the Hispanic samples. For Amerindians we found that Spearman’s hypothesis was strongly confirmed on both the entire test batteries and verbal subtests of test batteries, but we did not find a confirmation on the non-verbal or performance subtests of test batteries (r = .62, k = 16, Nharmonic = 5,267). We also found that Spearman’s hypothesis still holds when using non-traditional intelligence tests, but only when they are corrected for language bias and exclude Asian samples (r = .61, k = 17, Nharmonic = 6,177). Furthermore, we did not find a strong confirmation of Spearman’s hypothesis for Black and White prisoners (rtotal = -.18, rverbal = -.41, rperformance = .43, k = 2, Nharmonic = 506) or Northeast Asians (rtotal = .01, rverbal = -.27, rperformance = .44, k = 15, Nharmonic = 5,418) on either the entire test batteries or the Verbal/Performance subscales.

For the differences in intelligence between regions within countries, we did not find a confirmation of Spearman’s hypothesis for any of the countries in question (r = .28, k = 100 (with 80 regions), Nharmonic = 193,820). However, the average correlation is still positive albeit

country to all lower-scoring countries and then averaging the correlations (r = .04, k = 2,385 separate tests of Spearman’s hypothesis using 79 countries). The last test of Spearman’s hypothesis for the differences between Libya and several other countries found substantial correlations between d and g when comparing Libya to other countries (r = .75, k = 12, Nharmonic = 9,005).

Overall it seems that Spearman’s hypothesis remains strongly confirmed when comparing different ethnic groups to each other, with the exception of Northeast Asians and Black/White prisoners. The confirmation for most groups besides Asians remains true even when using culture-reduced or non-verbal tests of intelligence. The differences between regions and countries do not appear to be Jensen effects. However, the exception to this seems to be when countries with strongly different ethnicities are compared, in this case Whites compared to Libyan Arabs and Berbers. The implications and limitations of these findings are fully discussed in the general discussion.

Introduction

Intelligence testing has been around for more than a century now, and has come a long way from its humble beginnings as a mere diagnostic tool, based on limited data, to the versatile, commonplace, and well-researched tool it is today. In modern times intelligence tests are commonly used to decide, or at least inform, the placement of students and to select personnel in work settings. This is done for good reason: those who score higher tend to perform better in many different aspects of their life, such as work and education. For example, intelligence is a strong predictor of scholastic achievement (Deary, Strand, Smith, & Fernandes, 2007) as well as many different work performance outcomes and occupational attainments (e.g. Hunter, 1986; Schmidt & Hunter, 2004). Intelligence is not only important for individual outcomes: research has shown that the average intelligence of a whole country bears a strong relation with the overall wealth and well-being of that country as well. More cognitively capable countries are found to have a higher gross domestic product (GDP) compared to less cognitively capable nations (Dickerson, 2006; Hunt & Wittmann, 2008). (te Nijenhuis & Franssen, 2010; te Nijenhuis & Repko, 2011)

The correlation between intelligence and work performance seems to be mainly based on the general mental ability (g) of people, rather than on their IQ scores, which can be considered a measure of both g and skills and/or other variables (e.g. Ree, Earles, & Teachout, 1994). This means that people who score higher on general intelligence, g, are likely to perform better at work. Furthermore, it also means that the quality of an intelligence test as a predictor of work performance and as a selection tool is based on how well it measures g and how well it is able to differentiate between people on this variable.

Intelligence testing and group differences

There is much dispute as to the justifiability of using intelligence test results as predictors for performance of minorities and/or non-Whites. A common claim is that intelligence tests are biased against ethnic minorities, immigrants, and non-Whites, because they usually obtain lower scores on many different tests of cognitive abilities (e.g. te Nijenhuis, Evers, & Mur, 2000; te Nijenhuis & van der Flier, 1997). It is important to mention that there are many sources of possible bias that are claimed to be responsible for these differences. Common examples of such biases are stereotype threat (Steele & Aronson, 1995), language bias (Brown, Reynolds, & Whitaker, 1999), and cultural bias (Jensen, 1980). However, these claims are commonly disputed, and the position defended here is that these differences are not a matter of bias but rather reflect existing group differences in general intelligence or g (Jensen, 1998).

It is likely that the percentage of minorities and immigrants in the workforce will grow substantially for most Western countries. In many Western countries there is a strong trend for minority groups to grow substantially from year to year, while the growth of the majority group tends to stagnate. For example, census data reported by Passel, Livingstone, and Cohn (2012) show that minority births have started to outnumber majority births in the United States. The largest of these minority groups is the Hispanic group, followed by Blacks. Similar recent data published by the Dutch Central Bureau of Statistics (CBS; van Duin & Stoeldraijer, 2014) show estimates where the growth of the majority group is relatively low compared to the growth of the minority groups. What this means is that these minority populations will become a larger part of the workforce in the future, and therefore it is important to know whether current selection methods for personnel are valid for this growing section of the workforce. Since the core variable for predicting work performance using intelligence testing is g, it is important to know if these intelligence differences found between groups are due to differences in g, or due to other variables.

However, group differences in intelligence have not only been established between ethnic groups. Previous studies have also found substantial intelligence differences between regions within countries (e.g. Almeida, Lemos, & Lynn, 2011), as well as differences in intelligence between countries (see Lynn & Vanhanen, 2002). Since the international market often deals with expatriates and migrant workers, it is important for companies looking to attract workers from these populations to know if these differences between countries and regions can be attributed to differences in general intelligence, or are better explained by environmental factors such as education, socioeconomic status, or cultural factors. If differences between countries are not based on differences in general intelligence, the predictive validity of intelligence tests will not be the same for people from different countries.

It has been stated that there is a strong genetic component in the group differences in intelligence (Jensen, 1998). However, the strength of this genetic component is still a matter of rather intense discussion. For example, te Nijenhuis, de Jong, Evers, and van der Flier (2004) found that second-generation immigrants perform better in work and educational settings than first-generation immigrants, indicating that environment and integration do play an important part, but the minority groups still did not perform as well as the majority group. Of course, these are important concerns for the field of intelligence research and selection psychology, since they bring up the very fundamental concerns of fairness, equality, and, most importantly, equity. Therefore it is very important to understand what causes these group differences on intelligence tests, whether they are differences in g, and if so, whether the decisions based on intelligence scores are actually justified.

Factors of intelligence and g

Sir Francis Galton was convinced that people had a general mental ability, just as people might have general physical abilities. He proposed that just as one’s ability for all kinds of physical activities such as sports, heavy physical labor, and lifting are partially based on a general physical ability, so too are mental activities such as math, language, and the sciences based on a general mental ability (Galton, 1869).

Charles Spearman was the first researcher to actually find a way to calculate the general mental ability hypothesized by Galton, for which he devised a new statistical method called factor analysis. This method made it possible to find out how heavily certain abilities loaded on this general mental ability (referred to as the g factor). Spearman hypothesized a two-factor model where individual tests consisted of the general mental ability g and skill s, but s alone did not explain all the variability in the data (Jensen, 1998). Later research showed that there are many different factors besides g. Thurstone found seven different factors, besides g, that he dubbed the Primary Mental Abilities (PMA), namely the verbal, number, space, memory, perceptual, word, and inductive factors (Thurstone, 1938; 1940). Most models following Thurstone’s original model, including the famous and widely used Cattell-Horn-Carroll theory of intelligence (Cattell-Horn-Carroll, 1993), have built on the finding that intelligence consists of many different factors as well as the fact that all tests of mental ability measure g to a certain extent.

Spearman’s hypothesis and the “Jensen effect”

According to Jensen (1998), Galton was not only convinced that people had a general mental ability that played a role in all tasks that require cognitive effort, but he also believed that this general mental ability was genetically determined just like many physical features were. He also speculated about the difference between the general mental ability of Whites and Blacks based on his extensive travels and contact with African people. He believed that the general mental abilities of Africans were lower than those of his English countrymen, which is currently a very controversial position.

However, his assumptions were not based on empirical data, nor did he have the means to test his theory at that time (Jensen, 1998). Spearman’s hypothesis itself was originally theorized by Spearman (1927) and was properly researched for the first time by Jensen (1985). Spearman’s hypothesis states that the magnitude of the Black-White differences found in intelligence is strongly, if not wholly, correlated with the g loading of the tests, subtests, or individual items. The g loading is best conceptualized as an estimate of the cognitive difficulty of a test, subtest, or individual item, which is calculated using factor analysis to find the common factor, g, in all of the subtests or items (Gottfredson, 1997; Jensen, 1998; te Nijenhuis & Metzen, 2012). Jensen (1998) found that the IQ of Blacks was lower than the IQ of Whites by about 14 to 18 points, and that the differences at the subtest level between the groups were correlated with the g loadings of the subtests. The largest differences between the groups were found on subtests with the highest g loadings. After Jensen’s test of Spearman’s hypothesis using his own method of correlated vectors (MCV), the effects found using this method were often dubbed “Jensen effects”. (te Nijenhuis & Metzen, 2012)

Earlier research on Spearman’s hypothesis has mostly investigated the differences between Whites and Blacks. This hypothesis on the nature of Black-White differences has already been confirmed multiple times, mostly in the United States (e.g. Hartmann, Kruuse, & Nyborg, 2007; Naglieri & Jensen, 1987; Nyborg & Jensen, 2000) and Africa (e.g. Lynn & Owen, 1994; Rushton & Jensen, 2003) among other places. But research on Spearman’s hypothesis has also gone beyond Black/White differences and should now focus on the comparative intelligence differences of other ethnic groups as well.

Spearman’s hypothesis tested on different ethnic groups and using alternative tests

Group differences in intelligence have also been found for ethnic groups other than Blacks. For example, in the Netherlands, te Nijenhuis and van der Flier (1997) found that Antillean, North African, and Turkish immigrants scored about one standard deviation lower on the General Ability Test Battery (GATB) than the White Dutch majority. After controlling for language bias it was also clear that the magnitude of these differences correlated strongly with the g loading of the subtests. Furthermore, Spearman’s hypothesis has also been supported using two Hispanic samples (Hartmann, Kruuse, & Nyborg, 2007). Using data from the Centers for Disease Control (CDC) and the NLSY1979 survey they found that Hispanics score about .8 standard deviations below Whites on intelligence tests and found that these differences in scores correlate strongly with the g loading of the subtests.

However, these differences are not unique to intelligence tests alone and are found in many activities that require sufficient cognitive effort. For example, it has been found that there are group differences between ethnic groups on common tools that companies use to select their personnel, specifically Assessment Centers and Situational Judgment Tests (Dean, Roth, & Bobko, 2008; Goldstein, Yusko, Braverman, Smith, & Chung, 1998; Whetzel, McDaniel, & Nguyen, 2008). For both selection tools it was found that the differences between groups were directly correlated with the g loading of the tasks. Last but not least, these results have even been confirmed using tests of simple reaction time. Jensen (1993) used three different tests of reaction time, called elementary cognitive tasks (ECT), to test Spearman’s hypothesis and found substantial correlations between d and g. (te Nijenhuis & Repko, 2010)

Purpose of the study

The purpose of the present study is to test how far we can extend Spearman’s hypothesis, leading to a better understanding of the causes of group differences in intelligence, as well as giving us a better picture of the fairness of one of the best predictors of work performance for minorities and immigrants. We did this by carrying out nine independent empirical tests of Spearman’s hypothesis.

We meta-analytically tested the hypothesis on Amerindians, Hispanics, Asians, Black and White prisoners, Black and White adults and Black and White twins. Furthermore, we tested Spearman’s hypothesis using non-traditional intelligence tests to see if Spearman’s hypothesis is confirmed even when using supposedly unbiased tests. We also analyzed the differences between regions within countries to see if there is a Jensen effect between the different regions. Furthermore, we analyzed the differences between countries to see whether differences at the national level are also Jensen effects using internationally standardized school achievement tests that have been shown to be excellent measures of general intelligence. Lastly, we also tested Spearman’s hypothesis using substantial samples of Libyan subjects and comparing them to subjects from other countries on the Raven’s Progressive Matrices.

What sets this study apart from previous studies of Spearman’s hypothesis is that we look at several ethnic groups on which Spearman’s hypothesis has not sufficiently been tested yet or not tested yet at all. This is also the first time the hypothesis is tested on regions and countries.

Most of the data reported in these studies have never before been used to test Spearman’s hypothesis, nor were there any meta-analytical overviews for many of the groups in question, making most of these findings both new and a substantial addition to differential intelligence research and personnel selection research.

General Method

Methods of meta-analysis and correcting for artifacts

Similar to my predecessors, I will be using psychometric meta-analytical techniques to adjust for sampling error (e.g. te Nijenhuis & Metzen, 2012; te Nijenhuis & Repko, 2010; te Nijenhuis & Willigers, 2011). In their influential book Hunter and Schmidt (2004) argue for the use of meta-analysis, which is best described as the aggregation of data from different studies and datasets, which are then corrected for statistical and study artifacts. They list 11 different artifacts that can influence the outcome of studies and they state that by reducing the influence of these artifacts we can obtain a much clearer picture of the relationship between variables. These artifacts are sampling error, error of measurement in the dependent variable, error of measurement in the independent variable, dichotomization of a continuous dependent variable, dichotomization of a continuous independent variable, range variation in the independent variable, range variation in the dependent variable, deviation from perfect construct validity in the independent variable, deviation from perfect construct validity in the dependent variable, reporting error, and variance due to extraneous variables.

Due to the limited amount of time available for carrying out the work on a master thesis, in this study we carry out a bare-bones meta-analysis, so we will only correct for sampling error in the data. This is the error that is introduced into data by the use of small samples in studies. Estimates from smaller samples vary more from sample to sample, which can cause differences to appear significant when they are not, or vice versa, simply due to the random nature of the sample selected. This correction was carried out using the meta-analytical software developed by Schmidt and Le (2004).
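To make the idea of a bare-bones correction concrete, the following is a minimal Python sketch of the usual Hunter and Schmidt bare-bones computations for correlations: a sample-size-weighted mean correlation, the weighted observed variance, and the variance expected from sampling error alone. It is an illustration only and not the Schmidt and Le (2004) program; the function name, the result fields, and the example values are hypothetical, and weighting by the sample size of each data point is an assumption about how the weights are supplied.

```python
from dataclasses import dataclass


@dataclass
class BareBonesResult:
    mean_r: float              # sample-size-weighted mean correlation
    observed_var: float        # weighted variance of the observed correlations
    sampling_error_var: float  # variance expected from sampling error alone
    residual_var: float        # observed minus sampling error variance (floored at 0)


def bare_bones(rs, ns):
    """Bare-bones meta-analysis of correlations: corrects for sampling error only.

    rs: observed correlations, one per data point (hypothetical input).
    ns: sample size used as the weight of each data point (assumed input).
    """
    if len(rs) != len(ns) or not rs:
        raise ValueError("rs and ns must be non-empty and of equal length")
    total_n = sum(ns)
    # Weighted mean correlation across data points.
    mean_r = sum(n * r for r, n in zip(rs, ns)) / total_n
    # Weighted variance of the observed correlations around that mean.
    observed_var = sum(n * (r - mean_r) ** 2 for r, n in zip(rs, ns)) / total_n
    # Expected sampling error variance, based on the average sample size.
    mean_n = total_n / len(ns)
    sampling_error_var = (1 - mean_r ** 2) ** 2 / (mean_n - 1)
    residual_var = max(observed_var - sampling_error_var, 0.0)
    return BareBonesResult(mean_r, observed_var, sampling_error_var, residual_var)


# Hypothetical data points: two correlations with unequal sample sizes.
result = bare_bones([0.5, 0.7], [100, 300])
print(result)
```

If the residual variance is close to zero, the spread among the observed correlations is about what sampling error alone would produce.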

Lastly, we have decided to use Pearson’s r instead of Spearman’s rho in the MCV as this was also done in previous studies of Spearman’s hypothesis (e.g. te Nijenhuis & Dragt, 2010; te Nijenhuis, van Vianen, & van der Flier, 2007; see te Nijenhuis & Dragt, 2010 for more information).

Correcting for unequal group sizes in a data point

In their influential book on psychometric meta-analysis Hunter and Schmidt (2004) use the sum of all participants in all groups in a study that is used as a data point in a meta-analysis as the value of the total sample size. However, in the data points in the current study there was often a large disparity between group sizes; for instance, quite often studies report data on 100 Blacks and 1000 Whites. A sample of 100 has quite substantial sampling error, whereas a sample of 1000 has a much smaller sampling error.

What is a good indicator of the sample size of such a data point combining two datasets? The strictest choice would be to simply use the value of the smallest sample. However, this would ignore the positive influence of the much larger sample on the sampling error of the data point. A comparison could be made with testing the means of samples of unequal size for significance: A difference between samples of 900 and 100 reaches significance less quickly than the difference between samples of 500 and 500, notwithstanding the fact that the total N is equal. Stated differently, the increase in precision for the sample of 900 does not outweigh the decrease in precision for the sample of 100. A harmonic N takes this into account.

There are several formulas for harmonic N that could be used. A common formula is

$$N_{harmonic} = \frac{N}{\sum_{i=1}^{N} \frac{1}{x_i}}$$

where N is the number of groups and x_i is the size of each individual group (Klockars & Sax, 1987; te Nijenhuis & Dragt, 2010). The advantage of this formula is that for a data point with samples of 100 and 900 the value of the harmonic N = 180, which is quite close to the value of the smallest sample, indicating quite strong sampling error. However, the disadvantage of this formula is that for a data point with samples of 15 and 15 the total sample size is only 15, and that for a data point with samples of 500 and 500 the total sample size is only 500.

te Nijenhuis and van der Flier (2013) used the formula

$$N_{harmonic} = \frac{N^2}{\sum_{i=1}^{N} \frac{1}{x_i}}$$

where N is the number of groups and x_i is the size of each individual group. For a data point with samples of 100 and 900 the value of the harmonic N then becomes 360 (see Table 1), which is quite conservative, but not as strict as the value of only 180 for the first formula. For data points with samples of 15 and 15, the total sample size now becomes 30 (see Table 1), and for a data point with samples of 500 and 500 the total sample size now becomes 1000, which is in line with the reasoning in Hunter and Schmidt (2004) mentioned above. We therefore continue to use this formula, which is based on sound reasoning, namely that data points consisting of samples with widely differing Ns receive a substantially reduced weight in a meta-analysis, and that data points based on samples with highly comparable Ns receive a weight based on the total number of research participants in these samples.

Table 1

Various values for the harmonic N of data points with two samples using two different formulas

N of Group 1 (x1)    N of Group 2 (x2)    Formula 1    Formula 2
15                   15                   15           30
500                  500                  500          1000
100                  900                  180          360

Note. Since the current data points consist of two comparison groups this gives the formula as used in te Nijenhuis and van der Flier (2013). In both formulas, N is the number of groups in the comparison.
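The values in Table 1 can be checked with a short Python sketch. The two expressions used here, N / Σ(1/x_i) for the first formula and N² / Σ(1/x_i) for the second, are inferred from the values reported in Table 1 and from the definitions of N and x_i in the text, so they should be read as an assumption; the function names are hypothetical.

```python
def harmonic_n_formula1(sizes):
    """First formula: the harmonic mean of the group sizes, N / sum(1 / x_i)."""
    if not sizes or any(x <= 0 for x in sizes):
        raise ValueError("sizes must be a non-empty list of positive numbers")
    return len(sizes) / sum(1 / x for x in sizes)


def harmonic_n_formula2(sizes):
    """Second formula: N times the harmonic mean, N**2 / sum(1 / x_i)."""
    return len(sizes) * harmonic_n_formula1(sizes)


# The three two-group data points listed in Table 1.
for x1, x2 in [(15, 15), (500, 500), (100, 900)]:
    f1 = harmonic_n_formula1([x1, x2])
    f2 = harmonic_n_formula2([x1, x2])
    # Rounded to whole participants for display.
    print(x1, x2, round(f1), round(f2))
```

With equal group sizes the second expression returns the total number of participants, and with unequal sizes it returns less than the total, which is the behaviour the text describes.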

Method of Correlated Vectors

As stated by Arthur Jensen (1998), the method of correlated vectors allows us to correlate the g factor, the measure of the cognitive difficulty of a task, with a secondary variable of interest such as ethnicity or gender. The g vector generally consists of the g loadings of the subtests of an IQ battery, while the second vector is often an effect size in the form of a correlation or an estimation of the difference between two groups. The method of correlated vectors consists of taking the column with the g loading of each subtest in an intelligence battery and correlating it with the column of the effect size of the secondary variable of interest on those same subtests (Jensen, 1998; te Nijenhuis & Franssen, 2010). In cases where Spearman’s hypothesis is tested, this effect size is often expressed in the d score, an estimate of the standardized difference between groups.

The d score was calculated in a similar fashion as was done in te Nijenhuis and Franssen (2010), by subtracting the score of the lower-scoring group from the score of the higher-scoring group. These differences in subtest scores between the groups are then correlated with the g loadings of the subtests. A strong positive correlation indicates that the differences between groups on the subtests become larger as the g loadings of the subtests increase, a strong negative correlation indicates that the differences between groups on the subtests become smaller as the g loadings of the subtests increase, and a weak or non-existent correlation indicates there is no relation between the differences between groups and g loading. In this paper the correlation obtained with the method of correlated vectors is indicated as r (d x g).
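The method of correlated vectors described above can be sketched in a few lines of Python. This is a hedged illustration: the subtest values are invented, the helper names are hypothetical, and dividing the mean difference by a standard deviation to obtain d is an assumption based on the description of d as a standardized difference (which standard deviation to use is a separate choice).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equally long vectors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a vector without variation has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


def d_score(mean_high, mean_low, sd):
    """Standardized difference: higher-scoring minus lower-scoring group, over an SD."""
    return (mean_high - mean_low) / sd


# Hypothetical subtests: (g loading, mean higher group, mean lower group, SD).
subtests = [
    (0.45, 10.2, 9.0, 3.0),
    (0.55, 10.8, 8.7, 3.0),
    (0.65, 10.3, 8.5, 3.0),
    (0.75, 11.0, 8.0, 3.0),
]

g_vector = [g for g, _, _, _ in subtests]
d_vector = [d_score(high, low, sd) for _, high, low, sd in subtests]

# r (d x g): the correlation between the vector of d scores and the g vector.
r_dxg = pearson_r(d_vector, g_vector)
print(f"r (d x g) = {r_dxg:.2f}")
```

A positive value means the group differences grow with the g loadings of the subtests, and a negative value means they shrink.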

Testing Spearman’s hypothesis4

te Nijenhuis and Dragt (2010) cite Jensen (1993) who states that seven methodological requirements for the testing of Spearman's hypothesis have to be met when using the subtests of IQ batteries:

1. The samples should not be selected on any highly g-loaded criteria.

2. The variables should have reliable variation in their g loadings.

3. The variables should measure the same latent traits in all groups. The congruence coefficient of the factor structure should have a value of >.85.

4. The variables should measure the same g in the different groups; the congruence coefficient of the g values should be >.95.

5. The g loadings of the variables should be determined separately in each group. If the congruence coefficient indicates a high degree of similarity, the g loadings of the different groups should be averaged.

6. To rule out the possibility that the correlation between the vector of g loadings (Vg) and the vector of mean differences between the groups, or effect sizes (VES), is strongly influenced by the variables' differing reliability coefficients, Vg and VES should be corrected for attenuation by dividing each value by the square root of its reliability.

7. The test of Spearman's hypothesis is the Pearson correlation (r) between Vg and VES. To test the statistical significance of r, Spearman's rank order correlation (rs) should be computed and tested for significance.
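Requirements 3 to 7 involve a few simple computations, which can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this thesis: it assumes that the congruence coefficient refers to Tucker's congruence coefficient, and all function names and data values are hypothetical.

```python
import math


def congruence_coefficient(a, b):
    """Tucker's congruence coefficient between two loading vectors (assumed)."""
    numerator = sum(x * y for x, y in zip(a, b))
    return numerator / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


def correct_for_attenuation(values, reliabilities):
    """Requirement 6: divide each value by the square root of its reliability."""
    return [v / math.sqrt(rel) for v, rel in zip(values, reliabilities)]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rs(x, y):
    """Spearman's rank order correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data for four subtests.
g_group_a = [0.50, 0.62, 0.71, 0.80]
g_group_b = [0.48, 0.65, 0.69, 0.83]
d_scores = [0.35, 0.50, 0.66, 0.85]
reliabilities = [0.80, 0.85, 0.90, 0.92]

# Requirements 4 and 5: average the g loadings only if they are congruent.
if congruence_coefficient(g_group_a, g_group_b) > 0.95:
    g_mean = [(a + b) / 2 for a, b in zip(g_group_a, g_group_b)]
    v_g = correct_for_attenuation(g_mean, reliabilities)
    v_es = correct_for_attenuation(d_scores, reliabilities)
    print("r  =", round(pearson_r(v_g, v_es), 2))
    print("rs =", round(spearman_rs(v_g, v_es), 2))
```

The significance test for rs mentioned in requirement 7 is left out of the sketch.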


Due to our methodology and the time constraints of a master’s thesis, it was not always possible to adhere to all requirements. Since we use a considerable amount of data that only give information at the subtest score level for each group, and that either do not specify g loadings or give highly unreliable g loadings, we often used g loadings from standardization samples, the largest available sample, or loadings computed using the correlation matrix in the study; therefore criteria 3-5 were generally impossible to apply. Regardless, the g loadings from standardization samples or large samples can be expected to be highly reliable. Applying requirement 6 would be too time-consuming for a master’s thesis of this size due to the large number of different tests and samples used in this study; however, adjusting for unreliability usually increases the correlation by a substantial amount rather than reducing it (e.g. te Nijenhuis & Dragt, 2010; te Nijenhuis & Willigers, 2011). Requirement 7 does not apply in meta-analyses: the total sample size becomes so large that significance testing is less useful than significance testing on smaller sample sizes (te Nijenhuis & Dragt, 2010; te Nijenhuis & Willigers, 2011). Furthermore, this should hold true for large sets of aggregated data as well, as found in the studies on countries.

Criteria for confirmation of Spearman’s hypothesis5

Since all studies in this project are tests of Spearman’s hypothesis, all hypotheses should be confirmed or rejected using the same value of the effect size. Jensen (1998, pp. 377-378) finds an average effect size of r = .63 for Spearman’s hypothesis tested on Blacks and Whites, when the reliability coefficients were partialed out. Furthermore, for 16 independent studies he reports a mean effect size of r = .59 with an SD of .12; the latter finding might be more applicable to our studies, since we do not partial out the reliability coefficients in our studies. With a mean effect size of .59, it is important to note that some studies also found lower correlations. Cohen (1992) indicates that an r value of .5 or higher is a large effect size in social sciences. Many researchers follow this guideline and this seems like a good indicator for our studies as well. Taking both the findings of Jensen and Cohen into account, it seems appropriate to use an r value of .5 or higher as a confirmation of Spearman’s hypothesis and a lower value as a non-confirmation of Spearman’s hypothesis for all studies. This number might seem low, but it is important to keep in mind that this is the value before substantial corrections, and a full-fledged meta-analysis will often substantially increase the value of the correlations (e.g. te Nijenhuis & Dragt, 2010; te Nijenhuis & Willigers, 2011).

Choice of SD used in calculating the d difference scores6

The effect size, d, is calculated by subtracting the mean of the lower scoring group from the mean of the higher scoring group and then dividing this by the standard deviation. The selection of standard deviations is very important for calculating the correct effect size, the standardized difference between groups, and therefore a procedure was used to select which standard deviation would be best used in these calculations. Whenever possible, the standard deviations were taken from nationally representative standardization samples or norming samples for the tests used in the study. This is the preferred option since the standard deviation of a large and representative sample is closer to the population standard deviation than a small study sample and thus helps to give a more accurate indication of the effect size.
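The calculation of d described above can be sketched as follows. The means and the reference SD are hypothetical, and the function name d_score is illustrative; the reference SD stands for whichever standard deviation is selected for the test in question.

```python
def d_score(mean_high, mean_low, reference_sd):
    """Standardized difference: (higher mean - lower mean) / reference SD."""
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return (mean_high - mean_low) / reference_sd


# Hypothetical subtest means for two groups, with a reference SD of 3
# (for example, an SD taken from a standardization sample).
d = d_score(mean_high=10.0, mean_low=8.5, reference_sd=3.0)
print(f"d = {d:.2f}")
```

Using one reference SD for both groups keeps d comparable across studies that use the same test, which is the motivation for preferring SDs from large representative samples.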

However, for some tests no such samples could be obtained. When this was the case, the standard deviations of the largest group were used to compute d, since a larger group would still have more reliable standard deviations than a small group. If both groups were of equal size, the SD from the majority group was used since it is more likely to be representative of the population SD than that of the minority group.

Selecting a g loading for calculating r (d x g)7

The selection of the correct g loading for calculating r (d x g) is very important because the g loading based on a large and representative sample will be more representative of the g loading for the entire population. Whenever possible, the g loadings of the subtests in an intelligence battery were calculated using the data of a nationally representative norming or standardization sample, similar to the choice for SDs in te Nijenhuis and Metzen (2012). However, some studies use samples that are not representative of the population and other studies use non-standard tests. In these cases, a choice has to be made as to what g loadings to use. While large samples are still preferable, they might not always be a good fit for the sample in the study. For example, some datasets for calculating g might have a large sample that is not representative of the sample used in the study (e.g. a large difference in age), while other datasets have a small sample but are a much better representation of the sample used in the study.

6 Methodology taken and adapted from te Nijenhuis and Metzen (2012, p. 34).

7 Methodology based on te Nijenhuis and Dragt (2010, p. 45), te Nijenhuis and Metzen (2012, p. 34), and te Nijenhuis and Repko (2011, p. 23).

Similar to the choices from te Nijenhuis and Metzen (2012) for SDs and the choices for g from te Nijenhuis and Dragt (2010) and te Nijenhuis and Repko (2011), preference was given to using larger samples as long as they were relatively representative; however, if this was not possible or not appropriate for the data, a different solution was used. In a few cases the g loadings were taken from g loadings mentioned in the study or calculated using intercorrelations that were given in the study. This was often the case for novel or less-well-studied intelligence test batteries. In a few cases there were no statistical data to derive a g loading from. If this was the case, the written descriptions of the subtests were used to establish the broad cognitive ability that each subtest measured and the g loading of the broad ability, as given in Carroll (1993), was then used as an estimate of the g loading for the subtest.

Selecting comparison groups8

Tests of Spearman’s hypothesis require two groups for the comparison, where Whites are usually compared to other groups. It is important to note that even when there is a comparison group in the studies, the White comparison samples are sometimes relatively small and therefore reduce the reliability of the comparison. To remain consistent, we have chosen to compare most ethnic groups to the whole standardization sample of the tests whenever possible. The White groups from the standardizations were used whenever specific data were available, but this was rarely the case.

While these samples are not always fully White, the vast majority of the individuals in the sample tend to be White. Due to the large size of the standardization samples compared to the often small comparison samples in studies and the fact that the standardization samples consist mostly of Whites, these samples should be considered valid comparison groups to test Spearman’s hypothesis. So for all intents and purposes, we considered these groups to be White groups. If there was no standardization data available and there was no White comparison sample in the study or one in a comparable other study, we compared the groups to the scores for which the test is standardized. This was often a mean of 100 and an SD of 15, or a mean of 10 with an SD of 3.

Language bias in tests9

Te Nijenhuis and Dragt (2010) as well as te Nijenhuis and Willigers (2011) studied the effect of language bias in the subtests of intelligence batteries on the estimation of d. They found that subtests with a strong language component would consistently underestimate the level of g of non-native speakers because these subtests largely measured the language proficiency of the non-native test taker rather than g.

While te Nijenhuis and Willigers (2011) already explored the topic of language bias, we chose a slightly different approach using a cutoff score. First, the attention was focused on subtests with a substantial language component. A first scatter plot with g loadings on the x-axis and d scores on the y-axis was drawn and it was checked whether the subtests with a substantial language component were substantially above the regression line. These subtests were then left out for the second scatter plot, and a new regression line of d on g was drawn. The expected value of d for the possibly biased subtest was recalculated using the g loading of the subtest. The expected d score was then subtracted from the observed d score, and if the residual value was .20 or larger, the subtest was considered to be language biased. This differs slightly from the approach taken by te Nijenhuis and Willigers, who did not use an explicit cutoff score. The new approach has the advantage of offering a more explicit criterion.

If a subtest is language-biased the d between groups will be much larger than expected based on the subtest’s g loading and the data points of these subtests would be located further away from the regression line in the scatter plot. To give a more accurate estimation of r (d x g), subtests that are identified as language-biased using the aforementioned method will be removed from the estimations of r (d x g).
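The cutoff procedure can be sketched in code. This is a simplified illustration, not the exact procedure: it skips the visual inspection of the first scatter plot and simply leaves every subtest marked as language-heavy out of the regression of d on g. The subtest labels, the values, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    return intercept, slope


def flag_language_bias(subtests, cutoff=0.20):
    """Return names of language-heavy subtests whose observed d exceeds the
    d expected from their g loading by `cutoff` or more."""
    # Regression of d on g, fitted without the language-heavy subtests.
    clean = [s for s in subtests if not s["verbal"]]
    intercept, slope = fit_line([s["g"] for s in clean],
                                [s["d"] for s in clean])
    flagged = []
    for s in subtests:
        if s["verbal"]:
            expected_d = intercept + slope * s["g"]
            residual = s["d"] - expected_d
            if residual >= cutoff:
                flagged.append(s["name"])
    return flagged


# Hypothetical battery: four non-verbal and two language-heavy subtests.
battery = [
    {"name": "Subtest A", "g": 0.40, "d": 0.40, "verbal": False},
    {"name": "Subtest B", "g": 0.50, "d": 0.50, "verbal": False},
    {"name": "Subtest C", "g": 0.60, "d": 0.60, "verbal": False},
    {"name": "Subtest D", "g": 0.70, "d": 0.70, "verbal": False},
    {"name": "Verbal 1", "g": 0.80, "d": 1.20, "verbal": True},
    {"name": "Verbal 2", "g": 0.75, "d": 0.80, "verbal": True},
]

flagged = flag_language_bias(battery)
print("Flagged as language biased:", flagged)
```

Subtests returned by the function would then be left out before r (d x g) is computed.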


Basic inclusion criteria10

The primary requirement for any study to be included in this paper was having at least four tests or subtests to which the method of correlated vectors can be applied. The study cannot use test-retest data or a counterbalanced design, since this could lead to a learning effect for the sample and could interfere with our tests of Spearman’s hypothesis. Lastly, the comparison samples in the studies should not be selected on a highly g-loaded variable (e.g. referral or gifted samples compared to normal samples; Jensen, 1993).

Descriptions of tests used in studies

Several studies in this master thesis made use of the same tests, so a description of these common tests will be given here. Tests that are only found in one of the studies will be described in the section on that specific study. The general setup was inspired by te Nijenhuis and Metzen (2012).

Wechsler scales.11 The Wechsler tests are among the most commonly used IQ tests, and the different tests were made for use with different populations. All Wechsler tests are split between verbal and performance subtests, and they yield scores for Verbal, Performance, and Full-scale IQ. The reliabilities for Wechsler tests are generally very good; for example, Schuerger and Witt (1989) found test-retest reliabilities of .91 and .82 for the original WAIS and WISC, respectively, and similar results for the revisions, with reliabilities of .89 and .92 for the WAIS-R and WISC-R, respectively.

The Wechsler Preschool and Primary Scale of Intelligence (WPPSI; Wechsler, 1967) is meant for young children. It contains 11 subtests and has been standardized on children ages 4-6.5. The verbal subtests of the WPPSI are Information, Vocabulary, Arithmetic, Similarities, Comprehension, and Sentences. The performance subtests of the WPPSI are Animal House, Picture Completion, Mazes, Geometric Design, and Block Design.

The Wechsler Intelligence Scale for Children (WISC; Wechsler, 1949) was meant to test the intelligence of older children and adolescents, and was standardized on children ages 6-16. The WISC verbal subtests are Information, Comprehension, Arithmetic, Similarities, Vocabulary, and Digit Span. The performance subtests of the WISC are Picture Completion, Picture Arrangement, Block Design, Object Assembly, Coding, and Mazes. The revised WISC-R (Wechsler, 1974) had the same subtests as the WISC. The WISC-III (Wechsler, 1991) added Symbol Search as a performance subtest. Lastly, the WISC-IV (Wechsler, 2003) discarded Picture Arrangement, Object Assembly, and Mazes but added Picture Concepts, Letter-Number Sequences, Matrix Reasoning, Cancellation, and Word Reasoning.

10 Taken and adapted from te Nijenhuis and Metzen (2012, p. 32).

The Wechsler Adult Intelligence Scale (WAIS; Wechsler, 1955) was devised for adults and was standardized on subjects ages 16 and older. The verbal WAIS subtests are Information, Comprehension, Arithmetic, Similarities, and Digit Span. The performance subtests of the WAIS are Picture Completion, Block Design, Picture Arrangement, Object Assembly, and Digit Symbol. The WAIS-R (Wechsler, 1981) has the same subtests as the WAIS. Lastly, the WAIS-III (Wechsler, 1997) adds Letter-Number Sequencing, Matrix Reasoning, Symbol Search, and expands on the Digit Symbol subtest that was used in both of the other revisions.

K-ABC. The Kaufman Assessment Battery for Children (K-ABC; Kaufman & Kaufman, 1983) is an intelligence and achievement test that was standardized for use with children ages 2.5 to 12.5. The test uses different subtests for the pre-school children (ages 2:6 – 4:11) than for those of age 5 and older. The K-ABC is also a very reliable test; the split-half reliability for all scales ranges from .86 to .97 for different ages and similar results were found for test-retest reliability (Kaufman & Kaufman, 1983b, cited in Kaufman & Kamphaus, 1984). The preschooler intelligence subtests for Sequential Processing are Hand Movements, Number Recall, and Word Order. The subtests for Simultaneous Processing are Magic Windows, Face Recognition, Gestalt Closure, and Triangles. The three achievement tests are Expressive Vocabulary, Faces & Places, and Arithmetic. For ages 5 and older, the Magic Windows, Expressive Vocabulary, and Hand Movements subtests are removed. Matrix Analogies, Spatial Memory, and Photo Series are added to the Simultaneous Processing scale.

The KABC-II (Kaufman & Kaufman, 2004) is intended for usage with children ages 3 to 18, contains 16 subtests, and was partially derived from Carroll’s three-stratum theory (Reynolds, Keith, Ridley, & Patel, 2008). The data analyzed in this study only contain subjects of age six and up, so only the relevant subtests for this group will be mentioned. These subtests are Riddles, Verbal Knowledge, Expressive Vocabulary, Rover, Triangles, Block Counting, Gestalt Closure, Story Completion, Pattern Reasoning, Rebus, Rebus Delayed, Atlantis, Atlantis Delayed, Word Order, Number Recall, and Hand Movements.

Woodcock-Johnson. The Woodcock-Johnson (WJ; Woodcock, 1978) and its revisions are another set of test batteries that contain both intelligence and achievement subtests (Murray, 2007), but only the intelligence subtests are of interest here since we did not use the achievement subtests in our studies. The WJ-III manual gives a mean test-retest reliability of .94 over all subtests (McGrew & Woodcock, 2001), indicating that the Woodcock-Johnson tests are quite reliable. The WJ was standardized for usage with people ages 2-84 (Murray, 2007). The subtests for the WJ were Antonyms-Synonyms, Picture Vocabulary, Concept Formation, Analysis-Synthesis, Analogies, Memory for Sentences, Numbers Reversed, Visual-Auditory Learning, Sound Blending, Spatial Relations (Timed), and Visual Matching.

The Woodcock-Johnson - Revised (WJ-R; McGrew, Werder, & Woodcock, 1991) was standardized on a sample of people ages 2-95 (Murray, 2007). Compared to the WJ, some of the subtests changed. Antonyms-Synonyms was renamed Oral Vocabulary, Analogies and Spatial Relations (Timed) were removed, and Memory for Names, Incomplete Words, Cross-Out, Picture Recognition, and Visual Closure were added.

The WJ-III (McGrew & Woodcock, 2001) was standardized on a sample of people ages 2-98 (Murray, 2007). There were some changes compared to the WJ-R. Oral Vocabulary, Picture Vocabulary, Memory for Sentences, Memory for Names, Incomplete Words, Cross-Out, and Visual Closure were removed. Verbal Comprehension, General Information, Memory for Words, Retrieval Fluency, Auditory Attention, and Spatial Relations (untimed) were added.

DAS. The Differential Ability Scales (DAS; Elliott, 1990) is a test based on the British Ability Scales (BAS); the DAS was developed and standardized in the US. It focuses more on specific abilities than on just intelligence itself (Flanagan, Genshaft, & Harrison, 1997). The DAS has good reliability, with internal consistency reliabilities ranging from .70 to .92 for the subtests and test-retest reliabilities of .38 to .97 for the subtests (Platt, Kamphaus, Kletgen, & Gilliland, 1991). It contains 17 different intelligence subtests and was standardized on children ages 2:6 to 17:11; the subtests given depend on the age category of the subjects.


The core subtests for preschoolers are Block Building, Verbal Comprehension, Naming Vocabulary, Picture Similarities, Pattern Construction, Copying, and Early Number Concepts. The diagnostic subtests for preschoolers are Matching Letter-Like Forms, Recall of Digits, Recall of Objects, and Recognition of Pictures. The school-aged core battery consists of Word Definitions, Similarities, Matrices, Sequential and Quantitative Reasoning, Recall of Designs, and Pattern Construction. The diagnostic subtests for school-aged children are Recall of Digits, Recall of Objects, and Speed of Information Processing.

The DAS-II (Elliott, 2007) is a revision of the DAS; the test has a total of 21 subtests and is meant to be used with children ages 2:6 to 17:11 (Trundt, 2013). As with the original DAS, this version also has differences in the subtests given to preschoolers and school-aged children. The subtests used are Verbal Comprehension, Naming Vocabulary, Word Definitions, Verbal Similarities, Early Number Concepts, Picture Similarities, Matrices, Sequential and Quantitative Reasoning, Copying, Matching Letter-Like Forms, Pattern Construction, Recall of Designs, Recognition of Pictures, Recall of Objects – Immediate, Recall of Objects – Delayed, Recall – Digits Forward, Recall – Digits Backward, Recall of Sequential Order, Speed of Information Processing, Rapid Naming, and Phonological Processing.

KAIT. The Kaufman Adolescent and Adult Intelligence Test (KAIT) is a test designed to measure the broad abilities fluid and crystallized intelligence, as well as general intelligence (Kaufman & Horn, 1996). The extended KAIT battery has 10 different subtests (4 each for Gc and Gf) and the test was standardized using 2000 subjects ranging from age 11 to 85. The KAIT has three subscales: Crystallized Intelligence, Fluid Intelligence, and Delayed Recall. The subtests for the Crystallized scale are Definitions, Auditory Comprehension, Double Meanings, and Famous Faces. The subtests for the Fluid scale are Rebus Learning, Logical Steps, Mystery Code, and Memory for Block Design. The last two subtests are tests of delayed recall.

AFOQT. The Air Force Officer Qualifying Test (AFOQT; Berger, Gupta, & Berger, 1990) is a selection tool for Air Force officers, but is also used for the selection of pilots and navigators for placement in training programs (Carretta, 1997). The test was standardized on a mostly White male sample of 269,968 Air Force applicants. The reliabilities of the tests range from α = .69 to α = .92 (Skinner & Ree, 1987, cited in Carretta & Ree, 1996). The test contains 16 different subtests and these are divided among verbal tests, quantitative tests, spatial tests, interest/aptitude tests, and perceptual speed tests. The verbal subtests are Verbal Analogies, Reading Comprehension, and Word Knowledge. The quantitative tests are Arithmetic Reasoning, Data Interpretation, and Math Knowledge. The spatial tests are Mechanical Comprehension, Electrical Maze, Block Counting, Rotated Blocks, and Hidden Figures. The aircrew interest/aptitude tests include Instrument Comprehension, Aviation Information, and General Science. The perceptual speed tests are Scale Reading and Table Reading.

Project Rainbow. Sternberg (2006) studied several assessment tools which, according to the author, could be supplementary to the SAT in predicting college GPA and reduce racial bias by also using indicators of creative and practical skills as a supplement to analytical skills. The scholastic achievement test (SAT) is a standardized test, which is often taken during the high school years to determine educational opportunities. This exam is used by most universities and colleges as a predictor of college success and is often considered to be highly comparable to an IQ test. SAT-V is the verbal section of the SAT test. SAT-M is the mathematical section of the SAT test.

The idea for this supplement to the SAT in Project Rainbow was based on Sternberg’s theory of successful intelligence, which assumes that skills other than analytical intelligence are important for success in life. As such, we decided to treat all these measures as IQ subtests. The sample in the project consists of 47 Blacks, 89 Hispanics, 348 Whites, 77 Asians, 11 Pacific Islanders, 11 Native Americans, 37 describing themselves as “other”, and 157 who did not report an ethnic background. The samples of interest for this study are the Black, Hispanic, and Asian samples. Since this is a combination of tests and not a commonly standardized battery, the specific subtests used to complement the SAT will be described in detail.

Sternberg’s Triarchic Abilities Tests (STAT-Analytical, STAT-Practical, and STAT-Creative) are tests that use multiple-choice questions to measure analytical, practical, and creative skills, respectively, and are based on Sternberg’s theory of successful intelligence. The tests correlate substantially with traditional IQ tests (Sternberg, Ferrari, Clinkenbeard, & Grigorenko, 1996). For the Everyday Situational Judgment test, participants watched seven brief video vignettes with problems encountered in everyday situations and had to choose one of six written options as the most appropriate response for the situation. The Common Sense Questionnaire contained 15 vignettes that capture business-related problems; participants had to select the most appropriate response to each vignette from eight written options. The College Life Questionnaire contained 15 written vignettes that were supposed to capture everyday college-related problems; participants had to select the most appropriate response to the situation from several written options. Cartoons consisted of five cartoons without captions taken from the New Yorker, of which the participant had to select three and then fill in the captions; these were then rated for creativity on a 5-point scale by trained judges. For Written Stories, participants had to select two story titles from six possible story titles and then had to write a short story appropriate for each of the two titles. They were given 15 minutes to write each story, and the stories were rated for originality, complexity, emotional evocativeness, and descriptiveness on 5-point scales by trained judges. For Oral Stories, participants were presented with five sheets of paper with 11 to 13 images each that were linked by a common theme; after choosing one of the pages the participants were given 15 minutes to formulate a short story and dictate it using a recording device. The same thing was then done for a second oral story, and the stories were rated for originality, complexity, emotional evocativeness, and descriptiveness on 5-point scales by trained judges.

ASVAB. The Armed Services Vocational Aptitude Battery (ASVAB; U.S. Department of Defense, 1982) is used to select military applicants for enlistment in the army and their first job within the army (Ree & Carretta, 1995). The test has 10 subtests: General Science, Arithmetic Reasoning, Word Knowledge, Paragraph Comprehension, Numerical Operations, Coding Speed, Auto and Shop Knowledge, Mathematics Knowledge, Mechanical Comprehension, and Electronic Information.

UNIT. The UNIT (Bracken & McCallum, 1998) is a non-verbal test of intelligence that uses hand gestures instead of verbal communication for both the examiner and the examinee. The test is norm referenced (Kane, 2008) and has been standardized on a sample of 2,100 children between the ages of 5 years and 17 years, 11 months, and 30 days. It was stratified based on 1995 U.S. Census data (Wilhoit & McCallum, 2002). The sample used in the study by Kane (2008) consisted of 77 White, 77 Black, and 77 Hispanic participants. The UNIT consists of six different subtests (Kane, 2008): Symbolic Memory presents examinees with colored universally recognizable stimuli, and subsequently the examinee has to recall and reproduce the presented sequence. Cube Design has examinees copy a three-dimensional geometric design made from white and green cubes. Spatial Memory has examinees recall and recreate a pattern of black and green chips on a response grid. Analogic Reasoning has the examinees complete geometric and symbolic analogies or patterns. Object Memory shows the examinees several common objects which they have to recall and identify from a larger array of objects after a delay. The subtest Mazes has examinees draw a path through a collection of mazes, from the center to the exit of the maze.

Leiter-R. Flemmer and Roid (1997) report the scores of 258 White and 62 Hispanic adolescents on the Reasoning and Visualization subtests of the Leiter International Performance Scale-Revised (Leiter-R; Roid & Miller, 1997). According to Flemmer and Roid (1997), the main advantage of the Leiter-R is that the tests are entirely non-verbal, both in responding and in giving directions to the participants. The subjects are only required to give one of three possible responses, namely pointing, aligning objects, or aligning cards with pictures on them. Furthermore, the directions themselves are also given non-verbally, using gestures and signals. Only two of the subscales were used in these studies. The Reasoning subscale contains a total of four subtests: Classification, Sequential Order, Repeated Patterns, and Design Analogies. The six Visualization subtests are Matching, Figure-Ground, Paper Folding, Figure Rotation, Picture Context, and Form Completion. The Matching, Classification, and Picture Context subtests were not used due to a lack of appropriate g loadings.

CAS. The Cognitive Assessment System (CAS) is a cognitive-processing test based on the Planning, Attention, Simultaneous, and Successive intelligence (PASS) theory (Naglieri, Rojahn, & Matto, 2006). The test itself was standardized on 2,200 children and adolescents between the ages of 5 and 17. The number of subjects varied per subtest, and we will use the smallest N for calculations. The non-Hispanic group ranged from 1,808 to 1,954 participants, and the Hispanic group ranged from 217 to 244 subjects. The subtests are divided into the four different subscales according to the PASS theory, namely Planning, Simultaneous, Attention, and Successive. Planning is about cognitive control, Attention is selective and focused cognitive activity over time, Simultaneous is about integrating stimuli into greater wholes, and Successive Processing is about cognitive activity in a serial order of progression (Naglieri, Rojahn, & Matto, 2007). The Planning subtests are Matching Numbers, Planned Codes, and Planned Connections. The Simultaneous subtests are Nonverbal Matrices, Verbal-Spatial Relations, and Figure Memory. The Attention subtests are Expressive Attention, Number Detection, and Receptive Attention. The Successive subtests are Word Series, Sentence Repetition, Sentence Questions, and Successive Speech Rate.


Study 1

Spearman’s Hypothesis tested on Twin Data using a large number of subtests

Robert Osborne (1980) published his seminal book entitled Twins: Black and White, which describes an extensive twin study with twins from two different groups. The goals of this book were to make a substantial amount of data available for researchers at that time, to analyze the difference between Blacks and Whites in mental abilities, and to look at the heritability of intelligence. The study is impressive since it reported data on no fewer than 132 different variables such as race, gender, head size, blood type, intelligence, and personality. The study enables us to carry out a unique test of Spearman’s hypothesis for several reasons. Firstly, the study uses a large sample of 992 White and Black twins, which means there will be only a limited amount of sampling error compared to the relatively small samples often found in tests of Spearman’s hypothesis. Secondly, the study consists solely of twins; even in this era twin data on this scale are relatively rare. But what really makes this study interesting, and sets it apart from most studies used for testing Spearman’s hypothesis, is the large number of subtests used. The study contained no fewer than 31 different standardized intelligence subtests commonly used at the time. This is impressive since it is an even greater number of subtests than the 24 subtests reported in a study by Naglieri and Jensen (1987), which had been the largest number of subtests employed in a test of Spearman’s hypothesis until now. For these reasons, the data from this study can be used for a high-quality test of Spearman’s hypothesis with fewer artifacts from sampling error and more variance of g in the available subtests.

This study is an extension of previous findings of Black/White differences, using a large sample and more subtests to improve the accuracy and reliability of the test of Spearman’s hypothesis. As with previous tests of Spearman’s hypothesis on Blacks and Whites (e.g. Hartmann, Kruuse, & Nyborg, 2007; Lynn & Owen, 1994; Naglieri & Jensen, 1987; Nyborg & Jensen, 2000; Rushton & Jensen, 2003), we believe that this study will give a strong confirmation of Spearman’s hypothesis.

Hypothesis 1: There will be a strongly positive correlation between the Black/White group differences in the Osborne data and the subtests’ g loadings.


Method

Participants

The sample for the Twin Study was drawn from public and private schools in Louisville, Kentucky, and Jefferson County, Kentucky. A total of 496 pairs of twins were studied, of which 427 pairs were like-gendered and 50 pairs were unlike-gendered. The data also contained 19 pairs who were not used in our analysis, since they belonged to a different part of the Osborne study. There were a total of 692 White subjects and 300 Black subjects in the data used for this analysis. The age of the subjects ranged from 12 to 20.

Instruments

Since the Osborne study used a large combination of different intelligence tests, all the specific subtests will be described (Osborne, 1980):

The Calendar Test consists of 50 different statements about days of the week on which the participant can indicate whether the statement is true or false. For example: “If today is Sunday, then tomorrow will be Monday”.

The Cube Comparison test has participants decide whether two drawings of a cube are of the same cube or of different cubes. This is done by using the letters on the sides of the cube and mental rotation of the cube to determine if they are the same or different.

The Simple Arithmetic test is a speeded test and consists of seven parts for which participants are given 2 minutes per part to complete all the questions. The first part consists of 15 questions, the second part of 20 questions and all other parts consist of 25 questions; all questions have five possible answers. The questions consist of multiplication, addition, division, and subtraction problems, which become progressively easier in later parts.

The Wide Range Vocabulary test gives participants a word and the choice of five synonyms ranging from very easy to very difficult.

The Surface Development test consists of participants being shown drawings of a piece of paper with dotted lines where it can be folded and an example of the folded paper to the right. The lines on the unfolded piece of paper have numbers on them and the edges on the folded object are marked with letters. The participants have to match the numbers from the unfolded drawing with the letters of the folded drawing.

The Form Board test consists of a drawing of an outline figure and five black pieces. The participant then has to indicate which pieces need to be combined to form the outline figure.

The Self-judging Vocabulary test consists of two parts. In the first part participants have to mark each of 128 words with A (I know this word and can explain it to someone unfamiliar with it), B (I am doubtful as to what this word means), or C (I have never seen this word before and have no idea what it means). The second part consists of the first 80 words from the earlier 128, of which the participants have to give the meaning, choosing from six possible answers.

In the Paper Folding test, participants are shown drawings of two or three folds in a square piece of paper. They then have to indicate what the unfolded square paper would look like out of five possible answers.

In the Object Aperture test, participants are shown a drawing of a three dimensional object and drawings of five different apertures. The participants have to select which of the apertures the (rotated) three dimensional object would actually fit through.

The Identical Pictures test has participants compare an example figure to five other figures of which they have to select the picture that is identical to the example figure.

The Spelling Achievement test has participants write a word that is pronounced by the examiner; the examiner also uses the word in a sentence.

The Maze test has participants draw a continuous line from the entrance of a maze to the exit of the maze without making any mistake or hitting a dead end.

The Logical Reasoning test presents the examinee with two statements and they have to select one of four conclusions that can be derived from the statements.

In the Cancellation test, participants are given a series of dotted images with three or more dots and asked to draw a vertical line through each group of five dots and a horizontal line through groups of four dots.

The Social Perception test has examinees look at sets of four different drawings, out of which one is different from the rest. The examinee has to indicate which drawing is not like the three others.

In Ship Destination, participants are given a diagram – the diagram consists of a grid that connects letters that are encased in a circle. Each circle on the diagram represents a point on the ocean; the line between each circle represents two miles of distance. Participants are then asked a series of questions about the diagram and movement on the grid.

In the Card Rotation test participants are given a model on the left and eight items on the right of the model. Participants are then asked to decide which of the items on the right show the same side as the model on the left and which of the items are mirrored images.

The Mooney's Faces test features images of human faces. Participants are then asked to determine the sex of the person in the image and the direction in which the person was looking.

Primary Mental Abilities measures five different mental abilities, namely Verbal Meaning, Number Facility, Reasoning, Perceptual Speed, and Spatial Relations. The score is given in a single quotient score for the Osborne data.

The Culture Fair Intelligence Test (CFIT) is a test of children’s ability to learn new things in the future rather than a test of what the child knows at the time of testing.

Newcastle Spatial test subtest 1 has 12 solid objects on one page and on the opposite page there are 10 drawing sets where the end views and middle sections of a solid object are shown. The task of this test is to match each of the 10 drawing sets to one of 12 solid objects on the other page.

For Newcastle Spatial test subtest 2 the participants are shown a solid model and then have to select the top view of the model from four different top view drawings.

In Newcastle Spatial test subtest 3 the participants are given a solid model of a cube with shaded parts and a flat pattern of three sides of a cube. Participants are then asked to draw lines on the pattern to indicate where they would cut to remove the shaded parts on the model.
