RUNNING HEAD: Cross-Cultural Equivalence of an Inductive Reasoning Test

Inductive Reasoning in Zambia, Turkey, and The Netherlands: Establishing Cross-Cultural Equivalence

Fons J. R. van de Vijver Tilburg University

The Netherlands

Mailing address:

Fons J. R. van de Vijver
Department of Psychology
Tilburg University
PO Box 90153
5000 LE Tilburg
The Netherlands
Phone: +31 13 466 2528
Fax: +31 13 466 2370
E-mail: fons.vandevijver@kub.nl


Abstract

Tasks of inductive reasoning and its component processes were administered to 704 Zambian, 877 Turkish, and 632 Dutch pupils from the highest two grades of primary and the lowest two grades of secondary school. All items were constructed using item-generating rules. Three types of equivalence were examined: structural equivalence (Does an instrument measure the same psychological concept in each country?), measurement unit equivalence (Do the scales have the same metric in each country?), and full score equivalence (full comparability of scores across countries). Structural and measurement unit equivalence were examined in two ways. First, a MIMIC (multiple indicators, multiple causes) structural equation model was fitted, with tasks for component processes as input and inductive reasoning tasks as output. Second, using a linear logistic model, the relationship between item difficulties and the difficulties of their constituent item-generating rules was examined in each country. Both analyses of equivalence provided strong evidence for structural equivalence, but only partial evidence for measurement unit equivalence.


Equivalence of a Measure of Inductive Reasoning in Zambia, Turkey, and The Netherlands


The validity of cross-cultural comparisons can be jeopardized by bias; examples of bias sources are country differences in stimulus familiarity (Serpell, 1979) and item translations (Ellis, 1990; Ellis, Becker, & Kimmel, 1993). Bias refers to the presence of score differences that do not reflect differences in the target construct. Much research has been reported on fair test use; the question addressed there is whether a test predicts an external criterion such as job success equally well in different ethnic, age, or gender groups (e.g., Hunter, Schmidt, & Hunter, 1979). The present study does not address bias in test use but bias in test meaning; in other words, no reference is made here to social bias, unfairness, and differential predictive validity. The present study focuses on the question of whether the same score, obtained in different cultural groups, has the same meaning across these groups. Such scores are unbiased. Two types of approaches have been developed to deal with bias in cognitive tests. The first type, known under various labels such as culture-free, culture-fair, and culture-reduced testing (Jensen, 1980), attempts to eliminate or minimize the differential influence of cultural factors, like education, by adapting instrument features that may induce unwanted score differences across countries. Raven's Matrices Tests are often considered to exemplify this approach (e.g., Jensen, 1980). Despite the obvious importance of good test design, the approach has come under critical scrutiny; it has been argued that culture and test performance are so inextricably linked that a culture-free test does not exist (Frijda & Jahoda, 1966; Greenfield, 1997).


1998; McCrae & Costa, 1997), simultaneous components analysis (Zuckerman, Kuhlman, Thornquist, & Kiers, 1991), item bias statistics (Holland & Wainer, 1993), and structural equation modeling (Little, 1997). It is remarkable that a priori and a posteriori approaches (test adaptations and statistical techniques, respectively) have almost never been combined, despite their common aim, mutual relevance, and complementarity.

The present paper attempts to integrate a priori and a posteriori approaches and takes equivalence as a starting point. Equivalence refers to the similarity of psychological meaning across cultural groups (i.e., the absence of bias). Three hierarchical types of equivalence can be envisaged (Van de Vijver & Leung, 1997a, b). At the lowest level the issue of similarity of a psychological construct, as measured by a test in different cultures, is addressed. An instrument shows structural (also called functional) equivalence if it measures the same construct in each cultural population studied. There is no claim that scores or measurement units are comparable across cultures. In fact, instruments may be different across


across cultural groups. The third and highest level is called full score equivalence and refers to identity of both scale units and origins. Only in the latter case, scores can be compared both within and across cultures using techniques like t tests and analyses of (co)variance.

Full score equivalence assumes the complete absence of bias in the measurement. Score differences between and within cultures are entirely due to inductive reasoning. There are no fully adequate statistical tests of full score equivalence, but some go a long way. The first is indirect and involves the use of additional variables to (dis)confirm a particular interpretation of cross-cultural score differences (Poortinga & Van de Vijver, 1987). Suppose that Raven’s Standard Progressive Matrices Test is administered to adults in the U.S.A. and to illiterate Bushmen. It may well be that the test provides a good picture of inductive reasoning in both cultures. However, it is likely that differences between the countries are influenced by educational differences between the groups. Score differences within and across groups have a different meaning in this case. A measure of test-wiseness or previous test exposure, administered to all participants, can be used to (dis)confirm that cross-cultural score differences are due to bias. Full score


often based on item response theory, which is applicable when a small number of cultures have been studied, is the examination of differential item functioning or item bias (e.g., Holland & Wainer, 1993; Van der Linden & Hambleton, 1997). As long as the sources of bias (such as education) affect all items in a more or less uniform way, no statistical techniques will indicate that between-group differences are of a different nature than within-group differences. Only if bias affects some items can the proposed techniques identify it. In sum, the establishment of full score equivalence is an intricate issue. In many empirical studies dealing with mental tests, this form of equivalence is merely assumed. As a consequence, statements about the size of cross-cultural score differences often have an unknown validity. Sternberg and Kaufman’s (1998) observation that we know that there are population differences in human abilities, but that their nature is elusive, is very pertinent.

In line with current thinking in validity theory (Embretson, 1983; Messick, 1988), the present study combines test design and statistical analyses to deal with bias (and equivalence). A distinction is made between internal and external procedures to establish equivalence, depending on whether the procedure is based on information derived from the scrutinized test itself (internal) or from additional tests (external).


countries. Three components are presumably relevant in the types of inductive reasoning tasks studied here (Sternberg, 1977). The first is classification: treating stimuli as exemplars of higher order concepts (e.g., the set CDEF as four


Method

Participants

An important consideration in the choice of countries was the presumed strong influence of schooling on test performance (Van de Vijver, 1997); the expenditure per head on education, a proxy for school quality, is strongly influenced by national affluence. Countries with considerable differences in school systems and educational expenditures per child were chosen. Furthermore, inclusion of at least three different cultural groups decreases the number of alternative hypotheses to explain cross-cultural differences (Campbell & Naroll, 1972). Zambia, Turkey, and the Netherlands show considerable differences in educational systems and GDP (per capita); the GDP figures per capita for 1995 were US$ 382, 2,814, and 25,635 for the three countries, respectively. School life expectancy of the three countries is 7.3, 9.7, and 15.5 years (United Nations, 1999). The choice of Zambia was also made because of its lingua franca in school; English is the school language in Zambia, which was convenient for developing and administering tasks.


21%; Bemba, 13%; and Nyanja, 11%); the Turkish group was 99% Turkish, while in the Dutch group 93% were Dutch, 2% Moroccan, and 2% Turkish.

Primary schooling in Turkey has five grades; pupils from the fifth grade of primary school and the first three grades of secondary school were involved. Secondary education is markedly different in the three countries. In Zambia a nation-wide examination (with tests for reasoning and school achievement) at the end of the last grade of primary school, Grade 7, is utilized to select pupils for secondary school. After Grade 7 less than 20% of the pupils continue their education in either public or private secondary schools. Admittance to public schools is conditional on the score at the Grade 7 Examination. Cutoff scores vary per region and depend on the number of places available in secondary schools. In urban areas there are some private schools; admittance to these schools usually does not depend on examination results, but is mainly dependent on availability of places as well as the ability and willingness of parents to pay school fees.

Participants both from public and private schools were included in our study. The tremendous dropout at the end of Grade 7 has undoubtedly adversely affected the generalizability of the data to the Zambian population at large and it also jeopardized the comparability of the age cohorts, both within Zambia and across the three


Insert Table 1 about here

Sample sizes are presented in Table 1; of the participants recruited, 56% came from urban and 44% from rural schools; 46% were female, 54% male.

Instruments

The battery consisted of eight tasks, four with figures and four with letters as stimuli. Each of these two stimulus modes had the same composition: a task of inductive reasoning and three tasks of skill components that are assumed to constitute important aspects of inductive reasoning. The first is rule classification, called encoding in Sternberg’s (1977) model of analogical reasoning. The second is rule generating, a combination of inference and mapping. The third is rule testing, a combination of comparing and justification.

All tasks are based on item-generating rules, schematically presented in Appendix A. All figure tasks are based on the following three item-generating rules:

(a) The same number of figure elements is added to subsequent figures in a period (periods consist of either circles or squares, but never of both. A period defines the number of figures that belong together. Examples of items of all tasks, in which the item-generating rules are illustrated, can be found in Appendix B).

(b) The same number of elements is subtracted from subsequent figures in a period.

(c) The same number of elements is, alternatingly, added to and subtracted from subsequent figures in a period.


applied to all figure tasks. First, the number of figures in a period varies from two to four. Second, the number of elements that are added to or subtracted from successive figures of a period varies from one to three. Whenever possible, all facet levels were crossed. However, for some combinations of facet levels no item could be generated. For example, as each figure can have (in addition to a circle or a square that are present in all items) only five elements (namely a hat, arrow, dot, line, or bow), it is impossible to construct an item with two or three elements added to each of four figures in a period.
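
The crossing of facet levels and the dropping of impossible combinations described above can be illustrated with a short sketch. It is a simplified illustration only, assuming that feasibility is limited by the five available figure elements; the rule labels, the feasibility check for the alternating rule, and all names are assumptions introduced here, not the authors' construction procedure.

```python
from itertools import product

RULES = ("add", "subtract", "alternate")   # the three figure item-generating rules
PERIOD_LENGTHS = (2, 3, 4)                 # number of figures in a period
STEP_SIZES = (1, 2, 3)                     # elements added/subtracted per step
MAX_ELEMENTS = 5                           # hat, arrow, dot, line, bow

def feasible(rule: str, period_len: int, step: int) -> bool:
    """A combination is usable only if no figure needs more than five elements."""
    if rule == "alternate":
        peak = step                          # assumption: additions and subtractions cancel
    else:
        peak = step * (period_len - 1)       # one figure carries all added (or removed) elements
    return peak <= MAX_ELEMENTS

combos = [c for c in product(RULES, PERIOD_LENGTHS, STEP_SIZES) if feasible(*c)]
print(len(combos), "of", 27, "common facet combinations can be realized")
# e.g., ("add", 4, 2) is dropped: two elements added to each of four figures needs 6 > 5.
```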

Inductive Reasoning Figures is a task of 30 items. Each item has five rows of 12 figures, of which the first eight are identical. One of the rows has been composed according to a rule while in the other rows the rule has not been applied consistently. The pupil has to mark the correct row.

Besides the common facets, two additional facets were used to generate the items of Inductive Reasoning Figures. First, the figure elements added or subtracted are either the same or different across periods. In the example of Appendix B there is a constant variation because in each period a dot is added first, followed by a dash and a hat. Second, periods do or do not repeat one another, meaning that the first figures of each period are identical (except for a possible swap of circle and square).

Each of the 36 items of Rule Classification Figures consists of eight figures. Below these figures the three item-generating rules were printed. In addition, the alternative "None of the rules applies" has been added. The pupil had to indicate which of the four alternatives applies to the eight figures above.


presence or absence of periodicity cues. These cues refer to the presence of both circles and squares in an item (as illustrated in the first item of Appendix B) or the presence of either squares or circles (if all circles of the example were changed into squares, no periodicity cues would be present).

Whereas in Inductive Reasoning Figures the number of different elements of a figure could be either one, two, or three, Rule Classification Figures has another level of this facet, referring to a variable number of elements. For example, in the first period one element is added to subsequent figures and in the second period two elements.

Each of the 36 items of Rule Generating Figures consists of a set of six figures under which three lines with the numbers 1 to 6 are printed. In each item one, two, or three triplets (i.e., groups of three figures) have been composed according to one of the item-generating rules. Any of the six figures of an item can be part of one, two, or three triplets. Pupils were asked to indicate all triplets that constitute valid periods of figures. No information about the number of valid triplets in each particular item was given. The total number of triplets was 63. In the data analysis these were treated as separate, dichotomously scored items.

Two facets, in addition to the common ones, were included. First, periodicity cues are either present or absent; the facet has the same meaning as in Rule Classification Figures. Second, the number of valid triplets is one, two, or three.

A verbal specification is given at the top of each item of Rule Testing Figures. In this specification three characteristics of the item are given, namely the


specification four rows of eight figures have been drawn. One of the rows of eight figures has been composed completely according to the specification. In some items none of the four rows has been composed according to the specification. In this case a fifth response alternative, "None of the rows has been composed according to the specification" applies. This facet is labeled “None/one of the rules applies”. The pupil has to mark the correct answer.

The facets and facet levels of Rule Testing Figures and Rule Classification Figures were identical. In addition, the facet “Rows (do not) repeat each other” is included in the former task. In some items the rows are fairly similar to each other (except for minor variations that were essential for the solution), while in other items each row has a completely different set of eight figures.

The letter tasks were based on five item-generating rules:

(a) Each group of letters has the same number of vowels. The vowels used in the task are A, E, I, O, and U. As the status of the letter Y can easily create confusion in English and Dutch where it can be both a consonant and a vowel, the letter was never used in connection to the first item-generating rule;

(b) Each group of letters has an equal number of identical letters that are the same across groups (e.g., BBBB BBBB);

(c) Each group of letters has an equal number of identical letters that are not the same across groups (e.g., GGGG LLLL);


(e) Each group of letters has a number of letters that appear a fixed number (i.e., 1, 2, 3, or 4) of positions before each other in the alphabet.

A second facet refers to the number of letters to which the rule applies. The number could vary from 1 to 6. All items of the letter tasks are based on a combination of the two facets described (i.e., item rule and number of letters). Like in the figure tasks, not all combinations of the facets are possible; for example, applications of the fourth and fifth rule assume an item rule that is based on at least two letters in a group.
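
To make the two letter facets concrete, the toy sketch below assembles groups that obey the first rule with a fixed number of vowels per group. It is an illustration only; the helper name and sampling scheme are assumptions and do not reproduce the study's actual item construction.

```python
import random
import string

VOWELS = set("AEIOU")                      # Y is never used with this rule
CONSONANTS = [c for c in string.ascii_uppercase if c not in VOWELS and c != "Y"]

def letter_group(n_vowels: int, group_len: int = 6) -> str:
    """One group of letters containing exactly n_vowels vowels (rule a)."""
    letters = random.sample(sorted(VOWELS), n_vowels) + \
              random.sample(CONSONANTS, group_len - n_vowels)
    random.shuffle(letters)
    return "".join(letters)

# A toy item in the style of Inductive Reasoning Letters: four groups share the
# facet combination (rule a, two vowels); the fifth group is the odd one out.
item = [letter_group(2) for _ in range(4)] + [letter_group(3)]
random.shuffle(item)
print(" ".join(item))
```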

Inductive Reasoning Letters bears resemblance to the Letter Sets Test in the ETS-Kit of Factor-Referenced Tests (Ekstrom, French, & Harman, 1976). Each of the 45 items consists of five groups of six letters. Four out of these five groups are based on the same combination of the two facets (e.g., they all have two vowels). The pupil has to mark the odd one out.

Each of the 36 items of Rule Classification Letters consists of three groups of six letters. Most items have been constructed according to some combination of the two facets, while in some items the combination has not been used consistently. The five item-generating rules are printed under the groups of letters. The pupil has to indicate which item-generating rule underlies the item. If the rule has not been applied consistently, the sixth response alternative, "None of the rules applies," is the correct answer. Like in Rule Classification Figures, the facet “None/one of the rules applies” is included.


were not informed about the exact number of triplets in each item. The total number of triplets is 90, which are treated as separate items in the data analysis.

Like in Rule Generating Figures, a facet about the number of valid triplets (ranging here from 1 to 5) applies to the items, in addition to the common facets.

Rule Testing Letters consists of 36 items. Each item starts with a verbal specification. In the specification two characteristics of the item are given, namely the item-generating rule and the number of letters pertaining to the rule (e.g., "In each group of letters there are 3 vowels"). Below this specification four rows of three groups of six letters are printed. In most items one of the rows has been composed according to the specification. The pupil has to find this row. In some items none of the four rows has been composed according to the specification. In this case a fifth response alternative, "None of the rows has been composed according to the specification above the item," applies. The pupil has to indicate which of the five alternatives applies.

The task had three facets, besides the common ones. Like in Rule Classification Letters, there is a facet indicating whether or not one of the alternatives follows the rule. Also, like in Rule Testing Figures, there is a facet indicating whether or not rows repeat each other.

The Turkish and Roman alphabets are not entirely identical. The (Roman) letters Q, W, and X were not used here since they are uncommon in Turkish. The presence of specifically Turkish letters, such as Ç, Ö, and Ü, necessitated the introduction of small changes in the stimulus material (e.g., the sequence ABCD in the Zambian and Dutch stimulus materials was changed into ABCÇ in Turkish).


The tasks were administered without time limit to all pupils of a class; however, in the rural areas in Zambia the number of desks available was often insufficient for all pupils to work simultaneously, as each pupil had to have his or her own test booklet and answer sheet. The experimenter then randomly selected a number of participants.

The tasks were administered by local testers. The number of testers in Zambia, Turkey, and the Netherlands was two, three, and two (the author being one of them), respectively. Five were psychologists and three were experienced psychological assistants. All testers received a one-day training in the administration of the tasks.

In Zambia English was used in the administration. A supplementary sheet in Nyanja, the main language of the Lusaka region, was included in the test booklet that explained the item-generating rules. Turkish was the testing language in the Turkish group and Dutch in the Dutch group.


were given on one day (either the first or the second testing day) and the remaining on the other one.

Each of the eight instruments started with a one-page description of the task, which was read aloud by the experimenter to the pupils; item-generating rules of the stimulus mode were specified. This instruction was included in the pupils’ test booklets. Examples were then presented of each of the item-generating rules; explicit reference was made to which rule applied. Finally, the pupils were asked to answer a number of exercises that again covered all item-generating rules. After this instruction, the pupils were asked to answer the actual items. In each figure task the serial position of each figure was printed on top of the item in order to minimize the computational load of the task. The alphabet was printed at the top of each page of the letter tasks, with the vowels underlined. It was indicated to the pupils that they were allowed to look back at the instructions and examples (e.g., to consult the item-generating rules). Experience showed that this was infrequently done, probably because all tasks of a single stimulus mode utilized the same rules.

Results

The section begins with a description of preliminary analyses, followed by the main analyses. Per analysis, the hypothesis, statistical procedure and findings are reported.

Preliminary Analyses


Figures .89 (.84-.95), Rule Testing Figures .85 (.81-.89), Inductive Reasoning Letters .79 (.69-.88), Rule Classification Letters .83 (.73-.90), Rule Generating Letters .93 (.90-.95), and Rule Testing Letters .78 (.63-.85). Overall, the internal consistencies yielded adequate values. Country differences were examined in a procedure described by Hakstian and Whalen (1976). Data of all grades were combined. The M statistic, which follows a chi-square distribution with two degrees of freedom, was significant for Inductive Reasoning Figures (M = 64.92, p < .001), Rule Classification Figures (M = 10.57, p < .01), Rule Generating Figures (M = 34.06, p < .001), Rule Classification Letters (M = 11.57, p < .01), and Rule Testing Letters (M = 12.40, p < .01). The Dutch group tended to have lower internal consistencies (a possible explanation is given later).
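
The Hakstian-Whalen procedure compares coefficient alpha across independent groups. As a minimal sketch of the alpha values it operates on (the M statistic itself is not reproduced, and the function name and the assumed persons-by-items score matrices are mine):

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha for a persons-by-items matrix of dichotomous item scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the total score
    return n_items / (n_items - 1) * (1 - item_var / total_var)

# Hypothetical use, one matrix per country with all grades combined:
# for country, matrix in {"Zambia": zm, "Turkey": tr, "Netherlands": nl}.items():
#     print(country, round(cronbach_alpha(matrix), 2))
```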

Insert Tables 2 and 3 about here

The average proportions of correctly solved items per country, grade, and task are given in Table 2. Differences in average scores were tested in a


tasks, usually explaining more than 10%. Zambian pupils tended to show the lowest scores and Dutch pupils the highest scores. Grade differences were as expected; as can be confirmed in Table 2, scores increased with grade. The effect sizes were substantial, usually larger than 10%, and highly significant for all tasks (p < .001). Gender differences were small; significant differences were found for Rule Testing Figures and Inductive Reasoning Letters (girls scored higher on both tasks), but gender differences did not explain more than 1% on any task. The country by grade interaction was significant in all analyses, explaining between 1 and 5%. As can be seen in Table 2, score increases with grade tended to be smaller in the Netherlands than in the other two countries. Country differences in scores were large in all grades but tended to become smaller with age. These results are in line with a meta-analysis (Van de Vijver, 1997) in which in the age range examined here, there was no increase of cross-cultural score differences with age (contrary to what would be predicted from Jensen’s, 1977, cumulative deficit hypothesis). Other interactions were usually smaller and often not significant.

Structural Equivalence in Internal Procedure

Hypothesis. The first hypothesis addresses equivalence in internal procedures by examining the decomposition of the item difficulties. The hypothesis states that facet levels provide an adequate decomposition of the item difficulties of each task in each country (Hypothesis 1a). See Table 4 for an overview of the hypotheses and their tests.


model holds that the probability that a subject k (k = 1, …, K) responds correctly to item i is given by

$\exp(\theta_k - \beta_i)/[1 + \exp(\theta_k - \beta_i)]$, (1)

in which $\theta_k$ represents the person’s ability and $\beta_i$ the item difficulty. An item is represented by only one parameter, namely its difficulty (unlike some other models in item response theory in which each item also has a discrimination parameter, sometimes in addition to a pseudo-guessing parameter). A sufficient statistic for estimating a person’s ability is the total number of correctly solved items on the task. Analogously, the number of correct responses on an item provides a sufficient statistic for estimating the item difficulty. For our present purposes the main interest is in item parameters.

The LLM imposes a constraint on the item parameter by specifying that the item difficulty is the sum of an intercept $\mu$ (that is irrelevant here) and a sum of underlying facet level difficulties, $\eta_j$:

$\beta_i = \mu + \sum_j q_{ij} \eta_j$ (2)

The second step aims at estimating the facet level difficulties ($\eta_j$). Suppose that the item is “BBBBNM BBBBKJ BBBBHJ BBFTHG BBBBHN”. In terms of the facets, the item can be classified as involving (a) four letters (facet: number of letters); (b) equal letters within and across groups of letters (facet: item rule). The above model


facet level difficulties (namely the difficulty parameter of items dealing with four letters and the difficulty parameter of items dealing with equal letters within and across groups of letters), and a residual component.

The matrix Q (with elements $q_{ij}$) defines the independent variable; the matrix has m rows (the number of items of the task) and n columns (the number of independent facet levels of the task). Entries of the Q matrix are zero or one depending on whether the facet level is absent or present in the item (interactions of facets were not examined). In order to guarantee uniqueness of the parameter estimates in the LLM, linear dependencies in the design matrix were removed by leaving the first level of each facet out of the design matrix. This (arbitrary) choice implied that the first level of each facet has a difficulty level of zero and that the size and significance of other facet levels should be interpreted relative to this “anchor.”

The sufficient statistic for estimating the basic parameters is the number of correct answers at the items that make up the facet level. As a consequence, there will be a perfect rank order between this number of correct answers and $\eta_j$. Various procedures have been developed to estimate the basic parameters. In the present study conditional maximum likelihood estimation was used (details of this computationally rather involved procedure are given by Fischer, 1974, 1995). An important property of the LLM is the sample independence of its parameters; estimates of the item difficulty and the basic parameters are not influenced by the overall ability level of the pupils. This property is attractive here because it allows for their estimation, even when average scores of cultural groups differ.


second step the parameters of equation (2) are estimated. The item parameters are used in the evaluation of the fit of the model. The fit of an LLM can be evaluated in various ways. First, a likelihood ratio test can be computed, comparing the likelihood of the (unrestricted) Rasch model to the (restricted) LLM. The statistic is of limited value here. The ratio is affected by guessing (Van de Vijver, 1986). Because all tasks employed a multiple-choice format, it is unrealistic to assume that a Rasch model would hold. The usage of an LLM may seem questionable here because of the occurrence of guessing (pupils were instructed to answer all items). However, Van de Vijver (1986) has shown that guessing gives rise to a reduction of the variance of the estimated person and item parameters but correlations of both estimated parameters with their true values are hardly affected. A useful heuristic to evaluate the degree of fit of the LLM is provided by the correlation between the Rasch item parameters of the first step of the analysis and the item parameters of the second step, reconstructed by means of the design matrix. It amounts to correlating the item parameters of the first step ($\beta_i$, the “unfaceted item difficulties”) with the item parameters of the second step, $\beta_i^* = \sum_j q_{ij} \eta_j$ (the “faceted item difficulties”). The latter vector gives the item parameters estimated on the basis of the estimated facet level difficulties. Higher correlations point to a better approximation of item level difficulties by facet level difficulties and hence to a better modelability of inductive reasoning.
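
The fit heuristic can be sketched as follows. Note that the study estimated the $\eta$ parameters by conditional maximum likelihood; the sketch below substitutes a simple least-squares decomposition of the Rasch item difficulties, so it illustrates the correlation heuristic rather than the exact estimation procedure, and all names are assumptions.

```python
import numpy as np

def llm_fit_correlation(beta: np.ndarray, Q: np.ndarray) -> float:
    """Correlate unfaceted Rasch item difficulties (step 1) with the item
    difficulties reconstructed from facet-level parameters (step 2).

    Simplification: eta is obtained by ordinary least squares instead of the
    conditional maximum likelihood estimation used in the study."""
    X = np.column_stack([np.ones(len(beta)), Q])     # intercept mu plus facet columns
    coef, *_ = np.linalg.lstsq(X, beta, rcond=None)  # [mu, eta_1, ..., eta_n]
    beta_faceted = X @ coef                          # reconstructed item difficulties
    return np.corrcoef(beta, beta_faceted)[0, 1]

# Hypothetical use: beta holds the 30 Rasch difficulties of Inductive Reasoning
# Figures for one country and grade, Q the 30 x n design matrix of facet levels.
```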

Every task has its own design matrix, consisting both of facets that were common to all tasks of a mode (e.g., the item-generating rules) and task-specific facets (e.g., the number of correct answers in the rule generating tasks). The


values per country and grade) and the Q matrix (invariant across grades and countries for a specific task) were input to the analyses. This procedure was repeated for each task, making a total of 8 (tasks) x 4 (grades) x 3 (countries) = 96 analyses.

The LLM is applied here as one of two tests of structural equivalence. This type of equivalence addresses the relationship between measurement outcomes and the underlying construct. The facets of the tasks are assumed to influence the difficulty of the items. For example, it can be expected that rules in items of the letter tasks are easier when they involve more letters. The analysis of structural equivalence examines whether the facets exert an influence on item difficulty in each culture. In more operational terms, structural equivalence is supported if the correlation of each analysis is significantly larger than zero; a significant correlation indicates that the facet levels contribute to the prediction of the item difficulties.

Insert Table 5 about here

Hypothesis test. As can be seen in Table 5, the correlations between the unfaceted Rasch item parameters (of equation 1) and the faceted item parameters (of equation 2) were high for all tasks in each grade in each country. These high correlations provide powerful evidence that the same facets influence item difficulty in each country. It can be concluded that Hypothesis 1a, according to which the item difficulty decomposition would be adequate in each country, was strongly supported.


The second question involves the patterning of the correlations of Table 5. This question was addressed in an analysis of variance, with country (3 levels: Zambia, Turkey, and the Netherlands), stimulus mode (2 levels: figure and letter tests), and type of skill (4 levels: inductive reasoning and each of the three skill component tasks) as independent variables; the correlation was the dependent variable. The four grades were treated as independent replications. As can be seen in Table 6, all main effects and first order interactions were significant. A significantly lower correlation (and hence a poorer fit of the data to the model) was found for figure tasks than for letter tasks (averages of .87 and .91, respectively), F(2, 72) = 24.85, p < .001. The effect was considerable, explaining 22% of the total score variation. About the same percentage was explained by skill components, F(3, 72) = 37.94, p < .001. The lowest correlations were obtained for rule classification (.87) and rule generating (.87), followed by inductive reasoning (.89), while rule testing showed the highest value (.93). The high values of the latter may be due to a combination of a large number of items (long tests) and the presence of both very easy and difficult facet levels in the rule testing tasks in each country; such facet levels increase the dispersion and will give rise to high correlations. Country differences explained about 10% of the score variation; the correlations of the Turkish and Zambian groups were very close to each other (.91 and .90,


score variation, was observed between country and stimulus mode, F(2, 72) = 19.92, p < .001. Whereas the correlations did not differ by more than .03 between the two stimulus modes in Zambia and Turkey, the difference in the Dutch sample was .09.

The interaction of stimulus mode and skill component was also significant, F(3, 72) = 9.61, p < .001. Correlations of rule generating and rule testing differed on average .03 between the two stimulus modes, while a much larger difference of .08 was observed for rule classification. The interaction of country and skill was significant though less important (explaining only 5%). The correlational differences between the cultures were relatively small for rule generating and rule testing and much larger for inductive reasoning and rule classification, mainly due to the relatively low values of the Dutch. Again, ceiling effects may have induced the effect (not necessarily in the sense of the largest attainable score, because some facet levels remained beyond the reach of many pupils, even the highest scorers). In sum, the analysis of the correlations revealed high values for all tasks in the three countries. The observed country differences were presumably more due to ceiling effects than to country differences in modelability of inductive reasoning and its components. Ceiling effects may also explain the lower internal consistencies in the Dutch data, discussed before.

Insert Figures 1 and 2 about here


both Figures is the proximity of the three country curves; this points to the cross-cultural similarity in the pattern of difficult and easy facet levels, which yields further evidence for the structural equivalence of the instruments in the present samples. Furthermore, most facet levels behaved as expected. As for the figure tasks, the third item-generating rule (about alternating additions and subtractions) was invariably the most difficult. Items were more difficult when they dealt with shorter periods, when a variable number of elements were added or subtracted in subsequent figures of a period, when periodicity cues were absent, and when periods did not repeat each other. The number of valid triplets (only present in the rule-generating task) showed large variation. Pupils found it relatively easy to retrieve one correct solution, but relatively difficult to find all solutions when the item contained two or three valid triplets.


Measurement Unit Equivalence in Internal Procedure

Hypothesis. For each task the same facet level difficulties apply in each country (Hypothesis 1b; cf. Table 4).

Statistical procedure. The LLM parameters can also be used to test measurement unit equivalence. This type of equivalence goes beyond structural equivalence by assuming that the tasks as applied in the three countries have the same measurement units (but not necessarily the same scale origins). If the estimated $\eta$ parameters of equation 2 are invariant across countries except for random fluctuations, there is strong evidence for the invariance of the measurement units of the test scales. This invariance would imply that the estimated facet level difficulties in a particular country could be replaced by the difficulty of the same facet in another country without affecting the fit of the model. For these analyses the data for the grades in a country were combined because of the primary interest in country differences.

Hypothesis test. Standard errors of the estimated facet level difficulties ranged from 0.05 to 0.10. As can be derived from Figures 1 and 2, in each task there are facet levels that differ significantly across countries. It can be safely concluded that scores did not show complete measurement unit equivalence.

Yet, it is also clear from these Figures that some facet levels are not significantly different across countries. So, the question arises to what extent facet levels are identical across countries. The question was addressed using intraclass correlations, measuring the absolute agreement of the estimated facet level


correlation of the country by facet level matrix was computed. The letter tasks showed consistently higher values than the figure tasks. The average agreement coefficient was .91 for the figure tasks and .96 for the letter tasks (all intraclass correlations were significantly above zero, p < .001). The high within-task agreement points to an overall strong agreement of facet levels across countries. The estimated facet level difficulties come close to being interchangeable across countries (despite the significant differences of some facet levels).
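
A minimal sketch of such an agreement index, computing a two-way, single-measure intraclass correlation for absolute agreement from a facet-levels-by-countries matrix of estimated $\eta$ parameters. The function name and the particular ICC variant are assumptions; the study's exact routine is not specified here.

```python
import numpy as np

def icc_absolute_agreement(X: np.ndarray) -> float:
    """Two-way, single-measure ICC for absolute agreement.

    X: facet-levels-by-countries matrix of estimated facet level difficulties."""
    n, k = X.shape
    grand = X.mean()
    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((X - grand) ** 2).sum() - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical use: one matrix per task, three columns (Zambia, Turkey, Netherlands).
```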

A recurrent theme in the analysis is the better modelability of the letter tasks as compared to the figure tasks, due to the wider range of facet level difficulties in the letter than in the figure mode. The range differences may be a consequence of the choices made in the test design stage. One of the problems of many existing figure tests is their often implicit definition of permitted stimulus transformations (e.g., rotating and flipping). This lack of clarity, presumably an important source of cross-cultural score differences, was avoided in the present study by spelling out all permitted transformations in the test instructions. Apparently, the price to be paid for providing the pupils with this information is a small variation in facet level difficulties.

Structural Equivalence in External Procedure

Hypothesis. The skill components contribute to inductive reasoning in each country (Hypothesis 2a; cf. Table 4).

Insert Figure 3 about here


type of structural equation model was used, namely a MIMIC model (Multiple Indicators, Multiple Causes; see Van Haaften & Van de Vijver, 1996, for another cross-cultural application). A MIMIC model links input and output through one latent variable (see Figure 3). The core of the model is the latent variable, labeled inductive reasoning. This variable, $\eta$, is measured by the two tasks of inductive reasoning (the output variables). The input to the inductive reasoning factor comes from the skill components; the components are said to influence the latent factor and this influence is reflected in the two tasks of inductive reasoning. In sum, the MIMIC model states that inductive reasoning is measured by two tasks (IRF and IRL) and is influenced by three components (classification, rule generating, and rule testing). The model equations are as follows:

y1 = 1  + 1; (3)

y2 = 2  +  2,

in which y1 and y2 denote observed scores on the two tasks of inductive thinking, 1 and 2 the factor loadings, and 1 and 2 error components. The latent variable, , is linked to the skill components in a linear regression function:

 = 1 x1 + 2 x2 +3 x3 + , (4) where the gammas are the regression coefficients, the x-variables refer to the skill components, and is the error component. In order to make the estimates

identifiable, the factor loading of IRF, 1, was fixed at one.
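
To make equations (3) and (4) concrete, the sketch below simulates data under a MIMIC structure. The sample size, the covariance of the skill components, and all coefficient values are arbitrary illustrations, not estimates from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 700                                    # pupils; illustrative size only

# Observed skill component scores x1..x3 (classification, generating, testing).
cov = np.full((3, 3), 0.3) + np.eye(3) * 0.7
x = rng.multivariate_normal(np.zeros(3), cov, size=n)

# Equation (4): latent inductive reasoning regressed on the components.
gamma = np.array([0.3, 0.4, 0.5])          # illustrative regression coefficients
eta = x @ gamma + rng.normal(0, 0.5, n)    # zeta: residual of the latent variable

# Equation (3): the two inductive reasoning tasks as indicators of eta.
lam = np.array([1.0, 0.9])                 # lambda_1 fixed at 1 for identification
y = np.outer(eta, lam) + rng.normal(0, 0.4, (n, 2))
print(np.corrcoef(y[:, 0], y[:, 1])[0, 1]) # the indicators correlate through eta only
```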


The theoretical model underlying the study stipulates that the three skill components constitute essential elements of inductive reasoning. In terms of the MIMIC analysis, this means that structural equivalence would be supported by a good fit of a model with three input and two output variables as described. Nested models were analyzed. In the first step all parameters were held fixed across data sets, while in subsequent steps similarity constraints were lifted in the following order (cf. Table 7): the error variance (unreliability) of the tasks of inductive reasoning, the intercorrelations of the skill component tasks, the error variance of the latent variable, the regression coefficients, and the factor loadings. The order was chosen in such a way that invariance of relationships involving the latent variable (i.e., regression coefficients and factor loadings) was retained as long as possible. More precisely, structural equivalence would be supported when the MIMIC model with the fewest equality constraints across countries shows a good fit and all MIMIC parameters differ from zero (Hypothesis 2a). It would mean that the tasks of inductive reasoning constitute a single factor that is influenced by the same skill components in each analysis (the possibility that there is a good fit but that some regression coefficients or factor loadings are negative is not further considered here because no covariances were negative).

Insert Table 7 about here


fit statistics when constraints were imposed on the phi matrices (the covariance matrices of the component skills; see the figure of Appendix B); therefore, the model with equal factor loadings, regression coefficients, and error variances was chosen. The letter tasks showed a better fit than the figure tasks, but the choice of a model was less straightforward. A MIMIC model with a similar pattern of free and fixed parameters in both stimulus modes was chosen, mainly because of parsimony (see footnote to Table 7 for a more elaborate explanation).

The standardized solution of the two models is given in Figure 3. As hypothesized, all loadings and regression coefficients were positive and significant (p < .01). It can be concluded that inductive reasoning with figure and letter stimuli involves the same components in each country. This supports structural equivalence, as predicted in Hypothesis 2a. The regression coefficients of the figure component tasks were unequal to each other: rule classification was least important, followed by rule generating, while rule testing showed the largest contribution to inductive reasoning. The letter mode did not show this patterning; the regression coefficients of the component tasks of the letter mode were rather similar to one another.

Measurement Unit Equivalence in External Procedure

Hypothesis. The skill components contribute in the same way to inductive reasoning in each country (Hypothesis 2b; cf. Table 4).

Statistical procedure. Measurement unit equivalence can be scrutinized by introducing and testing equality constraints in the MIMIC model. This type of equivalence would be supported when a single MIMIC model with identical


the ones proposed in the literature. Whereas the latter tend to analyze all tasks in a single exploratory or confirmatory factor analysis, more specific relationships between the tasks are considered here.

Hypothesis test. The psychologically most salient elements of the MIMIC, the factor loadings, regression coefficients, and the explained variance of the latent variable, were found to be invariant across countries. However, measurement unit equivalence also requires the other parameter matrices to be invariant. In the figure mode the model with equality constraints for all matrices showed a rather poor fit, with an NNFI of .88, a GFI of .96, and an RMSEA of .045. An inspection of the delta chi-square values indicated that in particular the introduction of equality of covariances of the skill components ($\Phi$) reduced the fit significantly. The letter tasks showed a similar picture; the most restrictive model revealed values of .89 for the NNFI, .82 for the GFI, and .041 for the RMSEA, which can be interpreted as a rather poor fit. Again, equality of the $\Phi$ matrices led to a significant reduction of the fit. Like in our internal procedure to examine measurement unit equivalence, we found some but inconclusive evidence for the measurement unit equivalence of the task scores across countries; Hypothesis 2b had to be rejected.

Full Score Equivalence

Hypothesis. Both tasks of inductive reasoning show full score equivalence (Hypothesis 3; cf. Table 4).


than two groups and to examine both uniform and nonuniform bias (Mellenbergh, 1982). The combined samples of the three countries are used to determine cutoff scores that split up the sample into three score level groups (low, medium, and high) of about the same size. In the logistic regression procedure, culture (dummy coded), score level, and their interaction are the independent variables, while the item response is the dependent variable. A significant main effect of culture points to uniform bias: individuals from at least one country show an unexpectedly low or high score across all score levels on the item as compared to individuals with the same test score from other cultures. A significant interaction points to nonuniform bias: the systematic difference of the scores depends here on the score level; for example, country differences in scores among low scorers are not found among high scorers. Alpha was set at a (low) level of .001 in the item bias analyses in order to prevent inflation of Type I errors due to multiple testing (although, obviously, the power of the procedure is adversely affected by this choice).
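
A sketch of this logistic regression with statsmodels; the synthetic data frame, its column names, and the values in it are stand-ins introduced here for illustration, not the study data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; in the study each row would be one pupil.
rng = np.random.default_rng(1)
n = 600
df = pd.DataFrame({
    "country": rng.choice(["Zambia", "Turkey", "Netherlands"], n),
    "level": rng.choice(["low", "medium", "high"], n),   # total-score group
})
df["correct"] = rng.integers(0, 2, n)                     # 0/1 response to one item

# Culture (dummy coded), score level, and their interaction predict the item response.
model = smf.logit("correct ~ C(country) * C(level)", data=df).fit(disp=False)
print(model.summary())

# A significant country main effect flags uniform bias; a significant
# country-by-level interaction flags nonuniform bias (alpha = .001 in the study).
```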

Insert Figure 4 about here

Hypothesis test. In the introduction two approaches were mentioned to examine full score equivalence that are based on structural equation modeling: multilevel covariance structure analysis and the modeling of latent means. The former could not be used due to the small number of countries involved, while the latter was precluded because of the incomplete support of measurement unit equivalence. This lack of support indeed prohibits any analysis of full score


items uniform, 11 items nonuniform), mainly involving the Dutch-Zambian comparison. The occurrence of bias was related to the difficulty of the items; both the easiest and most difficult items showed the most bias. The correlation between the presence of bias (0 = absent, 1 = present) and the deviance of the item score from the mean (i.e., average item score - overall average) was .64 (p < .001). The correlation suggests a methodological artifact, such as floor and ceiling effects. This was confirmed by an inspection of the contingency tables underlying the logistic regression analyses. Figure 4 depicts empirical item characteristic curves of two items that showed both uniform and nonuniform bias. The upper panel shows a relatively easy item (with an overall mean of .79) and the lower panel a relatively difficult item (mean of .33). The bias for the easy item is induced by country differences at the lowest score level that are not reproduced at the higher levels. Analogously, the scores for the difficult item remain close to the guessing level (of .20) in the two lowest score levels, while there is more score differentiation in the highest scoring group. The score patterns of Figure 4 were found for several items. It appears that ceiling and floor effects led to item bias in the IRF.


were no facet levels that were either too difficult or too easy for most of the sample, but both types of facet levels were present in the IRL.

Discussion

The equivalence of two tasks of inductive reasoning was examined in a cross-cultural study involving 632 Dutch, 877 Turkish, and 704 Zambian pupils from the highest two grades from primary and the lowest two grades from secondary school. Two stimulus modes were examined: letters and figures. In each mode tasks for inductive reasoning and for each of its components, classification, generation, and testing, were administered. The structural, measurement unit, and full score equivalence of the instruments in these countries were studied. A MIMIC model was fitted to the data, linking skill components to inductive reasoning through a latent variable, labeled inductive reasoning (external procedure). A linear logistic model was utilized to examine to what extent in each country item difficulties could be adequately decomposed into the underlying rules that were used to generate the items (internal procedure). In keeping with past research, structural equivalence was strongly supported; yet, measurement unit equivalence was not fully supported. It is interesting to note that two different statistical models, item response theory (LLM) and structural equation modeling (MIMIC), looking at different aspects of the data (facet level difficulties in the LLM and covariances of tasks with componential skills) yielded the same conclusion about measurement unit equivalence.

Critics might argue that the emphasis on equivalence of the present study is a misnomer and detracts attention from the real cross-cultural differences


position, and Dutch pupils having the highest level. The validity of this conclusion is underscored by the LLM analyses in which it was found that most facet level difficulties are identical and interchangeable across the three countries while a small number is country dependent. Even if score comparisons are restricted to the facet levels with the same difficulties, at least some of the score differences of the countries are likely to remain. In this line of reasoning the current study has demonstrated the presence of at least some but presumably large differences in inductive reasoning, with Western subjects showing the highest skill levels. In my view this interpretation is based on a simplistic and untenable view of country score differences. These differences are not just a matter of differences in inductive reasoning skills. It may well be that differences of country scores on the tasks are partly or entirely due to additional factors. Various educational factors may play a role here, as is often the case in comparisons of highly dissimilar cultural groups. In a meta-analysis Van de Vijver (1997) has found that educational expenditure is a significant predictor of country differences in mental test performance. Does quality of schooling have an influence on inductive reasoning? I concur with Cole (1996), who after reviewing the available cross-cultural evidence, concluded that schooling does not have a formative influence on higher-order forms of thinking but tends to broaden the domains in which these skills can be successfully applied. Schooling facilitates the usage of skills through training and through exposure to psychological and educational tests (cf. Rogoff, 1981; Serpell, 1993). The educational differences of the populations of the current study are massive. For example, attending


observed in the present study as reflecting real differences is based on an underestimation of the impact of various context-related (educational) factors and an overestimation of the ability of the tasks employed here to measure inductive reasoning in all countries. Tasks that capitalize less on schooling and are more derived from everyday experiences may show a different patterning of country differences.

The present results replicate the findings of many studies on structural equivalence; strong support was found that the instruments measure inductive reasoning in the three countries. The present results make it very unlikely that there are major cross-cultural differences in strategies and processes involved in inductive reasoning in the populations studied. These results extend findings of numerous factor analytic studies in showing that skill components contribute in a largely identical way to inductive reasoning and that item difficulty is governed by complexity rules that are largely identical across cultures.


has not been tested or when only structural equivalence has been observed. The present study underscores the need to study equivalence of data before comparing test scores. A more prudent treatment of cross-cultural score differences is badly needed. We have firmly established the commonality of basic cognitive functions in several cultural and ethnic groups (Waitz’s “psychic unity”), but we still have to come to grips with the question of how to design cognitive tests that allow for numerical score comparisons across a wide cultural range.

A final issue concerns the external validity of the present findings: To what populations can the present results be generalized? The three countries involved in the study differ considerably in affluence. Given the strong findings on structural equivalence, it is realistic to assume that inductive reasoning is a universal with largely identical components in schooled populations, at least as of the end of primary school. Future studies should address the question of whether


References

Barrett, P. T., Petrides, K. V., Eysenck, S. B. G., & Eysenck, H. J. (1998). The Eysenck Personality Questionnaire: An examination of the factorial similarity of P, E, N, and L across 34 countries. Personality and Individual Differences, 25, 805-819.

Campbell, D. T., & Naroll, R. (1972). The mutual methodological relevance of anthropology and psychology. In F. L. K. Hsu (Ed.), Psychological anthropology. Cambridge, MA: Schenkman.

Carroll, J. B. (1993). Human cognitive abilities. A survey of factor-analytic studies. Cambridge: Cambridge University Press.

Claassen, N. C., & Cudeck, R. (1985). Die faktorstruktuur van die Nuwe Suid-Afrikaanse Groeptoets (NSAG) by verskillende bevolkingsgroepe [The factor structure of the New South African Group Test (NSAGT) in various population groups]. South African Journal of Psychology, 15, 1-10.

Cole, M. (1996). Cultural psychology: A once and future discipline. Cambridge, MA: Harvard University Press.

Ekstrom, R. B., French, J. W., & Harman, H. H. (1976). Kit of factor-referenced tests. Princeton, NJ: Educational Testing Service.

Ellis, B. B. (1990). Assessing intelligence cross-nationally: A case for differential item functioning detection. Intelligence, 14, 61-78.

Ellis, B. B., Becker, P., & Kimmel, H. D. (1993). An item response theory evaluation of an English version of the Trier Personality Inventory (TPI). Journal of Cross-Cultural Psychology, 24, 133-148.


Fan, X., Willson, V. L., & Reynolds, C. R. (1995). Assessing the similarity of the factor structure of the K-ABC for African-American and White children. Journal of Psychoeducational Assessment, 13, 120-131.

Fischer, G. H. (1974). Einführung in die Theorie psychologischer Tests [Introduction to the theory of psychological tests]. Bern: Huber.

Fischer, G. H. (1995). The linear logistic test model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York: Springer.

Frijda, N., & Jahoda, G. (1966). On the scope and methods of cross-cultural research. International Journal of Psychology, 1, 109-127.

Geary, D. C., & Whitworth, R. H. (1988). Is the factor structure of the WISC-R different for Anglo- and Mexican-American children? Journal of Psychoeducational Assessment, 6, 253-260.

Greenfield, P. M. (1997). You can't take it with you: Why ability assessments don't cross cultures. American Psychologist, 52, 1115-1124.

Gustafsson, J-E. (1984). A unifying model for the structure of intellectual abilities. Intelligence, 8, 179-203.

Hakstian, A. R., & Vandenberg, S. G. (1979). The cross-cultural generalizability of a higher-order cognitive structure model. Intelligence, 3, 73-103.

Hakstian, A. R., & Whalen, T. E. (1976). A k-sample significance test for independent alpha coefficients. Psychometrika, 41, 219-231.


Holland, P. W., & Wainer, H. (Eds.) (1993). Differential item functioning. Hillsdale, NJ: Erlbaum.

Hunter, J. E., Schmidt, F. L., & Hunter, R. (1979). Differential validity of employment tests by race: A comprehensive review and analysis. Psychological Bulletin, 86, 721-735.

Irvine, S. H. (1969). Factor analysis of African abilities and attainments: Constructs across cultures. Psychological Bulletin, 71, 20-32.

Irvine, S. H. (1979). The place of factor analysis in cross-cultural methodology and its contribution to cognitive theory. In L. Eckensberger, W. Lonner, & Y. H. Poortinga (Eds.), Cross-cultural contributions to psychology. Lisse, the Netherlands: Swets & Zeitlinger.

Irvine, S. H., & Berry, J. W. (1988). The abilities of mankind: A revaluation. In S. H. Irvine & J. W. Berry (Eds.), Human abilities in cultural context. Cambridge: Cambridge University Press.

Jahoda, G., & Krewer, B. (1997). History of cross-cultural and cultural psychology. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (2nd ed., Vol. 1). Chicago: Allyn & Bacon.

Jensen, A. R. (1977). Cumulative deficit in intelligence of Blacks in the rural South. Developmental Psychology, 13, 184-191.

Jensen, A. R. (1980). Bias in mental testing. New York: Free Press.

Little, T. D. (1997). Mean and covariance structures (MACS) analyses of cross-cultural data: Practical and theoretical issues. Multivariate Behavioral Research, 32, 53-76.


Mellenbergh, G. J. (1982). Contingency table models for assessing item bias. Journal of Educational Statistics, 7, 105-118.

Messick, S. (1988). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed). Hillsdale, NJ: Erlbaum.

Muthén, B. O. (1991). Multilevel factor analysis of class and student achievement components. Journal of Educational Measurement, 28, 338-354.

Muthén, B. O. (1994). Multilevel covariance structure analysis. Sociological Methods & Research, 22, 376-398.

Naglieri, J. A., & Jensen, A. R. (1987). Comparison of Black-White differences on the WISC-R and the K-ABC: Spearman's hypothesis. Intelligence, 11, 21-43.

Poortinga, Y. H., & Van de Vijver, F. J. R. (1987). Explaining cross-cultural differences: Bias analysis and beyond. Journal of Cross-Cultural Psychology, 18, 259-282.

Ree, M. J., & Carretta, T. R. (1995). Group differences in aptitude factor structure on the ASVAB. Educational and Psychological Measurement, 55, 268-277.

Reschly, D. (1978). WISC-R factor structures among Anglos, Blacks, Chicanos, and Native-American Papagos. Journal of Consulting and Clinical Psychology, 46, 417-422.

Rogers, H. J., & Swaminathan, H. (1993). A comparison of logistic regression and Mantel-Haenszel procedures for detecting differential item functioning. Applied Psychological Measurement, 17, 105-116.


Sandoval, J. (1982). The WISC-R factorial validity for minority groups and Spearman's hypothesis. Journal of School Psychology, 20, 198-204.

Serpell, R. (1979). How specific are perceptual skills? British Journal of Psychology, 70, 365-380.

Serpell, R. (1993). The significance of schooling. Life journeys in an African society. Cambridge: Cambridge University Press.

Sternberg, R. J. (1977). Intelligence, information processing, and analogical reasoning: The componential analysis of human abilities. New York: Wiley.

Sternberg, R. J., & Kaufman, J. C. (1998). Human abilities. Annual Review of Psychology, 49, 479-502.

Sung, Y. H., & Dawis, R. V. (1981). Level and factor structure differences in selected abilities across race and sex groups. Journal of Applied Psychology, 66, 613-624.

Taylor, R. L., & Ziegler, E. W. (1987). Comparison of the first principal factor on the WISC-R across ethnic groups. Educational and Psychological Measurement, 47, 691-694.

Thurstone, L. L. (1938). Primary mental abilities. Psychometric Monographs, No. 1.

United Nations (1999). Indicators on education [On-line]. Available Internet: www.un.org/depts/unsd/social/education.htm.

Valencia, R. R., & Rankin, R. J. (1986). Factor analysis of the K-ABC for groups of Anglo and Mexican American children. Journal of Educational


Valencia, R. R., Rankin, R. J., & Oakland, T. (1997). WISC-R factor structures among White, Mexican American, and African American children: A research note. Psychology in the Schools, 34, 11-16.

Van de Vijver, F. J. R. (1986). The robustness of Rasch estimates. Applied Psychological Measurement, 10, 45-57.

Van de Vijver, F. J. R. (1997). Meta-analysis of cross-cultural comparisons of cognitive test performance. Journal of Cross-Cultural Psychology, 28, 678-709.

Van de Vijver, F. J. R., & Leung, K. (1997a). Methods and data analysis of comparative research. In J. W. Berry, Y. H. Poortinga, & J. Pandey (Eds.), Handbook of cross-cultural psychology (2nd ed., Vol. 1). Chicago: Allyn & Bacon.

Van de Vijver, F. J. R., & Leung, K. (1997b). Methods and data analysis for cross-cultural research. Newbury Park, CA: Sage.

Van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory. New York: Springer.

Van Haaften, E. H., & Van de Vijver, F. J. R. (1996). Psychological consequences of environmental degradation. Journal of Health Psychology, 1, 411-429.

Willemsen, M. E., & Van de Vijver, F. J. R. (under review). Context effects in logical reasoning in the Netherlands and Zambia.

Zuckerman, M., Kuhlman, D. M., Thornquist, M., & Kiers, H. A. L. (1991). Five (or three) robust questionnaire scale factors of personality without culture.


Table 1

Sample Size per Culture, Grade, and Experimental Condition

                                       Grade^a
Country        Test condition     5     6     7     8   Total
Zambia         Figure            80    79    94   123     376
               Letter            81    81    87    79     328
Turkey         Figure           127    97    95   102     421
               Letter           139   107   110   100     456
Netherlands    Figure           117    74    51    77     319
               Letter            83    91    77    62     313
Total                           627   529   514   543    2213


Table 2

Average Proportion of Correctly Solved Items per Task, Grade, and Culture

                             Task
Country       Grade   IRF   RCF   RGF   RTF   IRL   RCL   RGL   RTL
Zambia          6     .40   .44   .53   .39   .49   .40   .37   .41
                7     .55   .53   .55   .43   .56   .60   .47   .56
                8     .56   .64   .55   .53   .58   .61   .44   .58
                9     .62   .68   .64   .54   .61   .54   .48   .58
Turkey          5     .47   .51   .44   .42   .50   .54   .39   .49
                6     .48   .57   .56   .46   .52   .53   .42   .47
                7     .66   .73   .64   .58   .64   .70   .56   .65
                8     .65   .75   .69   .63   .64   .71   .52   .60
Netherlands     5     .67   .80   .65   .64   .60   .63   .51   .58
                6     .74   .73   .72   .67   .64   .68   .57   .66
                7     .70   .74   .70   .66   .68   .72   .63   .67
                8     .78   .84   .76   .77   .70   .74   .60   .69


Table 3

Effect Sizes of Multivariate Analyses of Variance of the Psychological Tests per Test Mode

                                                  Skill component
Independent          Multi-      Inductive   Rule             Rule         Rule
variable             variate^a   reasoning   classification   generating   testing

(a) Figure mode
Country (C)          .135***     .132***     .183***          .139***      .200***
Grade (G)            .073***     .108***     .149***          .132***      .125***
Sex (S)              .011*       .001        .002             .001         .010**
C × G                .035***     .013*       .063***          .041***      .014*
C × S                .012**      .016***     .017***          .014***      .014***
G × S                .009**      .004        .006             .004         .011**
C × G × S            .009*       .010        .007             .008         .004

(b) Letter mode
Country              .102***     .078***     .113***          .130***      .113***
Grade                .061***     .122***     .114***          .088***      .125***
Sex                  .014**      .010**      .002             .000         .001
C × G                .030***     .035***     .051***          .028***      .051***
C × S                .014***     .014**      .002             .013**       .016***
G × S                .005        .007        .000             .003         .001
C × G × S            .014***     .017**      .012             .018**       .009

Note. Significance levels of the effect sizes refer to the probability level of the corresponding F ratio of the independent variable(s).
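The entries in Table 3 are proportions between 0 and 1. Assuming they are (partial) eta-squared coefficients, the customary proportion-of-variance effect size accompanying (M)ANOVA F tests (an assumption; the table itself does not name the statistic), each univariate entry relates to its F ratio as

$$ \hat{\eta}^2_{\text{partial}} = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}} = \frac{df_{\text{effect}}\,F}{df_{\text{effect}}\,F + df_{\text{error}}} . $$

Read this way, country and grade account for considerably more score variance than sex in both stimulus modes.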


Table 4

Overview of the Hypothesis Tests and the Statistical Models Used

Internal procedure: focus on tests of inductive reasoning
  Statistical aspects
    Question examined: Are facet level difficulties and item difficulties related?
    Statistical model used: linear logistic model
  Conditions for equivalence
    Structural equivalence: correlations significant in each country (Hypothesis 1a)
    Measurement unit equivalence: correlations significant and identical across countries (Hypothesis 1b)

Internal procedure: focus on tests of inductive reasoning
  Statistical aspects
    Question examined: Is there item bias?
    Statistical model used: logistic regression
  Conditions for equivalence
    Full score equivalence: absence of item bias (Hypothesis 3)

External procedure: focus on relationship of skill components and inductive reasoning
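Table 4 lists logistic regression as the procedure for the item bias check of Hypothesis 3. The sketch below illustrates such a screen; the data frame and column names (responses, rest_score, country, irf_item_07) are hypothetical, and the code follows the common logistic-regression approach to differential item functioning (cf. Rogers & Swaminathan, 1993) rather than reproducing the exact analysis reported here.

```python
# Minimal logistic-regression DIF screen (illustrative sketch; variable names are hypothetical).
# An item is flagged when country membership predicts the item response over and above
# the rest score: the country main effect signals uniform DIF, the score-by-country
# interaction signals nonuniform DIF.
import pandas as pd
import statsmodels.formula.api as smf

def dif_screen(df: pd.DataFrame, item: str, score: str = "rest_score",
               group: str = "country") -> dict:
    """Return likelihood-ratio statistics for uniform and nonuniform DIF on one item."""
    base = smf.logit(f"{item} ~ {score}", data=df).fit(disp=0)
    uniform = smf.logit(f"{item} ~ {score} + C({group})", data=df).fit(disp=0)
    nonuniform = smf.logit(f"{item} ~ {score} * C({group})", data=df).fit(disp=0)
    return {
        "uniform_lr_chi2": 2 * (uniform.llf - base.llf),          # df = number of groups - 1
        "nonuniform_lr_chi2": 2 * (nonuniform.llf - uniform.llf),  # df = number of groups - 1
    }

# Hypothetical usage with 0/1 item scores and a rest score per pupil:
# flags = dif_screen(responses, item="irf_item_07")
```

Comparing the likelihood-ratio statistics with chi-square critical values (df = 2 with three countries) flags items whose difficulty or discrimination shifts across countries at a given score level.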


Table 5

Accuracy of the Design Matrices per Task and per Country: Means (and Standard Deviations) of Correlation

Stimulus mode

Figures Letters

Skill Zam Tur Net Zam Tur Net


Table 6

Analysis of Variance of Correlations with Country, Stimulus Mode, and Skill as Independent Variables

Source df F Variance explained


Table 7

Fit Indices for Nested Multiple Indicators, Multiple Causes Models of Figure and Letter Tasks

                                            Contribution to χ² per
                                            country (percentage)
Invariant matrices   χ² (df)                Zam   Tur   Net    NNFI   GFI   RMSEA   Δχ² (Δdf)

(a) Figure mode
                     533.97*** (167)         21    32    47    .88    .96   .045
                     437.28*** (145)         19    35    47    .89    .96   .043     96.69*** (22)
                     180.52*** (79)          20    27    53    .93    .98   .034    256.76*** (66)
                     134.01*** (46)          20    19    61    .91    .98   .042     46.51 (33)
                      98.72*** (35)          20    18    62    .90    .99   .041     35.29*** (11)

(b) Letter mode
                     473.33*** (167)         33    33    34    .89    .82   .041
                     364.63*** (145)         28    30    41    .91    .87   .037    108.70*** (22)
                     180.60*** (79)          35    26    39    .92    .90   .034    184.03*** (66)
                      82.07** (46)           56    25    19    .95    .94   .027     98.53*** (33)
                      61.61** (35)           54    24    23    .96    .94   .027     20.46* (11)


131.54, p < .001; GFI = .90, NNFI = .95, RMSEA = .029). The two analyses confirmed that choosing a model with equal regression coefficients across countries for the letter mode does not eliminate relevant country differences. Net = the Netherlands; Tur = Turkey; Zam = Zambia; NNFI = Nonnormed Fit Index; GFI = Goodness of Fit Index; RMSEA = Root Mean Square Error of Approximation; Δχ² = decrease in the χ² value.
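The models compared in Table 7 are multiple indicators, multiple causes (MIMIC) models in which the skill-component tasks serve as causes and the inductive reasoning tasks as indicators of a latent inductive reasoning factor. As a sketch of this model class in standard LISREL notation (a restatement of the general MIMIC structure, not of the exact matrices constrained in each row):

$$ \eta = \Gamma x + \zeta, \qquad y = \Lambda_y \eta + \varepsilon , $$

with x the vector of rule classification, rule generating, and rule testing scores, y the vector of inductive reasoning scores, \eta the latent factor, \Gamma the cause (regression) coefficients, and \Lambda_y the factor loadings. In this framework, equality of \Gamma and \Lambda_y across countries is the usual operationalization of measurement unit equivalence, and the Δχ² column reports the loss of fit incurred when such cross-country constraints are added.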


Figure Captions

Figure 1. Estimated facet level difficulties per test and country of the figure mode. Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; P3: Number of figures per period: 3; P4: Number of figures per period: 4; D2: Number of different elements of subsequent figures: 2; D3: Number of different elements of subsequent figures: 3; DV: Number of different elements of subsequent figures: variable; V: Variation across periods: variable; C: Periodicity cues: absent; PR: Periods repeat each other: no; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.

Figure 2. Estimated facet level difficulties per test and country of the letter mode. Note. The first level of each facet (see Appendix A), arbitrarily set to zero, is not presented in the figure. Net = Netherlands. Tur = Turkey. Zam = Zambia. R2: Item rule: 2; R3: Item rule: 3; R4: Item rule: 4; R5: Item rule: 5; L2: Number of letters: 2; L3: Number of letters: 3; L4: Number of letters: 4; L5: Number of letters: 5; L6: Number of letters: 6; LV: Number of letters: variable; D2: Difference in positions in alphabet: 2; D3: Difference in positions in alphabet: 3; D4: Difference in positions in alphabet: 4; V2: Number of valid triplets: 2; V3: Number of valid triplets: 3; V4: Number of valid triplets: 4; V5: Number of valid triplets: 5; F: One of the alternatives follows the rule: no; NR: Rows repeat each other: no.
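The facet level difficulties plotted in Figures 1 and 2 are the basic parameters of the linear logistic test model listed in Table 4. As a sketch in the model's standard notation (cf. Fischer, 1995), rather than a restatement of the exact estimation setup used here: the probability that person v solves item i follows the Rasch model, and each item difficulty is constrained to be a weighted sum of facet level difficulties,

$$ P(X_{vi} = 1 \mid \theta_v) = \frac{\exp(\theta_v - \beta_i)}{1 + \exp(\theta_v - \beta_i)}, \qquad \beta_i = \sum_j q_{ij}\, \eta_j + c , $$

where \theta_v is the ability of person v, \beta_i the difficulty of item i, q_{ij} the design-matrix weight indicating whether facet level j applies to item i, \eta_j the difficulty of facet level j (the values shown in the figures), and c a normalization constant. The first level of each facet serves as the reference category fixed at zero, which is why it is omitted from the plots.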


[Figure 1 appears here: four panels, (a) Inductive Reasoning Figures, (b) Rule Classification Figures, (c) Rule Generating Figures, (d) Rule Testing Figures; each plots Difficulty (y-axis) against Facet level (x-axis).]


[Figure 2 appears here: four panels, (a) Inductive Reasoning Letters, (b) Rule Classification Letters, (c) Rule Generating Letters, (d) Rule Testing Letters; each plots Difficulty (y-axis) against Facet level (x-axis).]


[Figure appears here: y-axis values ranging from 0.4 to 1; x-axis categories Low, Medium, and High.]


Appendix A: Test Facets

The following table provides a description of the facets of the examples of the figure tests:

                                                            Test
Facet^a                                       Level      IRF   RCF   RGF^b   RTF

Item rule                                     1          * *
                                              2          *
                                              3          *
Number of figures per period                  2          -
                                              3          * -
                                              4          * - *
Number of different elements of               1          * * *
subsequent figures                            2
                                              3          *
                                              Variable   - -
Variation across periods                      Constant   * * - *
                                              Variable   -
Periodicity cues                              Present    - -
                                              Absent     - * -
Periods repeat each other                     Yes        * - *
                                              No         * -
Number of valid triplets                      1          - - -
                                              2          - - * -
                                              3          - - -
One of the alternatives follows the rule      Yes        - * - *
                                              No         - -
Rows repeat each other                        Yes        - - - *
                                              No         - - -

Note. An asterisk indicates that the corresponding facet level applies to the item; a dash indicates that the facet is not present in the test. ^a See the text for an explanation of the facets.
