
Creating a Pretest for an Intelligence Test

Jochem Bout

Master thesis 595887

University of Amsterdam
Harrie C. M. Vorst, supervisor


Abstract

The goal of this research was to create a pretest for an intelligence test. Intelligence tests can be costly and time consuming. When the test is used for selection, a pretest can be used as a tool to advise test takers on taking the intelligence test: the pretest should predict whether a person will be selected or not. In this article it is assumed that the items for the pretest are picked from a large item pool. Following the external item selection method, the focus was on finding heterogeneous items that predicted whether someone would be selected. For the construction of a logistic regression model six cognitive tests were used. The first and second model consisted of the items with the highest item-rest correlation on each test, but for the second model the item-rest correlations were corrected for the time limit of the test. From these items, predictors were selected with a backwards stepwise regression procedure. The accuracy of both models was cross-validated and was moderate. The first model outperformed the second model, but it is advised to correct for the time limit because of the generalizability of the model.


Creating a Pretest for an Intelligence Test

Intelligence tests are often used by companies or associations for selection. These intelligence tests are costly and can be time consuming to administer. When used for selection, an intelligence test can give disappointing results for test takers. Therefore, it can be useful to create a take-home pretest that predicts whether a person will be selected when taking the intelligence test.

There are different methods for constructing tests. These methods focus on different kinds of test characteristics, such as construct validity or reliability. For a pretest for an intelligence test one would like to construct a predictive test. A test construction method that focuses on predictive or criterion validity is the external method (Burisch, 1984). This method does not require any theory, and concepts are established by the criterion (Mumford & Owens, 1987). With the external method it is important to have heterogeneous items (Oosterveld, 1996). When the item pool is heterogeneous, all the different items contribute to the prediction of the criterion. The criterion to be predicted is the cutoff score on the intelligence test at which participants are selected. This cutoff score creates a selected and a non-selected group. A disadvantage of heterogeneous items is that they give the test a low reliability, but with predictiveness as the main focus this is not necessarily a problem. Often, an intelligence test is a complex measurement tool that measures many factors. To increase the predictive power of the pretest it is important to use items that reflect these different factors. Therefore, it is best to have an item pool with different types of cognitive ability tests from which the predictors or items for the pretest are selected.

One of the requirements for a pretest is that it does not consume too much of a test taker's time. When there is a large item pool to pick items from, an item selection method is necessary. The items used for selection should discriminate between the selected group and the non-selected group. A measure that indicates the discrimination of an item is the item-total correlation (Mellenbergh, 2011): the product moment correlation between the item score and the total score on a test. However, the item-total correlation has a drawback because the item score is part of the total score. This does not have a large effect on the correlation when a test has many items, but with short tests this effect cannot be ignored. An adjusted version of the item-total correlation is the item-rest correlation: the product moment correlation between the item score and the total score on the test without the item. This index can be used to select the most discriminating items on each test in an item pool.
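As an illustration, a minimal R sketch of the item-rest correlation, assuming a 0/1-scored item matrix with one row per participant (the data and names are hypothetical, not the thesis data):

```r
# Item-rest correlation: correlate each item with the total score minus that item
item_rest <- function(items) {
  total <- rowSums(items)
  sapply(seq_len(ncol(items)), function(j) cor(items[, j], total - items[, j]))
}

# Hypothetical example: 689 participants answering 40 dichotomous items
set.seed(1)
items <- matrix(rbinom(689 * 40, 1, 0.5), nrow = 689)
rir   <- item_rest(items)
top10 <- order(rir, decreasing = TRUE)[1:10]  # the ten most discriminating items
```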

To predict whether someone will be selected based on the intelligence test a logistic regression model can be used (Field, Miles & Field, 2012). The best discriminating items are not necessarily the best predicting items. To find the best predicting items among the most discriminating items the backwards stepwise regression procedure can be used. This procedure selects predictors based on the Akaike Information Criterion (AIC) of a model with or without the predictor. It is an iterative process that starts with a regression model in which all predictors are entered. After the procedure the best predicting items remain. The criticism on the stepwise regression procedure is that it can result in models that are hard to reproduce and that can be unstable (Austin & Tu, 2004). To account for this problem cross-validation is necessary (Hays, Reas & Shaw, 2002).
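A minimal R sketch of this step, assuming a data frame train with the candidate items as 0/1 columns and a binary criterion selected (names are illustrative; the thesis does not specify its tooling):

```r
library(MASS)

# Start from the full model with all candidate items entered as predictors
full <- glm(selected ~ ., data = train, family = binomial)

# Backwards stepwise selection: iteratively drop the predictor whose removal
# lowers the AIC, until no removal improves the model
final <- stepAIC(full, direction = "backward", trace = FALSE)
summary(final)
```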

To cross-validate the model a data set can be divided into a training and a test sample. When a non-parsimonious model is used, which is the case for the pretest, it is best to have a large training sample (Browne, 2000). For example, use 80% of the data for training the model and 20% to validate it. When only a small group of the sample scores high enough on the intelligence test to be selected, a smaller proportion should be considered.
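An 80/20 split could look like the following sketch (the data frame d and the proportion are illustrative):

```r
set.seed(2021)
n     <- nrow(d)
idx   <- sample(seq_len(n), size = round(0.8 * n))  # 80% for training
train <- d[idx, ]
test  <- d[-idx, ]
```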

When some tests in the item pool used for prediction are time limited, it can be useful to correct the item-rest correlation, because the time limit does not apply to the pretest. This can be done by excluding items that are affected by the time limit of the test and recalculating the item-rest correlations. When the data has information on missing answers, it is suggested to exclude the items after the point where most participants were not able to answer them. When this information is unavailable a different approach is necessary. When the items are multiple choice and arranged by difficulty, items can be excluded from the point where the correct response proportion falls below chance. In case the items are not multiple choice or not arranged by difficulty, things become more complicated. In the ideal scenario a drop in the correct response proportion of the items can be observed and items can be excluded after this drop.
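A sketch of the multiple choice case, assuming items are ordered by difficulty and have k response options (the variable names are hypothetical; item_rest is the function from the earlier sketch):

```r
prop_correct <- colMeans(items)         # proportion correct per item, in test order
k <- 5                                   # assumed number of response options
below_chance <- which(prop_correct < 1 / k)
if (length(below_chance) > 0) {
  keep  <- seq_len(min(below_chance) - 1)  # keep items before the first drop
  items <- items[, keep]
}
rir_corrected <- item_rest(items)        # recalculate item-rest correlations
```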

A pretest can be useful, but it also has an effect on the eventual score of the intelligence test it is created for. Hausknecht, Halpert, Di Paolo and Moriarty Gerrard (2007) did research on practice effects of cognitive tests. Results from a meta-analysis indicate that participants can score up to half a standard deviation higher due to practice effects. This should be taken into consideration when setting the cutoff score for selection. Sometimes the participants whose data are used to create the pretest took many cognitive tests prior to taking the intelligence test. It can then be necessary to decrease the cutoff score for selection when training the regression model.

The focus lies on predictive validity when using the external method to create a test, but it can be insightful to look at the intercorrelations of the items on the pretest as well. These correlations can be used to investigate whether the items are truly measuring different constructs. Construct validity is indicated when the intercorrelations between items that originate from the same test are higher than those between items from different tests. To calculate the intercorrelations, tetrachoric correlations can be used (Muthén & Hofacker, 1988). It is expected that overall these correlations will be weak, because good predictors should not be highly correlated. The correlations can be visualized with a network that can be inspected (Epskamp, Cramer, Waldorp, Schmittmann & Borsboom, 2012). In the network the items on the pretest are represented by nodes and the intercorrelations by edges.
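A sketch of this inspection with the psych and qgraph packages, assuming pretest_items holds the dichotomous scores of the selected predictors (the significance filtering of the edges is omitted here for brevity):

```r
library(psych)
library(qgraph)

tc <- tetrachoric(pretest_items)$rho    # tetrachoric correlation matrix
qgraph(tc, layout = "spring",
       labels = colnames(pretest_items))  # nodes are items, edges are correlations
```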

The method described for creating a pretest for an intelligence test was tested. The item pool from which the predictors were selected consisted of six cognitive tests, from which a maximum of 30 items were selected. Three of these tests were affected by a time limit. Therefore, two regression models were created. For the first model all items in the item pool were used and for the second model the items influenced by the time limit were excluded. Afterwards these models were compared. It was expected that the first model would outperform the second model on accuracy.

Method

Participants

In this research the data of three psychology test weeks (2004, 2006 and 2007) at the University of Amsterdam were used. The participants were 1553 first year psychology students for whom the test weeks were a mandatory part of the curriculum. Their age ranged from 16 to 52 years (M = 21.01, SD = 4.64; age unknown for 42 participants). Of the participants, 444 (28.59%) were male, 1069 (68.83%) female and 40 (2.58%) unknown.

Materials

To predict whether someone was intelligent enough to score above the predetermined cutoff score on the intelligence structure test (intelligentie structuur test, IST; Vorst, van Osch & Muradin, 2010), six Dutch cognitive tests were used (Elshout, 1976): 'conclusions' (conclusies), 'series' (reeksen), 'computations' (rekensnelheid), 'verbal analogies' (verbale analogieën), 'hidden figures' (verborgen figuren) and 'vocabulary' (vocabulair).

'Conclusions' measured the deductive capacity of a participant. It consisted of 40 multiple choice items that were ordered by increasing item difficulty. Participants had eight minutes to complete the test. They had to choose what kind of relationship two variables had, based on the relationship those variables had with a third variable. An example of a question is: 'what is the relation between A and C? A = B > C'. The possible answers were '< (smaller than), > (bigger than), ≤ (smaller than or equal to), ≥ (bigger than or equal to) or ∆ (no relationship can be deducted from the current information)'. The total score was the sum of all items correctly answered and ranged from 0 to 40.


'Series' consisted of 30 multiple choice items that a participant had to complete within 12 minutes. The items were ordered by increasing difficulty. A participant had to find the correct number that followed in a number series. An example of an item is: 'What is the number that follows in the series? 1, 3, 5, 7, 9'. The possible answers were '10, 11, 12, 13 or 14'. The sum of all correctly answered items was computed to get a total score that ranged from 0 to 30.

The computational speed of a participant was measured with 'computations'. This test consisted of 90 math problems involving addition, subtraction, multiplication and division. The participants were instructed to answer the items in order. They had to answer as many items as possible in four minutes. An example of an item is: '133 - 6 ='. The total score was the sum of the correctly answered items and ranged from 0 to 90.

To measure verbal abstract reasoning the test 'verbal analogies' was used. This test consisted of two parts with 20 items each. The participants had five minutes to finish the multiple choice items of each part. They had to compare the relationship between two words to the relationship between two other words. An example of a translated question is: 'body : food = machine : ?'. The question had the following response options: 'wheels, fuel, movement and fire'. The total score was computed by summing all correct answers; it ranged from 0 to 40.

'Hidden figures' measured figural abstract reasoning. This test consisted of two parts that contained 16 items each. Participants had ten minutes to finish each part. Each question consisted of a figure in which one of five figures was hidden. For each item the answer possibilities were the same. The total score was the sum score of the items correctly answered and ranged from 0 to 32.

'Vocabulary' was measured with 40 multiple choice items. Participants had ten minutes to finish the test. They had to give the correct synonym for a word. An example of a translated word with possible answers is: 'apocryphal =' 'sectarian, vague, unrecognized or disguised'. The score ranged from 0 to 40 and was computed by summing all correct answers.


The IST is a Dutch adaptation of the German Intelligenz-Struktur-Test (Liepmann, Beauducel, Brocke & Amthauer, 2000, as cited in Vorst, van Osch & Muradin, 2010). The design of the intelligence test was based on theories of Thurstone, Vernon and Cattell and is rather complex. It used a 3x3 facet design to describe intelligence: based on a combination of three problem areas and three cognitive processes, nine facets of intelligence were measured. The problem areas are verbal, numeric and figural; the cognitive processes are general reasoning, remembering and common knowledge. This leads to the following facets: verbal reasoning, numerical reasoning, figural reasoning, verbal memory, numerical memory, figural memory, verbal knowledge, numerical knowledge and figural knowledge. The test consisted of 297 items. The total score was the sum of all correct items and ranged from 0 to 297. The cutoff score for the selection condition was 208 and above. This score is equal to an IQ score of 120 and higher, based on the norm group in the manual (Vorst, van Osch & Muradin, 2010).

Procedure

The administered tests were part of a series of tests given during five sessions of the psychology test weeks. These sessions took place in the computer area of a library of the University of Amsterdam. Every other week participants had to attend a two-hour session. These sessions were held in silence and were monitored by experimenters.

Analyses

For the analyses, the data of the six cognitive tests and the IST had to be combined. There was some missing data in the data set: for the test week of 2006, eight items of the IST were missing on the numerical reasoning facet. The items were missing for all 456 participants of that year. It is likely that the data are missing at random (MAR); data are MAR when the likelihood of data missing for a variable is related to the value of other variables in the analytic model (Cox, McIntosh, Reason & Terenzini, 2014). Because the data were MAR, imputation was used to estimate the missing values. The missing items were imputed using logistic regression; the model was based on the other items on the facet and the data of the years 2004 and 2007. Some participants were excluded because they did not take the IST or one of the six cognitive tests: of the 1553 participants in the original data set, 73 were excluded because they did not take the IST and 101 because they did not take one or more of the six cognitive tests. To train the logistic regression model and cross-validate it, the participants were randomly assigned to a training sample and a test sample.
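The thesis does not state which imputation tooling was used; one plausible route is logistic regression imputation with the mice package, sketched here for an assumed data frame facet of dichotomous items with missing values:

```r
library(mice)

facet[] <- lapply(facet, as.factor)                      # logreg needs binary factors
imp <- mice(facet, method = "logreg", m = 1, seed = 1)   # logistic regression imputation
facet_complete <- complete(imp)                          # data with estimated values filled in
```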

Results

For the analyses 689 participants were assigned to the training sample. Participants had a mean age of 21.21 years (SD = 4.88, range = 16-52). Table 1 shows the distribution of sex over the selection conditions. Males and females were not equally distributed over the conditions, χ²(13) = 33.95, p < .001; there were relatively more males in the selected condition than females. Therefore the performance statistics of the logistic regression model will be reported for males and females separately. The test sample contained 690 participants with a mean age of 20.89 years (SD = 4.30, range = 17-51). Here too the distribution of males and females was unequal over the conditions, χ²(13) = 33.24, p < .001, with relatively more males in the selected condition than females.

Table 1: Frequencies and Percentages of Participants Distributed over Sex and Selection Conditions for Training and Test Sample.

                  Training sample                              Test sample
Condition         Male (%)      Female (%)    Total (%)        Male (%)      Female (%)    Total (%)
Non-selected      158 (22.93)   479 (69.52)   637 (92.45)      161 (23.33)   467 (67.68)   628 (91.01)
Selected          33 (4.79)     19 (2.75)     52 (7.54)        38 (5.51)     24 (3.48)     62 (8.99)
Total             191 (27.72)   498 (72.28)   689 (100)        199 (28.84)   491 (71.16)   690 (100)

Model 1: Item-Rest Correlation

For the first model, the items used in the stepwise regression procedure were the ten items with the highest item-rest correlation on each cognitive test. In Table 2 the reliability for each test is shown. All tests had a high reliability except for 'vocabulary'.


Table 2: Number of Items and Cronbach's α with 95% CI for the Six Cognitive Tests.

Test               Items   95% CI LL   Cronbach's α   95% CI UL
Conclusions        40      .88         .89            .91
Series             30      .68         .71            .75
Computations       90      .89         .90            .91
Verbal analogies   40      .74         .77            .80
Hidden figures     32      .89         .90            .92
Vocabulary         40      .51         .56            .61

n = 689

The 60 items with the highest item-rest correlations, ten items of each test, were entered into a logistic regression model that predicted whether participants were in the selected or the non-selected condition. After entering the 60 items into the model, some items had to be replaced because of high standard errors of the beta coefficients. These high standard errors were caused by a lack of variance of the item over the conditions or by multicollinearity. Table 3 shows which items were excluded from the model and which items were entered in their place. This model had a better fit than a model without predictors, χ²(60) = 207.83, p < .001, Nagelkerke's R² = .63, AIC = 282.88.


Table 3: Item-Rest Correlation (rir), Mean (M) Correct and Standard Deviations (SD) of Items (i) Used for Model 1.

Each entry: i, M (SD), rir.

Conclusions        Series             Computations        Verbal analogies    Hidden figures      Vocabulary
24 .40 (.49) .66   18 .30 (.46) .46   *20 .44 (.50) .70   *II6 .84 (.37) .43  II11 .45 (.50) .61  17 .54 (.50) .32
25 .41 (.49) .63   *10 .74 (.44) .39  21 .37 (.48) .67    *II2 .84 (.37) .42  II9 .56 (.50) .61   12 .29 (.45) .31
26 .33 (.47) .61   19 .13 (.34) .36   24 .24 (.43) .65    II16 .78 (.41) .38  I8 .55 (.50) .56    25 .49 (.50) .31
23 .48 (.50) .61   *9 .77 (.42) .35   26 .18 (.38) .64    II12 .81 (.40) .34  I9 .43 (.50) .55    10 .16 (.36) .31
22 .44 (.50) .60   14 .25 (.43) .35   19 .36 (.48) .64    I16 .45 (.50) .34   II5 .62 (.49) .55   39 .73 (.44) .31
19 .52 (.50) .57   16 .27 (.44) .34   *17 .60 (.49) .63   I3 .79 (.41) .34    I14 .30 (.46) .55   29 .39 (.49) .28
29 .21 (.41) .55   8 .65 (.48) .34    25 .17 (.38) .62    I7 .74 (.44) .33    II14 .32 (.47) .53  8 .64 (.48) .27
31 .12 (.33) .54   17 .46 (.50) .34   16 .62 (.49) .61    I9 .73 (.45) .33    II13 .37 (.48) .52  22 .49 (.50) .25
30 .13 (.34) .52   12 .53 (.50) .33   28 .13 (.34) .61    I14 .65 (.48) .33   II15 .27 (.44) .52  6 .40 (.49) .25
32 .10 (.30) .51   *11 .79 (.40) .32  22 .23 (.42) .61    II8 .62 (.49) .32   II12 .36 (.48) .52  24 .43 (.50) .24
*4 .86 (.35) .30   II10 .49 (.50) .32   7 .82 (.39) .26   I6 .88 (.33) .32   20 .12 (.32) .25
*1 .89 (.31) .24   13 .35 (.48) .24

n = 689, * removed due to multicollinearity or a lack of variance of the item over the conditions.

The goal was to have a maximum of 30 items on the pretest. A backwards stepwise regression procedure was performed on the model with 60 items to find the items with the highest predictive value. Table 4 shows the coefficients and corresponding odds ratios of the predictors found with the backwards stepwise procedure. The final model had a moderate fit on the data, χ²(16) = 181.98, p < .001, Nagelkerke's R² = .56, AIC = 220.73. Compared to the model with 60 items, the final model had a lower AIC.


Table 4: Logistic Regression Coefficients (β), Standard Errors (SE) and Odds Ratios of Model 1 of the Ten Items With the Highest Item-Rest Correlation of Each Cognitive Test.

Item          β (SE)          p         LL     Odds ratio   UL
(Intercept)   -11.40 (1.46)   <.001***
concl19       0.98 (0.49)     .044*     1.07   2.67         7.35
concl32       0.95 (0.46)     .040*     1.04   2.58         6.40
serie18       1.42 (0.44)     .001**    1.77   4.14         10.25
serie14       1.13 (0.43)     .009**    1.34   3.10         7.41
serie16       1.31 (0.43)     .002**    1.63   3.70         8.73
comp21        1.44 (0.48)     .003**    1.70   4.23         11.23
comp26        -1.22 (0.66)    .064      0.08   0.30         1.02
comp27        1.50 (0.66)     .023*     1.27   4.49         17.28
verbanI16     1.24 (0.44)     .005**    1.51   3.45         8.44
verbanI3      1.18 (0.66)     .073      0.98   3.24         13.37
verbanI7      1.11 (0.65)     .089      0.93   3.04         12.56
hidfigII11    1.33 (0.43)     .002**    1.67   3.76         9.03
vocab39       1.02 (0.63)     .104      0.88   2.78         10.90
vocab29       0.98 (0.41)     .017*     1.21   2.67         6.09
vocab8        1.10 (0.53)     .037*     1.12   3.01         9.12
vocab22       -0.84 (0.46)    .065      0.17   0.43         1.04

n = 689, concl = conclusions, serie = series, comp = computations, verban = verbal analogies, hidfig = hidden figures, vocab = vocabulary, * p < .05, ** p < .01, *** p < .001. LL and UL are the lower and upper limits of the 95% CI for the odds ratio.

To inspect the performance of the final model, a few statistics were calculated for the test sample (see Table 5). The area under the ROC curve indicates that the accuracy of the model was good. The sensitivity and specificity statistics indicate a moderate performance. The cutoff for selection was determined by maximizing the sensitivity and specificity. Based on this model, 18% of the participants in the sample were incorrectly predicted to be in the selected condition and 18% in the non-selected condition. The corresponding φ indicates how well the model predicted compared to a random prediction; the statistic in Table 5 indicates that the model predicted better than a random prediction. Because sex was unequally distributed over the selection conditions, the performance statistics were calculated for males and females separately. Although the φ indicates a lower performance of the model for females compared to males, the other statistics do not show any disturbing differences.
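A sketch of these statistics with the pROC package, where final is the fitted model and test the test sample (names illustrative); pROC's "best" coordinates use Youden's index, which maximizes the sum of sensitivity and specificity:

```r
library(pROC)

p_hat   <- predict(final, newdata = test, type = "response")
roc_obj <- roc(test$selected, p_hat)
auc(roc_obj)  # area under the ROC curve

# Cutoff that maximizes sensitivity + specificity, with both statistics
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))
```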

Table 5: Area Under ROC Curve (AUC), Compared to Random (φ), Cutoff for Selected/Non-selected, Sensitivity and Specificity Measures for Model 1 Fitted on Test Sample.

AUC φ Cutoff Sensitivity Specificity

Overall .88 .44 .051 .82 .82

Males .85 .51 .051 .84 .75

Females .89 .36 .051 .79 .84

n = 690

The focus of the model construction lay on creating a highly predictive model, but to get an indication of the reliability and validity the intercorrelations of the predictors were investigated. For the reliability, Cronbach's α was calculated with the training sample. The reliability of the predictors of the model is poor, Cronbach's α = .65, 95% CI [.60, .69]. This was expected, because the intercorrelations of the predictors should be low for them to be good predictors. To investigate the construct validity of the predictors, a network was plotted of the significant (α = .05) tetrachoric correlations of the predictors. The network model in Figure 1 shows that predictors that originate from the same cognitive test have a higher intercorrelation than predictors from a different cognitive test. The tests 'computations', 'series' and 'conclusions' correlate more strongly with each other than with the other three tests. But even items that originate from these tests have their strongest correlations with other items of the original test. This indicates that the predictors measure different constructs and that the predictors are heterogeneous.

Figure 1: Network Model of Significant Tetrachoric Correlations for Predictors on Model 1, Node Size is Two Times the Relative Odds Ratio, Venn Diagrams Are the 95% CI per Cognitive Test, n = 689.

Model 2: Item-Rest Correlation Corrected for Time Limit

Model 1 had a moderate fit on the data; nevertheless, there is a possibility that this model does not predict well when it is used on data produced with the pretest. Some of the items were influenced by the time limit of the test they were part of. The tests in question are 'conclusions', 'series' and 'computations'. Items on these tests were excluded after the proportion correct of the items dropped below chance: for 'conclusions' this was after item 27 and for 'series' after item 18. For 'computations' a different approach was necessary because the answers were not multiple choice; a drop in proportion correct could be observed after item 19, and the items succeeding it were excluded. The recalculated reliabilities are shown in Table 6. All the tests are reliable except for 'vocabulary'.


Table 6: Number of Items and Cronbach's α with 95% CI for the Six Cognitive Tests Corrected for Time Limit.

Test               Items   95% CI LL   Cronbach's α   95% CI UL
Conclusions        27      .85         .87            .89
Series             18      .73         .76            .79
Computations       19      .79         .82            .84
Verbal analogies   40      .74         .77            .80
Hidden figures     32      .89         .90            .92
Vocabulary         40      .51         .56            .61

n = 689

Like model 1, this model was based on the ten items with the highest item-rest correlations of each test. With the recalculated item-rest correlations a different set of predictors was expected. The 60 items with the highest item-rest correlations, ten items of each test, were entered into a logistic regression model that predicted whether participants were in the selected or the non-selected condition. After entering the 60 items into the model, some items were replaced because of high standard errors of the beta coefficients. These high standard errors were caused by a lack of variance of the item over the conditions or by multicollinearity. Table 7 shows which items were excluded from the model and which items were entered in their place. This model had a better fit than a model without predictors, χ²(60) = 197.63, p < .001, Nagelkerke's R² = .60, AIC = 293.07.


Table 7: Item-Rest Correlation (rir), Mean (M) Correct and Standard Deviations (SD) of Items (i) Used for Model 2.

Each entry: i, M (SD), rir.

Conclusions        Series             Computations        Verbal analogies    Hidden figures      Vocabulary
24 .40 (.49) .67   *10 .74 (.44) .51  *17 .60 (.49) .63   *II6 .84 (.37) .43  II11 .45 (.50) .61  17 .54 (.50) .32
23 .48 (.50) .62   *11 .79 (.40) .48  16 .62 (.49) .62    *II2 .84 (.37) .42  II9 .56 (.50) .61   12 .29 (.45) .31
19 .52 (.50) .62   8 .65 (.48) .45    14 .61 (.49) .57    II16 .78 (.41) .38  I8 .55 (.50) .56    25 .49 (.50) .31
25 .41 (.49) .61   *9 .77 (.42) .45   18 .44 (.50) .57    II12 .81 (.40) .34  I9 .43 (.50) .55    10 .16 (.36) .31
22 .44 (.50) .59   *4 .86 (.35) .43   11 .79 (.41) .57    I16 .45 (.50) .34   II5 .62 (.49) .55   39 .73 (.44) .31
26 .33 (.47) .55   14 .25 (.43) .41   19 .36 (.48) .55    I3 .79 (.41) .34    I14 .30 (.46) .55   29 .39 (.49) .28
18 .64 (.48) .55   18 .30 (.46) .41   15 .33 (.47) .52    I7 .74 (.44) .33    II14 .32 (.47) .53  8 .64 (.48) .27
21 .37 (.48) .52   7 .82 (.39) .39    12 .80 (.40) .48    I9 .73 (.45) .33    II13 .37 (.48) .52  22 .49 (.50) .25
17 .67 (.47) .51   12 .53 (.50) .38   13 .66 (.47) .46    I14 .65 (.48) .33   II15 .27 (.44) .52  6 .40 (.49) .25
14 .42 (.49) .46   *5 .91 (.28) .36   9 .83 (.38) .40     II8 .62 (.49) .32   II12 .36 (.48) .52  24 .43 (.50) .24
*1 .89 (.31) .31   7 .65 (.48) .37   II10 .49 (.50) .32   16 .27 (.44) .30   I6 .88 (.33) .32
13 .35 (.48) .28   17 .46 (.50) .27   6 .89 (.31) .26   *2 .91 (.28) .25   3 .97 (.16) .24

n = 689, * removed due to multicollinearity or a lack of variance of the item over the conditions.

For model 2 the number of items also had to be reduced to a maximum of 30. The backwards stepwise regression procedure was used to achieve this. Table 8 shows the coefficients and corresponding odds ratios of the predictors found with the backwards stepwise procedure. The model had a moderate fit, χ²(17) = 175.36, p < .001, Nagelkerke's R² = .54, AIC = 229.35.


Table 8: Logistic Regression Coefficients (β), Standard Errors (SE) and Odds Ratios of Model 2 of the Ten Items With the Highest Item-Rest Correlation of Each Cognitive Test.

Item          β (SE)          p         LL     Odds ratio   UL
(Intercept)   -10.85 (1.48)   <.001***
concl18       1.12 (0.57)     .049*     1.07   3.05         10.03
concl21       0.79 (0.41)     .051      1.01   2.21         5.03
serie14       0.84 (0.44)     .054      0.99   2.32         5.55
serie18       1.12 (0.47)     .018*     1.24   3.06         7.96
serie16       1.23 (0.42)     .003**    1.53   3.42         7.93
serie13       0.89 (0.45)     .045*     1.03   2.44         5.99
serie17       0.72 (0.50)     .148      0.79   2.06         5.73
comp18        1.03 (0.44)     .020*     1.21   2.80         6.88
comp12        -1.06 (0.61)    .085      0.11   0.35         1.20
comp7         1.29 (0.60)     .033*     1.19   3.64         13.09
verbanI16     1.35 (0.44)     .002**    1.67   3.85         9.57
hidfigI8      0.76 (0.44)     .082      0.93   2.14         5.23
hidfigII13    0.98 (0.40)     .014*     1.23   2.66         5.94
vocab39       1.02 (0.64)     .112      0.86   2.78         11.29
vocab29       1.09 (0.42)     .009**    1.33   2.96         6.87
vocab8        1.16 (0.52)     .026*     1.20   3.19         9.47
vocab22       -0.80 (0.45)    .077      0.18   0.45         1.08

n = 689, concl = conclusions, serie = series, comp = computations, verban = verbal analogies, hidfig = hidden figures, vocab = vocabulary, * p < .05, ** p < .01, *** p < .001. LL and UL are the lower and upper limits of the 95% CI for the odds ratio.

A few performance statistics for model 2 were calculated based on the test sample (see Table 9). The area under the ROC curve indicates that the accuracy of the model is good. The sensitivity and specificity statistics indicate a moderate performance for the overall model. The cutoff was determined by maximizing the sensitivity and specificity. The corresponding φ indicated that the model predicts better than a random prediction. Based on this model, 21% of the participants in the sample were incorrectly predicted to be in the selected condition and 21% in the non-selected condition. Because sex was unequally distributed over the conditions, the performance statistics were also calculated for males and females separately. There is a difference in performance of the model between males and females: the comparison of φ indicated that the model predicts more poorly for females, and 33% of the females were incorrectly predicted to be in the non-selected condition, which is much higher than the 13% incorrect for males.

Table 9: Area Under ROC Curve (AUC), Compared to Random (φ), Cutoff for Selected/Non-selected, Sensitivity and Specificity Measures for Model 2 Fitted on Test Sample.

AUC φ Cutoff Sensitivity Specificity

Overall .87 .37 .042 .79 .79

Males .82 .43 .042 .87 .67

Females .87 .27 .042 .67 .83

n = 690

The reliability and validity of the predictors in model 2 were investigated. The reliability was calculated with the training sample. The reliability of the predictors in the model was poor, Cronbach's α = .64, 95% CI [.60, .68]. This was expected, because the intercorrelations of the predictors should be low for them to be good predictors. To investigate the construct validity of the predictors, a network was plotted of the significant (α = .05) tetrachoric correlations of the predictors. The network model in Figure 2 shows that predictors that originate from the same cognitive test have higher intercorrelations than predictors from a different cognitive test. The tests 'computations', 'series' and 'conclusions' correlate more strongly with each other than with the other three tests. These correlations seem to be weaker than with model 1, probably due to the correction for the time limit. But even items that originate from 'computations', 'series' and 'conclusions' have their strongest correlations with other items of the original test. This indicates that the predictors measure different constructs and that the predictors are heterogeneous.

Figure 2: Network Model of Significant Tetrachoric Correlations for Predictors on Model 2, Node Size is Two Times the Relative Odds Ratio, Venn Diagrams Are the 95% CI per Cognitive Test, n = 689.


Model Comparison

In order to compare the two models, the performance statistics from Tables 5 and 9 were investigated. This comparison shows that the AUC did not indicate much difference in accuracy between the models. However, the φ index indicated that model 1 outperformed model 2. Furthermore, the accuracy of model 2 seems to differ greatly between males and females. It was expected that model 1 would outperform model 2 because it included items which were affected by the time limit of the test, such as item 32 of 'conclusions' and all the items of 'computations'. These items can be good predictors because the speed with which a test is answered is a good indicator of how well a participant will perform.

The predictors of the models were compared as well. For 'conclusions', items 18 and 21 were predictors in model 2 but not in model 1; both items were not among the ten items with the highest item-rest correlations entered in model 1. Items 13, 14 and 17 of 'series' were extra predictors in model 2. The predictors selected from 'computations' were totally different between models 1 and 2: model 1 had items 21, 26 and 27 as predictors and model 2 had items 7, 12 and 18. It is interesting to see that only item I16 of 'verbal analogies' was selected as a predictor in model 2, whereas in model 1 items I3 and I7 were selected as well. Since the item-rest correlations of this test were the same as in model 1, the same items were entered in model 2; the difference in the selected predictors between the models is probably due to the influence of the new items of other tests. This situation occurred for 'hidden figures' as well: two different items were selected as predictors in model 2. The same items of 'vocabulary' were selected for both models. For this data set model 1 is the preferred model, but speed will not be taken into account on the pretest. Therefore model 2 is the preferred model to create the pretest with.

Discussion

The method used for creating a pretest for an intelligence test had a moderate performance. It was possible to create a predictive pretest by selecting items on their item-rest correlations and with backwards stepwise regression. A moderately predictive pretest can be made even when correcting for time limits on the cognitive tests; the pretest is then less accurate, but more generalizable. Upon further analysis of the intercorrelations it appeared that the items that originate from the same cognitive test have the highest correlations, which indicates construct validity. During the use of the proposed method a few problems were encountered. These problems and their implications will be discussed based on the method step in which they occurred.

The intelligence test used had a complicated measurement model. The IST measured nine different facets, of which only four were represented by the cognitive tests. The facets involving memory were not represented and common knowledge was underrepresented. Because the cognitive tests used to create the item pool did not fully represent the intelligence test, the model is less accurate in predicting the criterion: the IST cutoff score.

One of the problems in this study was that the sample used differed from the population that the pretest is designed for. The proportion of participants scoring above the cutoff score was small. This resulted in a larger influence of some participants on the regression model parameters. Also, some items had to be excluded due to a lack of variance. Because of the small proportion of participants selected, the distribution of males and females was not equal over the conditions. This resulted in a difference in accuracy between the sexes, especially for the method that corrected for the time limit. This problem can be overcome by using different cutoff scores for males and females on the regression predictions.

Selection of items by their item-rest correlation worked well. However, it is debatable whether the ten items with the highest item-rest correlations is the right number of items per test. Ten items may be too many and may be the reason multicollinearity occurred. More research on the right number of items is suggested.

The reliability of the pretest is low because the stepwise regression procedure causes the intercorrelations of the predictors to be small. Cronbach's α is based on these intercorrelations and therefore indicates low reliability. This is a trade-off that had to be made with the used method; it was accepted because the most important purpose of the pretest is to predict accurately. Another disadvantage of the stepwise procedure is that it produces a pretest with unequal numbers of items per measured construct, which can give test takers the feeling that the test is unfinished. To give the pretest a professional feel, it is suggested to add easy items of tests from which fewer than the preferred number of predictors were used in the model. An advantage of adding these items is that they can be used as buffer items to replace items in the proposed model.

The method described for creating a pretest for an intelligence test worked rather well. But it is important to test the pretest on the population it is intended for and to recalibrate the model as necessary. When selecting people based on their intelligence the use of the pretest can reduce the


Literature

Austin, P., & Tu, J. (2004). Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. Journal of Clinical Epidemiology, 57, 1138-1146.

Browne, M. (2000). Cross-validation methods. Journal of Mathematical Psychology, 44, 108-132.

Burisch, M. (1984). Approaches to personality inventory construction: A comparison of merits. American Psychologist, 39(3), 214-227.

Cox, B., McIntosh, K., Reason, R., & Terenzini, P. (2014). Working with missing data in higher education research: A primer and real-world example. The Review of Higher Education, 37(3), 377-402.

Elshout, J. J. (1976). Karakteristieke moeilijkheden in het denken [Characteristic difficulties in thinking] (Doctoral dissertation). Amsterdam: University of Amsterdam.

Epskamp, S., Cramer, A., Waldorp, L., Schmittmann, V., & Borsboom, D. (2012). qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48(4), 1-18.

Field, A. P., Miles, J., & Field, Z. (2012). Discovering statistics using R. London: Sage.

Hausknecht, J., Halpert, J., Di Paolo, N., & Moriarty Gerrard, M. (2007). Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology, 92(2), 373-385.

Hays, J., Reas, D., & Shaw, J. (2002). Concurrent validity of the Wechsler Abbreviated Scale of Intelligence and the Kaufman Brief Intelligence Test among psychiatric inpatients. Psychological Reports, 90, 355-359.

Mellenbergh, G. (2011). A conceptual introduction to psychometrics: Development, analysis and application of psychological and educational tests. The Hague: Eleven International Publishing.

Mumford, M., & Owens, W. (1987). Methodology review: Principles, procedures, and findings in the application of background data measures. Applied Psychological Measurement, 11(1), 1-31.

Muthén, B., & Hofacker, C. (1988). Testing the assumptions underlying tetrachoric correlations. Psychometrika, 53, 563-577.

Oosterveld, P. (1996). Questionnaire design methods. Nijmegen: Berkhout.

Vorst, H. C. M., van Osch, I., & Muradin, R. (2010). Intelligentie structuur test: handleiding met
