
Prediction with test battery data using Exploratory Factor Analysis

Jona Vilders
Leiden University

Master's Thesis Psychology, Methodology and Statistics Unit, Institute of Psychology
Faculty of Social and Behavioral Sciences, Leiden University
Date: 1-8-2019
Student number: s1485997
Supervisor: Dr. Henk Kelderman

Abstract

How to best predict an outcome using test battery data remains an open question. The standard procedure is to create sum-scores for each test in the test battery and fit a linear model using these, but multiple objections can be raised against this approach. Prior work has found that a CFA-based prediction method can offer superior predictive performance, but that work was limited to one- and two-factor models. I generalize the CFA approach from prior work to a multi-factorial context by using EFA instead. This EFA-based method is compared with sum-score prediction, ridge regression, and partial least squares regression in a large simulation study, wherein different kinds of multi-factorial data are simulated. Each of the four prediction techniques is trained on a subset of the data, called the training-set. Using these four models, predictions are made for another subset of the data, called the validation-set or test-set. By doing this, an estimate of the accuracy of the predictions is obtained on new data that was not used in fitting the models. Results show the proposed EFA-based technique offering better predictive performance than sum-score regression and ridge regression, but doing worse than partial least squares regression. Future research directions based on supervised latent factor models are suggested.

Keywords: exploratory factor analysis, test battery, simulation study, predictive modelling, regularization, shrinkage.


Contents

Introduction
Method
    SEM
    Adaptation of prior work
    Simulation design
        True model input parameters
            Fixed input parameters
            Varied input parameters
        Calculation of loadings in presence of cross-loadings
    Prediction techniques
        Sum-score linear model
        Ridge regression
        Partial Least Squares
        EFA-based prediction rule
    Software and packages used
    Results strategy
        Analysis
        Visualization
    Expectations
Results
    Main simulation
Discussion
    Expectations and results of main experiment
    Answering the research question
    Limitations
Conclusion
References
Appendix A
Appendix B
Appendix C
Appendix D

Introduction

Psychological assessment is often performed using a set of psychological tests, dubbed a test battery. These test batteries can be used to inform about a general domain, such as a person's social competency, but are also used to predict variables of interest, such as a potential employee's level of fit within a future working environment. In the case of prediction, what is desired is a high level of predictive validity: Given that decisions of varying importance are made based on the model, the predictions of that model should be as accurate as possible. To mimic the real-life situation of having to make predictions for a dependent variable given new data, the goal is not just to most closely fit a model to the currently available data, but instead to find a model that generalizes well to new data. This is commonly operationalized by striving to minimize the Mean Squared Error (MSE) in a different sample than the one the prediction model was trained on: For $n$ being the sample size in this different sample, $\hat{y}_i$ being the predicted outcome value for individual $i$, and $y_i$ being the true outcome value for individual $i$, this MSE is given by

$$MSE = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}. \quad (1)$$
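To make this evaluation procedure concrete, here is a minimal R sketch of the train/test logic; the data frames train and test are hypothetical stand-ins, each containing an outcome column y and the same predictor columns:

    # Minimal sketch of out-of-sample evaluation; 'train' and 'test' are
    # hypothetical data frames with outcome column 'y' and predictor columns.
    fit  <- lm(y ~ ., data = train)        # fit only on the training-set
    pred <- predict(fit, newdata = test)   # predict for the held-out test-set
    mse  <- mean((test$y - pred)^2)        # out-of-sample MSE of Equation (1)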

By training the model on one sample and making predictions for a different sample, a situation is avoided where a model fits too closely to the characteristics of the specific sample used to train it, and as a result appears to give very good predictions but generalizes very poorly to other samples (called overfitting). By using the MSE from a new sample instead, favored models are those that generalize well to new data, which more closely mirrors real-life application of predictive models. This type of MSE is referred to as out-of-sample MSE. From the literature on statistical prediction, it is known that models that aim to optimize this out-of-sample MSE are rarely equivalent to models that seek purely to explain relationships in the sample (Shmueli, 2010), called explanatory models; while explanatory modelling stresses unbiased parameter estimates so as to enable testing of precise hypotheses, predictive modelling attempts only to optimize predictions, and parameter estimates are merely a tool to achieve that. Explanatory models seek to minimize bias, while predictive models minimize the sum of bias and estimation variance (Hastie, Tibshirani, & Friedman, 2008, p. 223); when it is beneficial to the accuracy of predictions, some amount of bias is accommodated in predictive models, while explanatory models do not allow this. This distinction is relevant here because the most common way test battery data is handled is still through explanatory modelling, even when the goal is actually prediction of some outcome variable.

A common prediction technique for this type of data is to use sum-scores from each test within the test battery and to then fit a linear model with these sum-scores as predictors of the outcome variable. A drawback of this approach is that taking the simple sum of item scores within a test assumes that all items contribute equally to the construct that the specific psychological test is intended to measure, though this is often not very detrimental to predictive performance (Einhorn & Hogarth, 1975). A further assumption of the linear regression approach is that the observed scores on the items are all measured without error, which can bias estimates (Stefanski, 1985). This assumption is unlikely to hold within psychological testing, as it often concerns measuring indicators of unobservable mental states, which is inherently an indirect form of measurement. Regardless, the linear


model of sum-scores remains the standard way of handling test battery data for prediction, partly due to the simplicity of implementing this approach.

In those cases where the model's purpose is to aid in predicting an outcome, I believe a predictive modelling approach would work better. That idea itself is not unique: Techniques such as ridge and LASSO regression have long been shown to offer generally superior out-of-sample predictions compared to purely explanatory models (Hoerl & Kennard, 1970; Tibshirani, 1996). What this thesis focuses on instead is a proposed technique that is based on exploratory factor analysis (EFA).

This proposed EFA-based technique is an evolution of prior work examining predictive validity in one- and two-factor models (de Rooij et al., 2018; de Bruin, 2018). In both of those experiments a CFA-based prediction technique was used, which was generalized to an EFA-based prediction technique for the current experiment, as discussed further in the method section. Both prior experiments found promising situations in which the CFA-based technique had generally superior predictive performance compared to sum-score prediction. I am hoping to replicate those findings in the case of multi-factorial data through a simulation study. Besides the EFA-based technique and sum-score prediction, two more prediction techniques are used in the simulation: Ridge regression and Partial Least Squares (PLS).

Ridge regression was included as it is a more sophisticated method than sum-score prediction and it is often used. Unlike the sum-score linear model, it does not lump variables together within each test, and it includes shrinkage through the $\ell_2$ norm (Hoerl & Kennard, 1970). In the context of the experiment, it is meant as an "upgrade" to sum-score regression, with fewer of its drawbacks.

PLS can be seen as an evolution of Principal Component Regression (PCR): While PCR finds components that maximize the variance explained within a matrix of variables X, with no explicit regard for how well those components then explain the matrix of dependent variables Y (or vector y, if there is only one dependent variable), PLS directly incorporates the dependent variable(s) in creating the components, so that predictions of the dependent variable(s) are more accurate (Abdi, 2010). PLS was selected as a technique that usually performs very similarly to ridge regression (Frank & Friedman, 1993), but one that generally performs better as the number of variables grows (Firinguetti, Kibria, & Araya, 2017). As the number of variables would be varied in the experiment, I wanted to include a technique that would benefit from a high number of variables, to see how the EFA-based method compared under circumstances favourable to PLS (though situations where the number of variables p is greater than the number of observations n will not appear in the experiment).

The rationale behind including more prediction techniques was as follows: If the argument made earlier is that sum-score regression is not a good fit for this type of data, then outperforming only the (assumed) not-so-good-fitting technique is not much of an achievement and would by itself not make a strong case that the proposed EFA-based technique should be used. Instead, if it can be shown that the proposed EFA-based technique also matches or outperforms modern concurrent techniques for this type of data, then there is a stronger case for switching to the EFA-based technique. Sum-score regression is included as it is the legacy procedure, but ridge regression and PLS are meant as true competitors for the EFA method.


They do not require the researcher to specify a structure for the data a priori, unlike sum-score regression. This is important because sum-score regression would almost certainly do worse if this structure were misspecified, a risk that the other three techniques do not have. With this in mind, if any of these three techniques manages only to achieve a tie with the sum-score regression model, it should still be considered a win for that technique. All four prediction techniques mentioned will be described in more detail in the method section.

Formally, the research question is: "Can the EFA-based technique offer meaningful improvement in the prediction of multi-factorial data compared to PLS, ridge regression and sum-score regression, and if so, under what circumstances?"

The method section will start with a brief primer on Structural Equation Modelling (SEM). It is useful to see the simulation design from this perspective, as it makes the design factors and how they were implemented more intuitive to understand. After that, the specifics of the simulation design are discussed, as well as the constraints used. These constraints allow for simpler unique solutions to the calculation of factor loadings, which is discussed in its own sub-section. The prediction techniques are presented in more detail as well, and finally, the results strategy is laid out. Following that are the results from the simulation as well as a visualization of those results. The research question is answered through this visualization. Finally, the discussion summarizes the findings from the experiment and presents possible avenues for future research.

A full list of all symbols used in this thesis is provided in Appendix A.

Method

SEM

SEM is a type of causal modelling that involves fitting a model with a specified structure of observed, or observed and unobserved, variables to explain the relationships present in the data. A SEM model attempts to minimize the difference between the observed covariance matrix S in the sample and the covariance matrix implied by the specified model, $\hat{\Sigma}$. It should be noted that SEM is not a single technique; for example, directed relationships between variables through linear regression are a part of SEM models, and therefore generalized linear models such as the t-test, logistic regression and ANOVA are possible as SEM models too. Parameter estimation within SEM is conventionally done using maximum likelihood (ML) estimation. When the assumptions of the linear model are met, the resulting parameter estimates are identical to those found via least-squares estimation in a standard regression model (Fox, 2008).

As an example, a two-factor SEM model with standardized latent variables is shown in Figure 1. In this figure, observed variables are within squares, latent variables within ovals, and latent error terms are in circles. Single-headed arrows are causal, double-headed arrows on the same node represent variance, and double-headed arrows between nodes represent covariance.

Figure 1. Two-factor SEM model.

The model in Figure 1 has seven observed variables, referred to as indicators from here on to follow standard terminology more closely. One of these indicators is the outcome variable Y. This model contains two latent variables, θ1 and θ2, which are normally distributed with mean 0 and variance 1. Each indicator in the model loads on one latent variable and Y on two, represented by the arrows from the latent variables to each indicator. The coefficient of this relationship is the loading $\lambda_{ij}$ for indicator j on factor i, and $\beta_i$ for the loading of Y on factor i. The latent variables in this model have a correlation $\rho_{12}$.

The total variance of any indicator is the sum of that indicator’s communality and uniqueness; the communality of an observed variable is the proportion of that variable’s variance that is explained by the model, while the uniqueness is the proportion of that variable’s variance that is unique to itself and thus unexplained. Following the tracing rules originally by Sewall Wright (Li, 1991; Wolfle, 1999), the communality for an indicator in this specific model can be found by enumerating all paths starting from an indicator back to itself but following a set of rules:


• Going up single-headed arrows is allowed, but once you go down one, you cannot go up single-headed arrows anymore.

• The same variable cannot be passed more than once per path.

• Only one double-headed arrow may be used in the path.

Multiplying all coefficients in the path and summing the paths will give the communality. For example, the communality of indicator X1 in Figure 1 is $\lambda_{11}^2$; the only path following the rules laid out above is from X1 to θ1 and back, and because the latent variables are standardized it is $\lambda_{11} \cdot 1 \cdot \lambda_{11} = \lambda_{11}^2$. As a result the value of $\lambda_{11}$ is simply the square root of the communality. For Y it is more complex because more paths are allowed: The communality of Y is equal to $\beta_1^2 + \beta_2^2 + 2\beta_1\rho_{12}\beta_2$, which is the sum of four paths: to θ1 and back, to θ2 and back, to θ1 through θ2 and back, and to θ2 through θ1 and back. These last two paths are equivalent due to the commutative property. The uniqueness is equal to the variance of the latent error; in the case of indicator X1 in Figure 1 it is $\sigma_{\varepsilon_1}^2$.

A SEM model is described by four model matrices: for p being the number of indicators and q being the number of latent variables, Λ (Lambda) is the p × q matrix of factor loadings that define the relations between observed variables and latent factors, B (Beta) is the q × q matrix of regression coefficients between latent factors, Φ (Phi) is the q × q variance-covariance matrix of latent factors, and Ψ (Psi) is the p × p variance-covariance matrix of errors of the observed variables. To illustrate, these four matrices for the SEM model in Figure 1 are shown in Figure 2 below.

$$\Lambda = \begin{pmatrix} \beta_1 & \beta_2 \\ \lambda_{11} & 0 \\ \lambda_{12} & 0 \\ \lambda_{13} & 0 \\ 0 & \lambda_{24} \\ 0 & \lambda_{25} \\ 0 & \lambda_{26} \end{pmatrix}, \quad B = \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix}, \quad \Phi = \begin{pmatrix} 1 & \rho_{12} \\ \rho_{12} & 1 \end{pmatrix}, \quad \Psi = \operatorname{diag}\left(\psi_{YY}, \psi_{11}, \psi_{22}, \psi_{33}, \psi_{44}, \psi_{55}, \psi_{66}\right)$$

The rows of Λ correspond to Y, X1, ..., X6 and its columns to θ1 and θ2; the rows and columns of B and Φ correspond to θ1 and θ2.

Figure 2. SEM model matrices for Figure 1. The Phi matrix contains 1's and correlations because the latent variables are standardized.
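To make these matrices concrete, the following R sketch builds Λ, Φ, and Ψ for a Figure 1-style model (the loading values are made up for illustration, not taken from the thesis) and shows that the diagonal of ΛΦΛᵀ reproduces the tracing-rule communalities:

    # Illustrative values for a Figure 1-style model (made up, not from the thesis).
    Lambda <- rbind(Y  = c(.4, .4),            # loadings of Y on theta1 and theta2
                    X1 = c(.7, 0), X2 = c(.6, 0), X3 = c(.8, 0),
                    X4 = c(0, .7), X5 = c(0, .6), X6 = c(0, .8))
    Phi <- matrix(c(1, .3,
                    .3, 1), nrow = 2)          # standardized factors, correlation .3
    communality <- diag(Lambda %*% Phi %*% t(Lambda))  # tracing rules in matrix form
    Psi <- diag(1 - communality)               # uniqueness = 1 - communality
    Sigma <- Lambda %*% Phi %*% t(Lambda) + Psi        # model-implied covariance matrix

For Y, for example, this reproduces the tracing-rule result $\beta_1^2 + \beta_2^2 + 2\beta_1\rho_{12}\beta_2 = .16 + .16 + .096 = .416$.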

Adaptation of prior work

Prior work (de Rooij et al., 2018; de Bruin, 2018) used a confirmatory factor analysis (CFA) framework, meaning the paths to be estimated were specified manually and any paths not specified were not estimated and instead fixed to 0. In the one-factor case, where all indicators were modelled as loading on one latent variable, that does not make a difference. In the two-factor case however, decisions need to be made beforehand: Assuming two latent factors, do all indicators load on all factors or is it specified what indicators go with what factors? In such a model with two latent factors, is it reasonable to constrain the indicators belonging to one factor to never load on the other factor? This could be acceptable if the constructs seem uncorrelated, but that is ultimately a subjective decision.


In the multi-factorial case it could be better to allow the indicators of one test to also load on the constructs represented by other tests. These loadings are called cross-loadings, as they are loadings on factors besides the (presumed) primary factor that an indicator loads on. If there are two tests that measure correlated concepts, such as “kindness” and “empathy” for example, constraining the indicators of both tests to load on only one of the latent factors is unrealistic due to the overlap in domain. This however is exactly what the legacy procedure, the linear model of item sum scores of each test, would do.

While CFA can have cross-loadings as well if they are specified, using a CFA approach where indicators never have cross-loadings, as applied by de Bruin (2018), only replicates the flaw of the legacy procedure. In situations where it is not obvious whether or how constructs might be correlated, it seems better to adopt a modelling framework that allows for the possibility of all cross-loadings, rather than specifying them as non-existent a priori or having to specify them with a risk of misspecification. That is why the previous CFA method was adapted to an EFA framework, which freely explores all possible cross-loadings.

As mentioned before, SEM is a larger framework and CFA and EFA are both possible within it; EFA is referred to as "E/CFA" (Exploratory/Confirmatory Factor Analysis) when used within a SEM framework (Brown, 2015, p. 168). E/CFA can be carried out using the lavaan package within R (Rosseel, 2012), which was used by de Bruin (2018) and de Rooij et al. (2018). It is however computationally simpler to carry out EFA using dedicated packages for factor analysis. The resulting fits of these methods are identical in terms of chi-square and degrees of freedom when both are estimated using maximum likelihood (Brown, 2015). As the simulation of the data was projected to take up a sizable amount of time and it was simpler to implement, this experiment does not formally use a SEM framework, but uses EFA instead.

Unlike how it might sound, this does not fundamentally change the method behind the proposed prediction rule: Both CFA and EFA methods attempt to fit the covariance structure in the data; CFA only has more constraints. This model-implied covariance matrix is ultimately transformed into coefficients for the variables, with which predictions of the outcome variable can be made. How the methods arrive at this model-implied covariance matrix might differ, but the steps after remain identical. For the EFA method these steps are discussed in more detail in the prediction techniques section.

Simulation design

The basis of every simulated dataset is a true model that is specified according to input parameters that are varied. For each model in the experiment, each latent variable has an equal number of indicators that have their largest loading on that latent variable. Put otherwise: Each indicator in the model (excluding Y ) loads most strongly on one factor. This strongest loading is referred to as that indicator’s main-loading. It is possible for an indicator to load on multiple latent variables instead of just one, in which case all loadings besides the main-loading are referred to as cross-loadings. These terms and how they are handled are discussed further in the next section, when it will be described what input parameters are varied as part of the design and over what values.

Each combination of input parameters is used to create a true model. This true model then generates a dataset, which is split into a training-set and a test-set. The test-set is kept constant at size n = 1,000 so that the MSE can be fairly compared across conditions, but the


size of the training-set is varied. In all datasets the scores on the latent variables are drawn from a normal distribution with mean 0 and variance 1. The error terms of indicators in all datasets are also drawn from a normal distribution, with mean 0 and variance 1 minus the indicator’s communality. No model has correlated errors or regression coefficients between latent variables. For this experiment there are 100 repetitions per combination of input parameters. The prediction techniques are then applied to each dataset and the MSE of the predictions of each technique on the test-set is saved, providing 100 estimates of the MSE of each technique for each combination of input parameters.

Note that the procedure just described is not the same as randomly drawing 100 datasets from one true model, because the 100 true models for each combination of design factors are not necessarily identical; some of the input parameters have a random effect on the true model.
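As an illustration of this data-generating step, here is a condensed R sketch, assuming a loading matrix Lambda and a latent correlation matrix Phi have already been built from the input parameters (variable names are mine, not the simulation code's):

    library(MASS)                                   # mvrnorm for the latent scores
    n_train <- 500; n_test <- 1000
    n <- n_train + n_test
    theta <- mvrnorm(n, mu = rep(0, ncol(Lambda)), Sigma = Phi)       # latent scores
    communality <- diag(Lambda %*% Phi %*% t(Lambda))
    eps <- sapply(1 - communality, function(v) rnorm(n, 0, sqrt(v)))  # indicator errors
    dat <- theta %*% t(Lambda) + eps               # observed scores, incl. outcome Y
    train <- dat[seq_len(n_train), ]               # training-set
    test  <- dat[-seq_len(n_train), ]              # test-set, size 1,000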

True model input parameters

In choosing the values for the input parameters I stayed close to those values used by de Bruin (2018) and de Rooij et al. (2018) wherever possible, so as to have this experiment be as close to a continuation as possible.

Fixed input parameters. Some of the input parameters were held constant for the experiment, either because they were not expected to be relevant or due to time constraints for the simulation. These fixed input parameters are discussed below:

• Number of latent factors: The number of latent factors in the true model. For the current experiment this was fixed at 6. It is not varied because doing so would multiply the simulation time of the experiment beyond what was possible within the time available. A value of 6 was also chosen because I wanted a value greater than 3, to distinguish it from prior work on one- and two-factor models (de Bruin, 2018; de Rooij et al., 2018) and put it squarely in multi-factorial territory.

• Structural strength: This governs the variance in loadings of the outcome variable Y on each latent factor. For a value of 0, all loadings of Y on each latent factor are equal. For values greater than 0 these loadings are drawn from a uniform distribution, with the bounds being set further away from each other for higher values of structural strength. In Figure 1 this input parameter corresponds to the relative size of $\beta_1^2$ and $\beta_2^2$. In the current experiment it was fixed at 0, as it was not expected to have any meaningful discriminating ability between the predictive performance of the techniques used. Because it is not varied in the experiment, this input parameter is not explained further here; more explanation can be found in the annotated code provided alongside this thesis.

• Maximum cross-loadings: This sets the maximum number of cross-loadings in the true model. Cross-loadings were briefly discussed under simulation design; they are the loadings of an indicator on latent variables other than the latent variable it loads most strongly on. Using Figure 2 as an example: In the Λ matrix, variable X1 only has a loading on the first latent variable, θ1. This loading, $\lambda_{11}$, is X1's main-loading; an additional loading on the second latent variable, $\lambda_{21}$, would be X1's cross-loading, and the condition $|\lambda_{21}| < |\lambda_{11}|$ would apply, as the main-loading is always greater than any cross-loading. Hypothetically, if there was a third column in the Λ matrix in Figure 2, X1 could also have a cross-loading on that third factor, which would be called $\lambda_{31}$. So, the maximum number of cross-loadings puts a cap on how many other latent factors each indicator can load on, besides the one it has its main-loading on. In the current experiment this was fixed to 1, so each indicator can at most load on two latent factors. This was fixed because the tests included in test batteries are generally chosen to measure different facets of a larger construct, so while the latent variables might correlate, individual indicators should not be expected to load on all latent variables. Fixing it to a constant value also helped reduce the time needed for simulating the data.

• Size of test-set: As mentioned before, the size of the test-set was kept constant at n = 1,000 to provide a fair comparison across conditions for the prediction techniques.

• Number of folds: As part of the EFA-based prediction technique a k-fold cross-validation procedure is used to select optimal hyperparameters. This will be explained in greater detail in the section about prediction techniques. This was fixed at k = 10 as it is a conventional number of cross-validation folds.

• Number of repetitions: This is the number of times a true model and subsequent dataset are generated using a combination of input parameters. This was briefly touched on under simulation design; it was fixed at 100 for the current experiment.

Varied input parameters. These input parameters were varied in the experiment:

• Indicator-factor ratio: This decides how many indicators there are in the true model as a multiple of the number of factors. Values of 3, 5, and 7 were used. Since the number of factors was held constant in the current experiment, this input parameter can also be construed as either 18, 30, or 42 indicators (excluding outcome variable Y) being present in the true model.

• Indicator communality: As discussed in the section on SEM, the communality is the proportion of the indicator’s variance that is explained by the model. This was varied with values of .3, .5, and .8. All indicators have the same communality except Y , which has a separate input parameter for its communality.

• Outcome communality: The outcome communality is the communality of the outcome variable Y . It was varied over values of .2, .3, and .4, to keep parity with prior work (de Bruin, 2018; de Rooij et al., 2018).

• Factor correlations: This argument decides whether and how strongly latent variables in the true model correlate. This was varied over values of 0, .1, .3, and .5, where 0 indicates orthogonal factors and the higher values are the conventional thresholds of the Pearson correlation for a small, medium, and large correlation. In this experiment correlations were uniform, meaning all off-diagonal elements of the Φ matrix were equal; this also helped reduce the projected simulation time.


• Cross-loading proportion: This input parameter governs the proportion of potential cross-loadings that are made cross-loadings in any true model. This was varied using values of 0, .3, .5, and .8. A value of 0 means there are no cross-loadings in the model, while increasing values cause the Λ matrix to become less sparse as more cross-loadings are created. More specifically, given a Λ matrix of dimensions p × q, the maximum number of potential cross-loadings is (p − 1) × (q − 1): (p − 1) because one row of Λ is for the outcome variable Y, which does not have a concept of a main-loading and instead loads equally on all factors, and (q − 1) because one element in each row of Λ, except that for Y, represents the indicator's main-loading and is excluded from possibly becoming a cross-loading. The maximum cross-loadings argument discussed before places a restriction on the number of elements in Λ that can be non-zero; if the maximum number of cross-loadings is 1, as in this experiment, at most (p − 1) × 1 loadings can be added: one for each indicator other than Y, assuming the cross-loading proportion was 1. If the cross-loading proportion is .3 and the maximum cross-loadings is 1, then there are .3 ∗ ((p − 1) × 1) cross-loadings in the true model, rounded down to the nearest integer. These cross-loadings are assigned randomly, so two true models created with the same input parameters are not necessarily identical.

• Factor-purity: This is the ratio of the squared main-loading to the sum of squared cross-loadings for any indicator that has cross-loadings. The factor-purity is a constant value; it serves as a constraint to make loading calculation simpler, which is discussed in a later section. Because it is used as a constant, u is never given a subscript in any equation in this thesis. In describing the concept of cross-loadings for the maximum cross-loadings input argument, it was mentioned that if variable X1 in Figure 2 had an additional loading on the second latent variable, that would be a cross-loading, and that the condition $|\lambda_{21}| < |\lambda_{11}|$ would hold. Expanding on that, the general form is: For $\lambda_{ij}$ being the loading of indicator j on factor i, m being the latent factor that an indicator has its main-loading on (it is therefore always true that 1 ≤ m ≤ q), indicator j having at least one cross-loading, and u being the factor-purity that is uniform across indicators, u is equal to

$$u = \frac{\lambda_{mj}^2}{\sum_{\substack{i=1 \\ i \neq m}}^{q} \lambda_{ij}^2}. \quad (2)$$

Under the experiment's constraint that all cross-loadings of an indicator are equal, for $b_j$ being the number of cross-loadings of indicator j, and $d_j$ being the value of each cross-loading of indicator j, this can be rewritten as

$$u = \frac{\lambda_{mj}^2}{b_j \, d_j^2}. \quad (3)$$

This means that for a higher factor-purity, more of an indicator's communality is explained by the main latent variable it loads on, as opposed to its cross-loadings. It is varied over values of 1.5, 3, and 4.5. The lower bound of 1.5 was chosen because it is still above 1. This is important because for a factor-purity equal to or lower than 1, the sum of squared cross-loadings matches or exceeds the squared main-loading. In the experiment, because only one cross-loading is possible (the maximum number of cross-loadings was fixed to 1), that would mean the single squared cross-loading would on its own be equal to or bigger than the squared main-loading. If that is the case, the main-loading is no more "main" than the cross-loading. Furthermore, it would add unwanted variability to the results, since for some true models the number of indicators loading most strongly on each latent variable would then not be equal across factors, as cross-loadings are determined randomly. The steps to 3 and 4.5 were chosen arbitrarily.

• Size of training-set: The size of the training-set was varied using values of n = 200, n = 500, and n = 1,000.

Table 1 presents the discussed input parameters and their value or range of values, as well as a short description. The experiment is thus a 3 × 3 × 3 × 4 × 4 × 3 × 3 full-factorial design (3,888 total combinations) with 100 replications per combination. Lastly, the R code written for the simulation accommodates more than just the values mentioned in this section for the input parameters. If there is any interest in using the code, it is included with this thesis along with a document that describes the code in greater detail.

Table 1
Summary of simulation input arguments

Input argument            Value(s)           Description
Number of factors         6                  Number of latent variables
Structural strength       0                  Relative proportions of the squared loadings of Y
Maximum cross-loadings    1                  Maximum number of latent variable loadings any indicator may have besides its main-loading
Size of test-set          1,000              Number of simulated cases in the test-set
Number of folds           10                 Number of folds used in the cross-validation procedure for the EFA-based prediction technique
Number of repetitions     100                Number of times each combination of input parameters is used to create a model and subsequent dataset
Indicator-factor ratio    3, 5, 7            Ratio of the number of indicators to the number of factors
Indicator communality     .3, .5, .8         Communality of indicators, excluding Y
Outcome communality       .2, .3, .4         Communality of outcome variable Y
Factor correlations       0, .1, .3, .5      Uniform correlation strength between latent variables
Cross-loading proportion  0, .3, .5, .8      Proportion of potential cross-loadings that are made cross-loadings
Factor-purity             1.5, 3, 4.5        Ratio of squared main-loading to sum of squared cross-loadings
Size of training-set      200, 500, 1,000    Number of simulated cases in the training-set
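The full-factorial grid of varied input parameters can be generated compactly in R; a sketch with illustrative parameter names (not the names used in the simulation code):

    # All 3,888 combinations of the varied input parameters.
    design <- expand.grid(indicator_factor_ratio = c(3, 5, 7),
                          indicator_communality  = c(.3, .5, .8),
                          outcome_communality    = c(.2, .3, .4),
                          factor_correlation     = c(0, .1, .3, .5),
                          crossloading_prop      = c(0, .3, .5, .8),
                          factor_purity          = c(1.5, 3, 4.5),
                          n_train                = c(200, 500, 1000))
    nrow(design)  # 3888; each combination is replicated 100 times in the simulation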


Calculation of loadings in presence of cross-loadings

As explained in the short introduction to SEM models, a variable j that loads on only one factor is very easy to handle: If the communality of variable j is $c_j$, then the loading on that one latent factor is simply $\sqrt{c_j}$. However, in a two-factor model, if that same variable j has a cross-loading and the two latent variables have a correlation, it is more complex. For ρ being the correlation between the two latent variables, the communality $c_j$ is now

$$c_j = \lambda_{1j}^2 + \lambda_{2j}^2 + 2\,\lambda_{1j}\,\rho\,\lambda_{2j}. \quad (4)$$

It is easily seen that it only becomes more complex if more cross-loadings are present. For now, I will focus on the case of a two-factor model and generalize to an any-factorial model afterwards.

In (4) both $c_j$ and ρ are known, but that leaves two unknown values in the equation, $\lambda_{1j}^2$ and $\lambda_{2j}^2$, which means there is not yet a unique solution. One way of solving this is by expressing either $\lambda_{1j}^2$ or $\lambda_{2j}^2$ as a multiple of the other. Observant readers will notice this is exactly what the factor-purity input parameter, discussed in the input parameter section, does. Within this two-factor example: Given the uniform factor-purity u, the correlation between the two factors ρ, $\lambda_{mj}$ being the main-loading for variable j, and variable j not being the outcome variable, the formula for the communality can be rewritten as

$$c_j = \lambda_{mj}^2 + \frac{1}{u}\,\lambda_{mj}^2 + 2\,\lambda_{mj}\,\rho\,\sqrt{\frac{1}{u}}\,\lambda_{mj} \quad (5)$$
$$= \left(1 + \frac{1}{u}\right)\lambda_{mj}^2 + 2\,\rho\,\sqrt{\frac{1}{u}}\,\lambda_{mj}^2 \quad (6)$$
$$= \left(1 + \frac{1}{u} + 2\,\rho\,\sqrt{\frac{1}{u}}\right)\lambda_{mj}^2. \quad (7)$$

Then solving for $\lambda_{mj}$:

$$\lambda_{mj} = \sqrt{\frac{c_j}{1 + \frac{1}{u} + 2\,\rho\,\sqrt{\frac{1}{u}}}}. \quad (8)$$

Once $\lambda_{mj}$ is known, the value of the cross-loading d for variable j (two-factor example, so there is only one cross-loading, $b_j = 1$) can be calculated by rewriting (3):

$$u = \frac{\lambda_{mj}^2}{b_j \, d_j^2} \quad (9)$$
$$b_j \, d_j^2 = \frac{\lambda_{mj}^2}{u} \quad (10)$$
$$d_j^2 = \lambda_{mj}^2 \cdot \frac{1}{u \, b_j} \quad (11)$$
$$d_j = \lambda_{mj}\sqrt{\frac{1}{u \, b_j}}. \quad (12)$$


As $b_j$, $\lambda_{mj}$ and u are known, the value of the cross-loading $d_j$ can then be calculated. Note that as $b_j = 1$ in this example, the cross-loading $d_j$ is equal to $\lambda_{mj}\sqrt{\frac{1}{u}}$ here.

More generally there can be more factors than two and more than one cross-loading, which also means the possibility of more than one correlation between factors (though the correlations are kept uniform in the experiment, and the maximum number of cross-loadings is always 1 in the experiment). For $\rho_{rs}$ being the correlation between latent variable r and latent variable s, the formula for the communality c of indicator j then becomes

$$c_j = \sum_{i=1}^{q} \lambda_{ij}^2 + 2 \sum_{1 \le r \le (q-1)} \; \sum_{r < s \le q} \lambda_{rj}\,\rho_{rs}\,\lambda_{sj}. \quad (13)$$

Within the constraints of the experiment, however, factor correlations are uniform and the squared cross-loadings sum to a fixed value, so this can be simplified further: For $b_j$ being the number of cross-loadings that an indicator j has, indicator j not being the outcome variable, $b_j \ge 1$, $\lambda_{mj}$ being the main-loading of indicator j, ρ being the uniform factor correlation, and u being the uniform factor-purity, the communality $c_j$ of indicator j is then

$$c_j = \left(1 + \frac{1 + \rho\,b_j - \rho}{u} + 2\left(\rho\,\sqrt{b_j}\,\sqrt{\frac{1}{u}}\right)\right)\lambda_{mj}^2. \quad (14)$$

See Appendix B for the steps to arrive at this. From this equation the main-loading $\lambda_{mj}$ can then be found using

$$\lambda_{mj} = \sqrt{\frac{c_j}{1 + \frac{1 + \rho\,b_j - \rho}{u} + 2\,\rho\,\sqrt{b_j}\,\sqrt{\frac{1}{u}}}}. \quad (15)$$

As before, once $\lambda_{mj}$ is known, the values of the cross-loadings $d_j$ can then be determined using

$$d_j = \lambda_{mj}\sqrt{\frac{1}{u \, b_j}}, \quad (16)$$

and, as was said at the beginning of this section, if the number of cross-loadings $b_j$ is equal to 0 for indicator j, then the value of the main-loading of indicator j is simply the square root of the communality $c_j$.

The reason it was specified in both the two-factor example and the general case that variable j could not be the outcome variable is that all factors are equally related to the outcome variable within the experiment. The factor-purity u is only used for indicators that are not the outcome variable. For the outcome variable, the value of its loadings on all factors (because they are all equal) can be calculated in a simpler way by modifying (13): For $c_Y$ being the communality of the outcome variable, q being the number of latent factors, ρ being the uniform factor correlation, and $\lambda_{iY}$ being the value of all loadings of the outcome variable (because of the constraint that all factors influence the outcome variable equally), $c_Y$ is equal to

$$c_Y = \sum_{i=1}^{q} \lambda_{iY}^2 + 2 \sum_{1 \le r \le (q-1)} \; \sum_{r < s \le q} \lambda_{rY}\,\rho\,\lambda_{sY} \quad (17)$$
$$= q\,\lambda_{iY}^2 + (q^2 - q)\,\rho\,\lambda_{iY}^2 \quad (18)$$
$$= \left(q + \rho\,(q^2 - q)\right)\lambda_{iY}^2. \quad (19)$$

The simplification comes from knowing that all loadings of Y on the latent factors are equal, that Y loads on all factors (so there are q loadings), and that the factor correlation ρ is uniform. The first term in (18) represents the q paths up and down each factor loading; the second term the paths up one factor, through ρ, then back down another factor. There are two of each such path, because each of these paths can be traversed in reverse order as well. That is why the second term contains the number of unique paths, $\frac{q(q-1)}{2}$, multiplied by 2, which removes the denominator. From this the value of the loading $\lambda_{iY}$ that the outcome variable Y has on each factor can be found as before:

$$\lambda_{iY} = \sqrt{\frac{c_Y}{q + \rho\,(q^2 - q)}}. \quad (20)$$
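Equations (15), (16), and (20) translate directly into R; a small sketch of these closed-form solutions (the function names are mine):

    # Closed-form loading calculation under the experiment's constraints.
    main_loading <- function(c_j, u, rho, b_j)                       # Equation (15)
      sqrt(c_j / (1 + (1 + rho * b_j - rho) / u + 2 * rho * sqrt(b_j) * sqrt(1 / u)))
    cross_loading <- function(lambda_mj, u, b_j)                     # Equation (16)
      lambda_mj * sqrt(1 / (u * b_j))
    outcome_loading <- function(c_Y, q, rho)                         # Equation (20)
      sqrt(c_Y / (q + rho * (q^2 - q)))

    lam <- main_loading(c_j = .5, u = 3, rho = .3, b_j = 1)  # example values
    d   <- cross_loading(lam, u = 3, b_j = 1)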

Both these simplifications are only possible because of the constraints that factor correlations are uniform, cross-loadings are equal, and all factors are equally related to the outcome variable Y. Moreover, they only calculate the value of one indicator's main-loading at a time (or, in the case of Y, all of its values at once). While it is useful to work out specific cases such as these and make the formulas easier to understand, it is arguably more useful to create a general formula that does not require these constraints. Equation (13) can be rewritten using linear algebra instead, which allows calculation of Λ, the complete p × q loadings matrix. First the p × q weight matrix W is created. In the p − 1 rows that do not belong to the outcome variable Y, this matrix contains a 1 in the column of the factor that the indicator has its main-loading on, and if an indicator has at least one cross-loading ($b_j \ge 1$), that indicator's row also contains

$$\sqrt{\frac{1}{u \, b_j}} \quad (21)$$

in $b_j$ other columns. In the one remaining row, that of the outcome variable Y, there are only 1's, because all factors influence Y equally.

Then for c being the 1 × p vector of indicator communalities, W being the p × q weight matrix, Φ being the q × q matrix of correlations between latent variables in the true model, and v being a 1 × q vector of 1's, the loadings matrix Λ is given by

$$\Lambda = \left(\sqrt{c \oslash \operatorname{diag}\!\left(W \Phi W^{\top}\right)}\right)^{\!\top} v \circ W. \quad (22)$$

The ⊘ and ∘ denote respectively the Hadamard (entry-wise) division and product, where $(A \oslash B)_{ij} = A_{ij}/B_{ij}$ and $(A \circ B)_{ij} = A_{ij}B_{ij}$. Note that the main-loading does not need to have the weight 1 (corresponding to the 1 in the numerator of (21) and the 1 in the cells of the main-loading of each indicator in W); it is the ratio between main- and cross-loadings that matters in the formula. By using 1, however, the result is more intuitive to inspect, as the first term (the square root term) is already the vector of main-loadings for each indicator (and the value of every loading for the outcome variable, since all of its loadings are equal).

To explain: the diagonal of the term within parentheses in (22) gives a vector of length 1 × p. This vector contains the first term of (14) for each indicator: it contains all the summed valid paths from an indicator to itself, as any invalid paths are multiplied by a 0 either in the weight matrix W or the matrix of factor correlations Φ. The end result is a vector of quantities t, one per indicator, such that each indicator's communality equals its squared main-loading multiplied by the corresponding quantity: By re-expressing all potential cross-loadings as a fraction of the main-loading (using factor-purity u), the communality of any indicator can be shown to be equal to a real value multiplied by the squared main-loading, as was shown by (14). Knowing that, for c being the 1 × p vector of indicator communalities, t being the 1 × p vector of quantities, and f being the vector of squared main-loading values for all indicators, c is then

$$c = t \circ f. \quad (23)$$

Rearranging gives

$$f = c \oslash t, \quad (24)$$

which is the term under the square root in (22). The square root is then taken because the Λ matrix should contain the unsquared loadings. The resulting 1 × p vector is transposed to a p × 1 column vector and multiplied with the 1 × q vector of 1's, v, which effectively copies the original column vector q times. This expanded p × q matrix is then multiplied element-wise with the weight matrix W. Since each indicator has a 1 in the column of the factor it has its main-loading on in W, the originally calculated value of the main-loading is not scaled; but since indicators with cross-loadings (excluding Y, which is handled differently) have (21) in each of $b_j$ columns, this element-wise multiplication in essence carries out what was previously done in a separate step using (16), but for all indicators except Y at the same time.

This general formula accommodates varying factor correlations, any number of cross-loadings (including their absence), and any combination of weights.
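Equation (22) likewise has a near-literal R translation; a sketch, assuming the weight matrix W, the communality vector c, and the factor correlation matrix Phi described above (the function name is mine):

    # Sketch of Equation (22): the full p x q loading matrix in one step.
    lambda_from_weights <- function(W, Phi, c) {
      f <- c / diag(W %*% Phi %*% t(W))  # squared main-loadings, Equation (24)
      sqrt(f) * W                        # R's column-wise recycling is the outer
    }                                    # product with v followed by the Hadamard product with W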

Prediction techniques

Four prediction techniques were applied to the simulated data: A sum-score linear model, a ridge regression model, a PLS model, and the EFA-based prediction technique. These techniques will be described in a bit more depth below where necessary.

Sum-score linear model. Earlier it was described that each indicator in a true model loads most strongly on one latent variable. Using that knowledge, the sum-scores are found by summing the scores of variables that have their main-loading on the same latent variable. In a true model with q latent variables, that means q sum-scores. These (plus an intercept column of ones) are used to obtain the vector of least-squares estimates $b_{OLS}$ that minimizes the sum of squares via the normal equations:

$$SS = (y - Xb)^{\top}(y - Xb) \quad (25)$$
$$= \left(y^{\top} - b^{\top}X^{\top}\right)(y - Xb) \quad (26)$$
$$= y^{\top}y - y^{\top}Xb - b^{\top}X^{\top}y + b^{\top}X^{\top}Xb \quad (27)$$
$$\frac{\partial SS}{\partial b} = -2X^{\top}y + 2X^{\top}Xb \quad (28)$$
$$X^{\top}Xb = X^{\top}y \quad (29)$$
$$b_{OLS} = \left(X^{\top}X\right)^{-1}X^{\top}y. \quad (30)$$

It might seem that knowing which latent variable each indicator has its main-loading on uses information from the true model generation that is not apparent when possessing only the generated dataset. However, in the context of handling results from a psychological test battery, the assumption made by using the sum-score linear model is that all items of a specific test load only on the construct represented by that test in the first place, and therefore they can be summed. Given that inherent assumption, I do not consider the use of this information improper. It is also better to compare the other techniques against this best-case scenario, to be able to draw valid conclusions that are not conditional on the sum-score linear model being misspecified.

Using the least-squares coefficient estimates, predictions are made and an MSE is obtained.
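A short R sketch of this procedure, assuming a vector main_factor that records, for each column of the indicator matrices X_train and X_test, the factor it has its main-loading on (all names here are illustrative):

    # Sum indicators per test, then ordinary least squares.
    sum_scores <- function(X, main_factor)
      sapply(sort(unique(main_factor)),
             function(k) rowSums(X[, main_factor == k, drop = FALSE]))
    fit  <- lm(y_train ~ sum_scores(X_train, main_factor))  # q sum-scores + intercept
    pred <- cbind(1, sum_scores(X_test, main_factor)) %*% coef(fit)
    mse  <- mean((y_test - pred)^2)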

Ridge regression. Ridge regression is a popular technique: it is parametric, and so easy to interpret, yet its predictive performance is better than that of unbiased linear regression because it incorporates shrinkage through the $\ell_2$ norm. It minimizes the sum of squares plus the squared sum of estimated coefficients, the latter weighted by the tuning parameter δ. For $\beta_0$ being the intercept, $\beta_1 \ldots \beta_p$ being the coefficients for the p predictors, $y_i$ being the score on outcome variable Y for individual i, $x_{ij}$ being the score on predictor $x_j$ of individual i, and δ being the tuning parameter, the vector of ridge coefficients $b_{ridge}$ is given by

$$b_{ridge} = \underset{\beta}{\arg\min}\left\{\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^{2} + \delta\sum_{j=1}^{p}\beta_j^2\right\}. \quad (31)$$

If all predictors have been centered, the ridge coefficients are equivalently given by (Hastie et al., 2008, p. 64)

$$b_{ridge} = \left(\delta I + X^{\top}X\right)^{-1}X^{\top}y, \quad (32)$$

where the tuning parameter δ is usually chosen through k-fold cross-validation: Dividing the training-set into k folds, then training the ridge model on k − 1 folds and predicting the outcome variable Y in the remaining fold. This is done k times with many different values of the tuning parameter δ, obtaining k estimates of the out-of-sample MSE for each value of δ that was used. By taking the average over those k estimates, the on-average best value of the tuning parameter δ is found.

LASSO and group LASSO were considered as well. However, keeping in mind that communalities are uniform over indicators, that latent variable correlations are uniform, that there is at most one cross-loading, and that latent variables are always equally strongly related to the outcome variable Y, only three unique loading values are present between the indicators (excluding Y) and the latent variables in any true model: one if a variable has no cross-loadings (its main-loading is then the square root of the communality), and a second and third if there is a cross-loading (in which case the values of the main-loading and cross-loading follow from the closed-form equations in the loading calculation section). This means the LASSO and group LASSO would both be in an inherently disadvantageous situation: Given variables (or groups of variables) that are in the true model equally related to the outcome, it is known beforehand that any variable (or group of variables) selection done by the (group) LASSO would only be based on sampling variation. As prediction is the only goal here, ridge regression's property of keeping all variables in the model, even though their coefficients may become very small, is not an issue. Moreover, ridge regression handles collinearity better than the LASSO; in situations where communalities of indicators are quite high and the Λ matrix is very sparse, indicators sharing the same main latent variable would likely show a high degree of overlap.

The optimal value of the tuning parameter δ was found using 10-fold cross-validation. This value is that for which the average MSE over the folds was the lowest. Sometimes researchers use a value of δ that is a bit higher (more shrinkage), to account for the full training-set also being but a sample of the total population and therefore erring on the side of caution (Friedman, Hastie, & Tibshirani, 2010; Breiman, Friedman, Olshen, & Stone, 1984). To do this, one standard error of the optimal MSE is added to the optimal MSE and the highest δ within this new upper bound is used. Because it uses one standard error, this is called the one standard error rule. To make the current experiment a more direct continuation of prior work (de Rooij et al., 2018; de Bruin, 2018), which did not use this one standard error rule, it will not be used in the current experiment either. After the optimal value of δ has been found, ridge regression is carried out on the full training-set and predictions are made for the test-set, after which the MSE can be calculated.
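In R this corresponds to ridge regression via glmnet (the package named in the software section) with alpha = 0; a condensed sketch, with X_train, y_train, X_test, and y_test as hypothetical inputs:

    library(glmnet)
    cv   <- cv.glmnet(X_train, y_train, alpha = 0, nfolds = 10)  # alpha = 0: ridge
    pred <- predict(cv, newx = X_test, s = "lambda.min")         # optimal delta, no one-SE rule
    mse  <- mean((y_test - pred)^2)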

Partial Least Squares. PLS is a dimension reduction technique: Similar to PCR it finds z linear combinations of the original p variables (excluding the outcome variable), under the constraint that z ≤ p, thereby reducing the dimensions of the original data matrix (unless z = p combinations are created and no dimension reduction takes place), though the aim of PLS differs from that of PCR: While PCR finds the directions along which the data varies most, assuming that those same directions would explain the outcome variable Y most, PLS is a supervised technique and uses the outcome variable Y directly in the creation of those z linear combinations.

Within the experiment there is only one outcome variable, so the algorithm is simplified (it is assumed that the data matrix is standardized): Directions are chosen through a repeating process of creating linear combinations of the variables using univariate regression coefficients of Y on each variable. The univariate regression coefficients are used as weights in the creation of the new linear combination of the p variables, so variables that have a stronger linear relationship with Y on their own have greater weight in the linear combination. Each original variable is then separately regressed on the newly created linear combination to obtain p sets of residuals; these residuals represent the variance in each original variable that was not explained by the linear combination. These residuals are then used as the new values for the p variables, and the next linear combination is found through the same process, until z linear combinations are found. Finally, these z linear combinations are used in a linear model to predict the dependent variable. The regression coefficients for the z linear combinations can then be transformed back to coefficients for each of the p variables in the original data matrix (Hastie et al., 2008, p. 81; Wold, Sjostrom, & Eriksson, 2001). More efficient algorithms exist, as well as algorithms for multiple dependent variables, but this functions as an intuitive way to view PLS.

The number of components fitted was decided using 10-fold cross-validation, again not using the one standard error rule, so the number of components that gave the optimal MSE from the cross-validation was found. PLS was potentially also able to select 0 components as being optimal, which would result in mean-prediction. PLS was then carried out using the selected number of components on the full training-set and an MSE was obtained based on predictions for the test-set.
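A condensed sketch with the pls package (named in the software section), using its documented matrix-column idiom; the explicit handling of the 0-component case is my own illustration:

    library(pls)
    train_df <- data.frame(y = y_train, X = I(X_train))          # matrix column, pls idiom
    fit <- plsr(y ~ X, data = train_df, ncomp = 10, validation = "CV")
    cv_rmsep <- RMSEP(fit, estimate = "CV")$val["CV", 1, ]       # includes 0-component model
    best <- which.min(cv_rmsep) - 1                              # 0 components = mean prediction
    pred <- if (best == 0) rep(mean(y_train), nrow(X_test)) else
      predict(fit, newdata = data.frame(X = I(X_test)), ncomp = best)[, 1, 1]
    mse <- mean((y_test - pred)^2)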

EFA-based prediction rule. The formula for the model-implied covariance matrix Σ in a SEM model is

$$\Sigma = \Lambda(I - B)^{-1}\Phi\left[(I - B)^{-1}\right]^{\top}\Lambda^{\top} + \Psi, \quad (33)$$

where Λ is the p × q matrix of factor loadings, B is the q × q matrix of regression coefficients between latent factors, Φ is the q × q variance-covariance matrix of latent factors, Ψ is the p × p covariance matrix of measurement errors of the observed variables, and I is the identity matrix. In the models used in this thesis and the model in Figure 1 there are no directed paths between latent factors; they may covary, but no latent factor ever causally influences another. This means that the B matrix is always empty, so that the formula simplifies to

$$\Sigma = \Lambda\Phi\Lambda^{\top} + \Psi. \quad (34)$$

In the experiment, unrotated solutions from an EFA analysis are used, and as these are orthogonal, the Φ matrix is an identity matrix. This means that for the purposes of this experiment the formula further simplifies to

$$\Sigma = \Lambda\Lambda^{\top} + \Psi. \quad (35)$$

For g being the number of indicators that are not the outcome variable Y (for one outcome variable, g is therefore equal to p − 1), this model-implied covariance matrix can be partitioned like so:

$$\Sigma = \begin{pmatrix} \sigma_{YY} & \sigma_{YX_1} & \cdots & \sigma_{YX_g} \\ \sigma_{X_1Y} & \sigma_{X_1X_1} & \cdots & \sigma_{X_1X_g} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{X_gY} & \sigma_{X_gX_1} & \cdots & \sigma_{X_gX_g} \end{pmatrix} = \begin{pmatrix} \Sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{pmatrix}. \quad (36)$$

Standard linear algebra provides the coefficients of the expectation of Y given the indicators $X_{1 \ldots g}$ as the 1 × g vector γ (gamma):

$$\gamma = \Sigma_{YX}\,\Sigma_{XX}^{-1}. \quad (37)$$

For $y_i$ being the score on outcome variable Y for individual i, $x_{ij}$ being the score on indicator $X_j$ for individual i, and $\hat{\gamma}$ being the vector of estimated coefficients, the prediction of Y for individual i is then

$$\hat{y}_i = \bar{y} + \sum_{j=1}^{g} \hat{\gamma}_j\,(x_{ij} - \bar{x}_j). \quad (38)$$

Here 10-fold cross-validation was also applied to the training-set to see what number of factors gave optimal predictions; the range of factors tried was 1–10. One detail is that, unlike techniques such as PCR and PLS, where a maximum number of components is fitted and a subset is retained afterwards, factor analysis fits a specific number of factors to the data during the procedure. This means that for two EFA models fitted on the same dataset, one with q factors fitted and the other with (q − 1) factors fitted, the loading matrices of these models will not be identical up to column (q − 1). This is something that would be true for PCR and PLS models: Whether 2 components or 10 are fitted (on the same dataset), the loadings of the first component would be identical, and so are those of every component that the two models share after the first. Each of the 10 EFA models per division of folds had its own γ vector of coefficients. Each of these 10 γ vectors was then subjected to varying amounts of shrinkage; values used ranged from 1 (no shrinkage) to 0 (mean prediction) in steps of .05. After all folds had been used as the validation set, all combinations of the number of factors to fit and shrinkage had 10 estimates of the out-of-sample MSE. The lowest average MSE was found for a specific combination of number of factors to fit and shrinkage. This optimal combination was then used to fit an EFA model to the full training-set, and predictions were made on the test-set, from which an MSE for this method can be calculated.
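A condensed R sketch of a single fit of this rule (one fixed number of factors and one shrinkage value; the cross-validation loop over both is omitted), using psych::fa as named in the software section and assuming train and test matrices whose outcome column is named "Y":

    library(psych)
    efa <- fa(train, nfactors = 6, rotate = "none", fm = "ml")  # unrotated ML solution
    L   <- unclass(efa$loadings)                                # p x q loading matrix
    # fa() works on the correlation matrix; the simulated indicators have unit variance.
    Sigma <- L %*% t(L) + diag(efa$uniquenesses)                # Equation (35)
    iy    <- which(colnames(train) == "Y")                      # position of the outcome
    gamma <- Sigma[iy, -iy, drop = FALSE] %*% solve(Sigma[-iy, -iy])  # Equation (37)
    gamma <- 0.9 * gamma                              # example shrinkage value from the grid
    Xc   <- sweep(test[, -iy], 2, colMeans(train[, -iy]))       # center by training means
    pred <- mean(train[, iy]) + Xc %*% t(gamma)                 # Equation (38)
    mse  <- mean((test[, iy] - pred)^2)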

Software and packages used

The data was simulated, organized, and analyzed using R (R Core Team, 2018). As part of the data simulation, functions from the MASS package were used (Venables & Ripley, 2002). For the prediction techniques: the psych package provided the factor analysis functions (Revelle, 2018), the glmnet package the ridge regression functions (Friedman et al., 2010), and the pls package the functions for PLS (Mevik, Wehrens, & Liland, 2018). The dplyr and tidyr packages were used for general data wrangling (Wickham, François, Henry, & Müller, 2018; Wickham & Henry, 2018). The sjstats and pwr packages were used to calculate effect sizes (Lüdecke, 2019; Champely, 2018). Lastly, the ggplot2 package was used for visualizations (Wickham, 2016).

Results strategy

Analysis. Once the results are obtained, an exploratory main-effects ANOVA is applied with the design factors and prediction technique as between-subject predictors. Prediction technique is treated as nominal; all other predictors are treated as ordinal, and dummy-coding is used for these. The predictors besides prediction technique were not treated as metric, because the usual advantages of doing so (fewer estimated parameters, no combinatorial explosion when dealing with interaction effects) are not relevant here: Most design factors have only three unique values and at most four; this means interpreting at most three dummy estimates, which is not difficult. As it is a main-effects ANOVA, there are also no interaction effects. Finally, choosing to treat the design factors as metric implicitly assumes that a change of a given size will produce the same effect regardless of where on the scale of the design factor this change occurs. This assumption is not intuitive for design factors such as factor-purity or factor correlations, so I prefer not having to make it.

With eight design factors there will be eight main effects tested simultaneously in this main-effect ANOVA model. As such, a correction for multiple testing will be applied. In this experiment it is the Type II (false negative) error rate that is more important: Based on the result of the ANOVA, predictors could be dropped from the model (and later visualization); an unimportant design factor remaining in the visualization would do less harm to the interpretation than an important design factor being dropped. As a result, it was chosen to control the false discovery rate, which usually decreases the Type II error rate more than methods focusing on the family-wise error rate (Cramer et al., 2016). The Benjamini-Hochberg (BH) procedure (Benjamini & Hochberg, 1995) will be used to adjust p-values of predictors. The p-values of the univariate tests of the dummy estimates will not be adjusted. This is because it is only the estimate magnitude, the signs of the estimates, and the order of levels within each factor that are relevant to the visualization, not whether any specific dummy estimate of a factor is significantly different from that factor’s reference level. The unadjusted p-values of the dummy estimates will still be included in the results for the sake of completeness.

The effect size η² of each predictor is also calculated, as it is expected that predictors and dummy estimates will reach significance easily due to the large sample size. The formula for η² is

η² = SS_effect / SS_total,  (39)

so it can be interpreted as the proportion of the total variance accounted for by a predictor. This simple interpretation makes it attractive as an effect size; however, objections have been raised against η² because it is biased: it is an effect size for the sample only. A bias-corrected effect size such as ω² should then be used, but this is mostly relevant when samples are small (Ferguson, 2009; Fritz, Morris, & Richler, 2011). Given the very large sample produced by the simulation, I expect these two effect sizes to be near identical. Both effect sizes will still be inspected, and if differences at the second decimal are seen for any predictor, I will interpret ω² instead of η² for every predictor.
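The thesis used the sjstats package for these effect sizes; an equivalent direct computation from the ANOVA sums of squares, again assuming the hypothetical fit object from above, could look as follows.

# eta-squared and (bias-corrected) omega-squared per predictor, computed from
# the ANOVA sums of squares; the last row of the table is the residual.
tab <- anova(fit)
ss  <- tab[["Sum Sq"]]; df <- tab[["Df"]]
ms_res   <- ss[length(ss)] / df[length(df)]
eta_sq   <- ss[1:8] / sum(ss)
omega_sq <- (ss[1:8] - df[1:8] * ms_res) / (sum(ss) + ms_res)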

A 95% confidence interval (CI) for the effect size will also be given. This CI is based on the mean of the bootstrapped effect sizes ± the bootstrap standard error of the effect size, multiplied by the quantiles of the t-distribution enclosing 95%, with degrees of freedom equal to the number of bootstrap samples − 1. The number of bootstrap samples used is 1,000. Using η² as the example, with the subscript ∗ denoting the bootstrapped effect sizes, the 95% CI of each predictor's η² is given by

η̄²_∗ ± t_{999, .025} · SE(η̂²_∗).  (40)

This main-effects ANOVA is done to obtain an idea of which design factors were influential, what the signs of their effects were, and possibly whether some proved unimportant and can be removed from the model to make it simpler (which implicitly averages over that design factor). It also allows inspection of the estimates for the prediction techniques, which can hint at whether the EFA technique proved competitive on average. Ultimately, the main-effects ANOVA is intended only as a guide for the visualization of results and possibly for weeding out unimportant design factors (which also helps the visualization). It is this visualization of the results that will be interpreted to answer the research question, not the ANOVA itself.
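A minimal sketch of such a bootstrap is given below, reusing the hypothetical fit and results objects from the earlier sketches; eta_sq_of() is an illustrative helper and this is one plausible implementation, not the thesis code.

# Hypothetical sketch of the bootstrap CI for eta-squared (B = 1000 samples).
eta_sq_of <- function(d) {
  ss <- anova(update(fit, data = d))[["Sum Sq"]]
  ss[1:8] / sum(ss)
}
B <- 1000
boot_eta <- replicate(B, eta_sq_of(results[sample(nrow(results), replace = TRUE), ]))
se <- apply(boot_eta, 1, sd)                 # bootstrap standard error per predictor
ci_lower <- rowMeans(boot_eta) + qt(.025, df = B - 1) * se
ci_upper <- rowMeans(boot_eta) + qt(.975, df = B - 1) * se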

Visualization. The results of the main-effects ANOVA do not yet allow a clear picture of the results, because it is not possible to see exactly where certain prediction techniques performed well. In prior work for the two-factor case (de Bruin, 2018), facetted line-plots were used. Due to the large number of simulation conditions (3,888 if no unimportant design factors are found and dropped), that approach would require nearly 50 full pages of plots here. Allowing visual interpretation of the results without making it too cumbersome required a different visualization method; following the idea of Rücker and Schwarzer (2014), nested loop plots were chosen for this task. For full-factorial simulation studies such as the current experiment, nested loop plots present a central outcome variable on the vertical axis while ordering the design factors on the horizontal axis, producing a graph that looks similar to a time-series plot. This offers very dense yet interpretable plots, and allows a reader to quickly determine which combination of design factors produced a given result.

The order of design factor splits was determined by taking the absolute value of the dummy estimate of each design factor's highest level, dividing it by the number of unique values of that design factor, and then ordering the design factors by the resulting magnitude. Within each design factor, the levels were ordered either ascending or descending, depending on which of the two reduced MSE. This produced satisfactory results, as seen in for example Figure 3, where there is a visible decreasing trend in MSE from left to right, without the simulation conditions' MSEs "jumping" too much from condition to condition.

The original idea of nested loop plots was put forth by Rücker and Schwarzer (2014), but the scope of the function was expanded here and it was implemented using fully original code. The plots are made using the ggplot2 package (Wickham, 2016), which is common for published visualizations and allows easier editing of the plot objects. The code itself and an annotated document are provided separately from this thesis.
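As an illustration of the ordering rule described above, the sketch below applies it to a few of the dummy estimates that appear later in Table 3; the data frame is constructed by hand purely for demonstration.

# Hypothetical sketch of the split-ordering rule for the nested loop plot:
# |dummy estimate of the highest level| / number of levels, largest first.
ord <- data.frame(
  design_factor = c("Outcome communality", "Indicator communality",
                    "Factor correlations", "Cross-loading proportion"),
  est_highest   = c(-0.1792, -0.0477, -0.0384, -0.0120),
  n_levels      = c(3, 3, 4, 4)
)
ord$order_stat <- abs(ord$est_highest) / ord$n_levels
ord[order(ord$order_stat, decreasing = TRUE), ]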

If the EFA method appears promising under specific conditions seen in the visualization, a follow-up simulation study will be done. This study would be smaller in scope and is meant as a verification of the earlier findings, so it focuses on those specific conditions in which the EFA method performed best out of all prediction techniques. Naturally, if the EFA method does not appear promising, no follow-up simulation is performed.

Expectations

The selection of input arguments actually favors the sum-score linear model: the major weakness of sum-score prediction is that it assumes each variable that is summed contributes equally to the latent variable (or conversely, that the latent variable influences each of its indicators equally), but because the communalities are uniform across all variables in the true model, this weakness is somewhat mitigated. Another assumption is that all variables within a sum contribute only to that sum, or put otherwise, that there are no cross-loadings. Since there is at most one cross-loading in all true models for any indicator excluding Y, this is again mitigated somewhat. Provided the cross-loading proportion is not already 0, the impact of potential cross-loadings diminishes even further as the factor-purity gets higher, because the main-loading becomes relatively more important; since the sum-score linear model uses only the main-loadings, it again stands to benefit. There is also no danger of misspecification, since the model is fitted using the correct item-factor mappings from the true model. Based on all this, it is expected that this technique will perform quite well relative to the other techniques, though one of the other techniques outperforming the sum-score linear model in these nearly ideal conditions would be a pleasant surprise.

In prior work on one-factor models (de Rooij et al., 2018), the CFA-based technique was also compared against ridge regression: there it outperformed ridge in all situations except when the training-set was small (n = 100) and the communalities of the indicators were low (.3 or .5). This difference grew larger as the number of indicators increased (de Rooij et al., 2018). Prior work on two-factor models found the CFA-based technique outperforming ridge regression on average as well (de Bruin, 2018). The smallest training-set in the current experiment is n = 200, however, so I tentatively expect the EFA-based technique to perform at least as well as ridge regression across the board, with the gap still expected to increase in proportion to the number of indicators in the model.

Partial least squares and ridge regression are noted to perform very similarly, with ridge generally preferred if the goal is to minimize prediction error, though the difference was very slight (Frank & Friedman, 1993). In the context of the current experiment, I hold no particular expectations about how well this technique will perform or in which specific situations.

Results

Main simulation

As decided in the analysis strategy, a main-effects ANOVA was fitted first, treating all design factors as ordinal. The ANOVA table for this model is shown in Table 2 and the dummy estimates of this model are shown in Table 3.

Table 2
ANOVA table for the main-effects model

Predictor                  SS         df          MS          F              pBH      η²
Indicator-factor ratio     162.15     2           81.076      58,807.31      < .001   .014 [.013–.014]
Indicator communality      607.79     2           303.895     220,426.12     < .001   .052 [.051–.052]
Outcome communality        8,327.64   2           4,163.822   3,020,174.98   < .001   .710 [.709–.710]
Factor correlations        341.30     3           113.766     82,518.87      < .001   .029 [.029–.029]
Cross-loading proportion   29.95      3           9.984       7,242.06       < .001   .003 [.002–.003]
Size of training-set       104.40     2           52.201      37,863.62      < .001   .009 [.009–.009]
Factor-purity              0.04       2           0.018       13.11          < .001   .000 [.000–.000]
Technique                  18.24      3           6.079       4,409.69       < .001   .002 [.002–.002]
Residuals                  2,144.08   1,555,180   0.001

Note. SS = Sum of Squares, MS = Mean Square, pBH = Benjamini-Hochberg adjusted p-value, η² = eta-squared. Bracketed values after η² are its 95% confidence interval.


Table 3
Dummy estimates of the main-effects ANOVA model

Predictor                  Level      Estimate   SE       t           p
(Intercept)                            0.9153    0.0001    6,874.32   < .001
Indicator-factor ratio     5          −0.0164    0.0001     −224.69   < .001
                           7          −0.0246    0.0001     −336.73   < .001
Indicator communality      .5         −0.0312    0.0001     −428.33   < .001
                           .7         −0.0477    0.0001     −653.53   < .001
Outcome communality        .3         −0.0896    0.0001   −1,229.00   < .001
                           .4         −0.1792    0.0001   −2,457.71   < .001
Factor correlations        .1         −0.0160    0.0001     −189.48   < .001
                           .3         −0.0314    0.0001     −372.74   < .001
                           .5         −0.0384    0.0001     −456.49   < .001
Cross-loading proportion   .3         −0.0053    0.0001      −62.77   < .001
                           .5         −0.0084    0.0001      −99.79   < .001
                           .8         −0.0120    0.0001     −141.93   < .001
Size of training-set       500        −0.0142    0.0001     −194.94   < .001
                           1,000      −0.0194    0.0001     −265.68   < .001
Factor-purity              3           0.0002    0.0001        2.79     .005
                           4.5         0.0004    0.0001        5.11   < .001
Technique                  PLS        −0.0036    0.0001      −42.67   < .001
                           Ridge       0.0060    0.0001       71.15   < .001
                           Sumscore    0.0006    0.0001        7.60   < .001

The adjusted R² for this model is .8173, so over 80% of the variance in the data can be explained by the main effects of the design factors. From Table 2 it is seen that all design factors and the prediction technique are significant predictors: each pBH is extremely close to 0. To expand on that briefly: almost every pBH-value in Table 2 is below the machine precision of the computer the analysis was conducted on (2.2 × 10⁻¹⁶), meaning the software could not distinguish these values from 0. The sole exception is factor-purity, whose adjusted pBH-value is merely 2.035 × 10⁻⁶. I mention this in such detail because factor-purity being the only design factor with a non-zero pBH-value will influence a decision made later in this section. It is also mentioned because it affected the correction for multiple testing: as part of the BH procedure, the largest p-value (factor-purity) is not adjusted. All others would have been adjusted to varying degrees, each multiplied by a different value, but because all other p-values were internally represented as 0, they remained at 0 after multiplication. So all pBH-values in Table 2 are the same as before adjustment. As anticipated, p-values are less useful in large simulation studies, but effect sizes are also available.

The η² for each predictor is shown in Table 2; both η² and ω² were inspected for each predictor and found to be identical up to the fourth decimal, so, as stated in the analysis strategy, η² will be used for interpretation. The 95% confidence interval of each η² is given in brackets in Table 2.
