
Master’s Thesis Psychology
Methodology and Statistics Unit, Institute of Psychology
Faculty of Social and Behavioural Sciences, Leiden University
Date: August 2020
Student number: s2451018
Supervisors: Dr. Marjolein Fokkema & Wouter van Loon

Handling Missing Values in Multi-View Data

with Stacked Penalized Regression

Index

Abstract
Introduction
Method
    Dataset
    Study design
    Statistical analysis
Results
    Features missing randomly
    Views missing completely
Discussion
References
Appendix A
Appendix B
Appendix C
Appendix D


Abstract

Excessive data collection can be expensive to perform and process. Feature selection methods such as penalized regression are useful techniques to exclude less relevant features from a prediction model. However, these methods cannot be directly applied to multi-view data, as they ignore the grouped structure of this type of data and valuable information may be lost. For multi-view data, Stacked Penalized Regression (StaPLR) can be used for view selection. A challenge arises when missing data are present, especially when complete views are missing. Using listwise deletion, the most common method to deal with missing values, on such datasets will likely lead to substantial information loss. In this thesis, the effects of mean substitution, single Bayesian regression, multiple Bayesian regression and predictive mean matching on the predictive accuracy and sparsity of StaPLR are compared across different missing data scenarios, using an existing dataset containing items from the Mood and Anxiety Symptoms Questionnaire (MASQ). In addition to traditional feature imputation, StaPLR offers the possibility to impute view-level predictions when views are missing completely, which can decrease the computational burden. To evaluate the performance of the missing data methods, 27 missing data scenarios were created with features missing either randomly or within the view structure. Predictive accuracy is measured with the AUC, and sparsity by the number of views StaPLR selects. Multiple imputation works well in terms of predictive accuracy, but as the quantity of missing values increases it does not work well in terms of sparsity. Mean substitution could potentially be an acceptable alternative to multiple imputation when features are missing randomly. View-level imputation performs similarly to feature-level imputation for most methods when views are missing completely, and could therefore be used for these methods to decrease the computational burden. This thesis is a first step in handling missing values with StaPLR. Future research directions include repeating this research on more elaborate multi-view data structures, where the views are collected from different sources, to identify whether these effects are found under different circumstances.

Keywords: Stacked Penalized Regression, StaPLR, Missing Data, Multi-view Data, Multiple Imputation

Multi-view data is commonly used throughout different research fields (Li et al., 2016; Zhao et al., 2017). In multi-view data, predictor variables are grouped into meaningful subsets (Li et al., 2016). A simple example of this kind of data, often used in social science, is a questionnaire. The individual questions are referred to as features, and the subscales they form are referred to as views. A well-known example is the Big Five Inventory (BFI; John & Srivastava, 1999), where each question belongs to one of the five subscales. These subscales are then used as predictive measures for an external criterion variable. More elaborate multi-view data structures are found in biomedical research, where the views are often collected from different sources, such as questionnaires, various imaging data or genomics data (Li et al., 2016). Using multiple feature sources to predict the same phenomenon can lead to a deeper understanding and a more complete perspective of the phenomenon, and could potentially lead to a more accurate prediction. There is, however, a downside to multi-source data collection. Depending on the source, some data collection methods can be burdensome for the patients involved, and/or expensive to perform or process. To avoid excessive data collection, it would be advantageous to be able to identify which views yield the most accurate predictions.

Feature-selection models, such as (logistic) regression with a lasso penalty (Tibshirani, 1996), are very useful techniques to exclude irrelevant variables from a prediction model. These techniques not only improve prediction accuracy by reducing overfitting of the model, but also aid the interpretability of the model (James et al., 2013). However, applying these techniques directly to multi-view data ignores the grouped structure of this type of data. The specific feature-view relations are not taken into account, and valuable information may be lost (Li et al., 2016). A method called “Stacked Penalized Logistic Regression” (StaPLR) has recently been developed (Van Loon et al., 2020) which attempts to mitigate this problem by using a stacking approach.

Stacking is a method for combining multiple learners, and was first introduced by Wolpert (1992). Stacking uses a pool of learning algorithms (the base-learners) on a complete dataset. Subsequently, a new algorithm (the meta-learner) is used to combine the output of the base-learners to acquire the final predictions (Sesmero et al., 2015; Van Loon et al., 2020). To apply the principle of stacking to multi-view data, multi-view stacking (MVS) can be used. With MVS the base-learners are trained on each view separately instead of on the complete dataset. The meta-learner then combines these view-specific predictions (Garcia-Ceja et al., 2018; Van Loon et al., 2020). MVS can be used with a variety of different algorithms, and there is no established standard for choosing the base- and meta-learners. StaPLR uses MVS with penalized logistic regression as the algorithm for both the base- and meta-learners, which makes the interpretation of the model very straightforward. Using penalized logistic regression in MVS makes it possible to select the most important view(s) in a multi-view dataset, while keeping the structure of the dataset intact (Van Loon et al., 2020).
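To make this structure concrete, the sketch below implements the core of multi-view stacking with penalized logistic regression using the glmnet package. It is a minimal illustration of the idea described above, not the actual implementation in the multiview package; the names mvs_sketch, X, y and view are assumptions for this example.

```r
library(glmnet)

# Minimal multi-view stacking sketch: `X` is the n x p feature matrix, `y` a
# binary outcome, and `view` a length-p vector assigning each feature to a view.
mvs_sketch <- function(X, y, view, nfolds = 10) {
  views <- sort(unique(view))
  n <- nrow(X)
  Z <- matrix(NA_real_, n, length(views))  # view-specific predictions
  folds <- sample(rep(seq_len(nfolds), length.out = n))
  for (v in seq_along(views)) {
    Xv <- X[, view == views[v], drop = FALSE]
    # Base-learner: ridge (L2) logistic regression per view; cross-validated
    # predictions keep the meta-learner from seeing in-sample fits
    for (k in seq_len(nfolds)) {
      fit <- cv.glmnet(Xv[folds != k, , drop = FALSE], y[folds != k],
                       family = "binomial", alpha = 0)
      Z[folds == k, v] <- predict(fit, Xv[folds == k, , drop = FALSE],
                                  s = "lambda.min", type = "response")
    }
  }
  # Meta-learner: lasso (L1) logistic regression on the stacked predictions;
  # views with a nonzero meta-coefficient are the selected views
  meta <- cv.glmnet(Z, y, family = "binomial", alpha = 1)
  list(base_predictions = Z, meta = meta,
       selected = which(coef(meta, s = "lambda.min")[-1] != 0))
}
```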

A challenge with the use of StaPLR arises with missing data, as StaPLR can currently only be used on datasets without missing values. Missing data poses a problem for any type of dataset or statistical analysis, but multi-view data may specifically have some views missing completely. For example, it is likely that not all patients in a medical dataset have undergone the same data collection procedures. Using listwise deletion, the most common method to deal with missing values, on these datasets will therefore likely lead to substantial information loss. An alternative way to deal with missing data is to substitute missing values using imputation methods.

Imputation methods can be broadly classified as either single imputation (SI) or multiple imputation (MI). With single imputation, missing values are replaced by a value that is computed using the observed values. This is often done by either substituting the missing value with the observed mean of that variable, or by predicting the missing value with a regression equation (Kang, 2013; van Buuren, 2018). With multiple imputation, multiple copies of the incomplete dataset are created. For each of these dataset copies the missing values are replaced by a plausible data value. Often the predictive mean matching (PMM) technique is used. PMM uses the predicted values of observed cases that are most similar, and randomly chooses one value to replace the missing value. The magnitude of the difference between the imputed values in each dataset reflects the uncertainty of what the missing value should be. The analysis is then applied to all created datasets, and the resulting estimates are pooled into one final estimate (van Buuren, 2018).
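The multiple imputation workflow described above (impute several copies, analyze each copy, pool the results) can be illustrated with the mice package. This is a toy sketch using the nhanes example data shipped with mice, not the analysis of this thesis:

```r
library(mice)

# Impute 5 copies of the incomplete data with predictive mean matching,
# analyze each completed dataset, and pool the 5 estimates (Rubin's rules)
imp  <- mice(nhanes, m = 5, method = "pmm", seed = 1, printFlag = FALSE)
fits <- with(imp, lm(chl ~ bmi))
summary(pool(fits))
```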

Multiple imputation (MI) methods are known to produce appropriate results even in the presence of a small sample size or a high amount of missing data (Josse & Husson, 2012; Van Buuren et al., 2006). However, these methods are not very accessible, and some statistical knowledge is needed to perform the imputations, which may discourage non-statisticians from using them. In addition, the extra computation time needed to analyze multiple versions of the data can increase quickly for large datasets, especially with a large number of predictors (as with MRI or gene data) and when more than one analysis is performed. In contrast, single imputation methods are easier and quicker to perform, but are generally not recommended because they underestimate variance and increase bias (Kang, 2013; van Buuren, 2018). However, it can still be argued that single imputation can be a better option than listwise deletion, especially if the latter results in a substantial reduction of power due to a decreased sample size.

The way StaPLR handles multi-view data offers the possibility to impute the missing values on the feature level, or to impute only the view-level predictions. This difference is visualized in Figure 1. With feature-level imputation, the missing values are imputed before the analysis is performed. This is the “traditional” way of performing imputations. With view-level imputation, the base-learners are trained on the observed data, and the predictions for the observed data are used to impute predictions for the missing values. A study by Liu et al. (2019) showed this level of imputation has the potential to reduce the computational burden and prediction error compared to feature-level imputation.

Imputation could mitigate the adverse effects of missing values, but it is currently unknown how different missing data methods influence the predictive and view-selective accuracy of the StaPLR method. The aim of this study is therefore to compare the effects of four different imputation methods on the performance of the StaPLR method, using an existing dataset containing items from the Mood and Anxiety Symptoms Questionnaire (MASQ). Two single imputation methods (mean substitution and Bayesian regression imputation) and two multiple imputation methods (predictive mean matching and Bayesian regression imputation) will be compared to two benchmarks. Listwise deletion is expected to perform the worst, and will be used as a “worst case scenario”, i.e. the lower benchmark, to illustrate the effects of this method. The StaPLR analysis on the complete dataset will be used as the “best case scenario”, i.e. the upper benchmark. Imputations will be performed and compared on both the feature and the view level, with different quantities of missing data. The methods will be compared on predictive accuracy and sparsity of the model. Multiple imputation is expected to perform well in terms of predictive accuracy, but no clear hypothesis is formed in terms of performance on model sparsity. Single imputation is expected to perform better than listwise deletion overall, especially with high quantities of missing data. No clear hypothesis is formed for the performance of single imputation compared to multiple imputation, but the hope is that they will perform similarly overall. If single imputation performs as well as the multiple imputation methods, it could be used to deal with missingness in multi-view data when StaPLR is used, which would reduce the computational burden multiple imputation poses. Mean substitution in particular is an accessible method, both in theory and in practical application, and it would be advantageous if this method were an acceptable alternative to multiple imputation. In this study I will try to answer the following questions:

1. What is the relative performance of the imputation methods, when imputing on the feature level?

2. What is the relative performance of the imputation methods, when imputing on the view level?

3. When a complete view is missing, which level of imputation yields the highest prediction accuracy and sparsest model?

Figure 1

Feature-level versus view-level imputation of missing values.


Method

Dataset

For this study, a dataset containing items from the Mood and Anxiety Symptoms Questionnaire (MASQ; Clark & Watson, 1991) was used for a classification between depression/dysthymia and no depression/dysthymia. The presence of depression or dysthymia was determined according to the Dutch translation of the Mini-International Neuropsychiatric Interview (MINI; Sheehan et al., 1998; Van Vliet & De Beurs, 2007). Approximately half (46.21%) of all subjects were classified as having depression/dysthymia.

The data consisted of 3157 complete cases on 77 questions. The average age was 38.68 years (SD = 13.23). The majority (62.97%) of the subjects were female, with an average age of 37.96 years (SD = 13.33). The males were slightly older (M = 39.91; SD = 12.97). Each question on the questionnaire was scored between 1 (“not at all”) and 5 (“very much”). The questions were grouped into 5 subscales:

• Anhedonic Depression (AD; 22 items), an indicator for depression
• Anxious Arousal (AA; 17 items), an indicator for anxiety
• General Distress: Depression (GDD; 12 items), an indicator for nonspecific depression symptoms
• General Distress: Anxiety (GDA; 11 items), an indicator for nonspecific anxiety symptoms
• General Distress: Mixed (GDM; 15 items), an indicator for nonspecific depression and/or anxiety symptoms

All five scales can be used to predict the presence of depression/dysthymia. The correlations between the subscales are presented in Table 1. Subscale scores are the sum of all item scores that belong to that subscale. Average subscale scores are presented in Table 2. Item- and subscale distributions are presented in Figure A1.

Table 1

Correlations between subscales of the MASQ

Subscale | AD | AA | GDD | GDA | GDM
Anhedonic Depression | – | .435 | .751 | .631 | .730
Anxious Arousal | | – | .579 | .824 | .728
General Distress: Depression | | | – | .808 | .896
General Distress: Anxiety | | | | – | .921
General Distress: Mixed | | | | | –

Table 2

Average subscale score by gender

Subscale | Male | Female | All
Anhedonic Depression | 75.29 (17.17) | 73.96 (17.04) | 74.80 (17.13)
Anxious Arousal | 32.22 (12.73) | 30.48 (11.86) | 31.57 (12.44)
General Distress: Depression | 31.16 (12.53) | 29.04 (11.62) | 30.38 (12.24)
General Distress: Anxiety | 25.65 (8.75) | 24.42 (8.31) | 25.20 (8.61)
General Distress: Mixed | 40.81 (12.96) | 39.52 (12.66) | 40.34 (12.86)

Note. Values in parentheses are SDs.

Study design

A StaPLR analysis of the complete data was performed and used as the upper benchmark. Next, a missing data pattern was generated, and the missing values were imputed on either the feature or the view level. To serve as a lower benchmark, listwise deletion (LD) was performed. A StaPLR analysis was performed on each imputed dataset and on the complete cases, and the StaPLR analyses on the imputed datasets were compared with the two benchmarks. Note that while the upper benchmark is fixed, the lower benchmark differs depending on the condition.


Missing data patterns

Several multivariate missing data patterns were created, in which random subsets of M% of the features were made missing for S% of the cases. For M, levels of 25, 50 and 75 were used. For S, initial explorative testing showed no visible effects when these missing patterns were applied to less than 75% of the cases, so the patterns were applied to 75%, 90% and 98% of the cases. This resulted, for example, in a situation where 50% of the features were made missing for 75% of the cases.

For creating missing views, all features within V randomly chosen views were made missing. For V, levels of 1, 2 and 4 were chosen, which roughly matches the number of missing features in the random subset conditions. These missing patterns were also applied to S% of the cases. This resulted, for example, in a situation where four views were made completely missing for 90% of the cases. For these missing-view patterns, imputation was performed on both the feature and the view level. In this way, the two levels of imputation could be compared with each other in addition to the comparison with the benchmarks.

In total, this resulted in 27 cells of the missing-data design: 3 × 3 = 9 cells with features missing randomly, plus 3 × 3 = 9 cells with views missing completely, the latter imputed at both the feature and the view level (18 cells). While missing values for 98% of the cases may seem extreme, the smallest training sample used (n = 44) was still adequate by social science standards.
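Enumerated as a grid (a sketch; the variable names are illustrative):

```r
# The 27-cell missing-data design: M% of features missing randomly, or V views
# missing completely (imputed at the feature and at the view level), for S% of cases
design_features <- expand.grid(M = c(25, 50, 75), S = c(75, 90, 98))
design_views    <- expand.grid(V = c(1, 2, 4),    S = c(75, 90, 98))
nrow(design_features) + 2 * nrow(design_views)  # 9 + 2 * 9 = 27 cells
```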

Methods for handling missing values

For each cell of the design, LD was used to compute the lower benchmark, and four imputation methods were used to handle the missing values.

Mean substitution. This single imputation method imputes the missing value of a variable with the mean value of that variable. For feature-level imputation, the values of all observations on that variable are used to compute the mean, and this mean value is used as if it were an observed value. For view-level imputation, the predicted values of all observations within the same view are used to compute a mean prediction value, and this mean value is used as if it were an obtained prediction value (Kang, 2013; van Buuren, 2018).
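In code this amounts to a single line per variable; a sketch assuming x holds one feature and z one view's predictions, each with missing entries:

```r
# Feature-level: replace NAs by the observed mean of the feature
x[is.na(x)] <- mean(x, na.rm = TRUE)
# View-level: replace missing view predictions by the mean observed prediction
z[is.na(z)] <- mean(z, na.rm = TRUE)
```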

Bayesian regression. The missing value of a variable is imputed with an estimated value. This method uses the values of other variables in a regression to compute a probable estimate for the missing value. Because the intercept, slope and standard deviation of the residuals are unknown, these parameters are estimated from the data, and are therefore strongly related to the sample size (van Buuren, 2018). This parameter uncertainty is included in the imputations by drawing the parameters from their posterior distributions. Additionally, each estimate is augmented with a normally distributed random error with a variance equal to the residual variance of the regression model (van Buuren, 2018; Van Buuren et al., 2006). For imputation of a random subset of features, only the variables within the same view are used to estimate the missing value. For imputation of complete views, the (prediction) values of the other views are used to estimate the missing value or prediction. This method is used twice, both as single and as multiple imputation. With multiple Bayesian regression imputation, the method is performed five times, and the results are pooled by averaging all non-zero coefficients after the analysis is performed.
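The within-view restriction can be expressed with mice by imputing each view's columns separately; a sketch assuming data and a feature-to-view assignment view (mice's "norm" method is Bayesian linear regression):

```r
library(mice)

# Single Bayesian regression imputation per view: only the features of the
# same view enter the imputation model, respecting the feature-view relations
impute_within_views <- function(data, view) {
  out <- data
  for (v in unique(view)) {
    cols <- which(view == v)
    imp <- mice(data[, cols, drop = FALSE], method = "norm",
                m = 1, maxit = 1, printFlag = FALSE)
    out[, cols] <- complete(imp, 1)
  }
  out
}
```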

Predictive mean matching. As a first step, this method uses the values of other variables in a regression to compute a set of coefficients for the target variable. Next, a random draw from the posterior distribution of these coefficients is performed, producing a new set of coefficients. The randomly drawn coefficients are used to generate predicted values for the target variable, both for missing and for observed values. For each missing value, a set of observed cases with similar predicted values is selected, and one of their observed values is randomly chosen as a substitute for the missing value (van Buuren, 2018). For imputation of a random subset of features, only the variables within the same view are used to estimate the missing value. For imputation of complete views, the (prediction) values of the other views are used to estimate the missing value or prediction. This method can be used as either a single or a multiple imputation method, but in this study only the multiple imputation variant is used. Five datasets are created and the results are pooled by averaging all non-zero coefficients after the analysis is performed.
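The matching step can be made concrete with a simplified sketch for a single incomplete target variable. A proper implementation (as in mice) draws the coefficients from their exact posterior; that draw is only approximated here:

```r
# Simplified predictive mean matching for one incomplete variable `y`,
# given a complete predictor matrix `X`; k is the number of donor candidates
pmm_once <- function(y, X, k = 5) {
  obs <- !is.na(y)
  Xo  <- cbind(1, X[obs, , drop = FALSE])
  Xm  <- cbind(1, X[!obs, , drop = FALSE])
  fit <- lm.fit(Xo, y[obs])
  beta_hat <- fit$coefficients
  # Approximate posterior draw of the coefficients (a simplification)
  sigma2    <- sum(fit$residuals^2) / fit$df.residual
  V         <- sigma2 * solve(crossprod(Xo))
  beta_star <- beta_hat + drop(t(chol(V)) %*% rnorm(length(beta_hat)))
  pred_obs <- drop(Xo %*% beta_hat)   # predictions for observed cases
  pred_mis <- drop(Xm %*% beta_star)  # predictions for missing cases
  y[!obs] <- vapply(pred_mis, function(p) {
    donors <- order(abs(pred_obs - p))[seq_len(k)]  # k closest observed cases
    sample(y[obs][donors], 1)  # draw one observed value as the imputation
  }, numeric(1))
  y
}
```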

Criteria

For each condition, the coefficients of the selected views, the AU(RO)C, sensitivity, specificity and classification accuracy were computed, and the number of selected views was counted. The area under the ROC curve (AUC) was computed from the predicted probabilities, and the classification criteria were computed with a fixed threshold of 0.5 as

$$\text{sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \qquad \text{specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}, \qquad \text{accuracy} = \frac{\text{sensitivity} + \text{specificity}}{2},$$

where TP, FN, TN and FP denote the numbers of true positives, false negatives, true negatives and false positives, respectively.
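A sketch of these criteria in R, assuming prob holds the predicted probabilities and y the 0/1 outcomes on the test set; the AUC uses the threshold-free Mann-Whitney formulation:

```r
pred <- as.integer(prob >= 0.5)  # fixed classification threshold of 0.5
tp <- sum(pred == 1 & y == 1); fn <- sum(pred == 0 & y == 1)
tn <- sum(pred == 0 & y == 0); fp <- sum(pred == 1 & y == 0)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (sensitivity + specificity) / 2  # balanced accuracy
# AUC via ranks (equivalent to the area under the ROC curve)
r <- rank(prob); n1 <- sum(y == 1); n0 <- sum(y == 0)
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
```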

For this study, the average AUC and the number of selected views were used as primary measures. An ANOVA was conducted on the AUC of each method, including LD, and on the number of views each method selected. With these ANOVAs we tested whether the differences in AUC and number of selected views between the methods depended on the percentage of missing features (or number of missing views) and/or the percentage of missing cases. For the conditions with complete views missing, the effect of the level of imputation on the method differences was also tested. To test whether the effects of the methods differed on average, pairwise comparisons were used. Partial η² was used as effect size, and the norms suggested by Kirk (1996) were followed for its interpretation.

Statistical analysis

A 70-30 split was used to form a training and a test set. The analyses were repeated on 25 different train-test splits, and the results were averaged into one final result for each criterion. Averaging of the coefficients of the selected views was performed on the non-zero coefficients only.
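As a skeleton (run_condition is a hypothetical helper standing in for one imputation-plus-StaPLR run that returns the criteria of a single split, and dat for the full dataset):

```r
set.seed(1)
# 25 random 70/30 train-test splits, averaged per criterion
results <- replicate(25, {
  idx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
  run_condition(train = dat[idx, ], test = dat[-idx, ])
})
rowMeans(results)  # one final averaged result per criterion
```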

Missing data patterns

The missingness patterns were created using the function ‘ampute’ from the package “mice” (v3.8.0; Van Buuren & Groothuis-Oudshoorn, 2011) in R (R Core Team, 2019). The proportion and patterns of missingness were specified according to the cells of the design described above. For the missingness mechanism, the MCAR principle was used. All other default settings were sensible for this research purpose.
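For example, one cell (50% of features missing for 75% of the cases, MCAR) could be generated as follows; features stands in for the complete feature matrix:

```r
library(mice)

p   <- ncol(features)
pat <- matrix(1, nrow = 1, ncol = p)           # 1 = observed, 0 = amputed
pat[1, sample(p, size = round(0.5 * p))] <- 0  # 50% of features made missing
amp <- ampute(features, prop = 0.75,           # 75% of cases made incomplete
              patterns = pat, mech = "MCAR")
X_missing <- amp$amp                           # the amputed dataset
```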

Feature-level imputations

Imputation on the feature level was performed prior to using the StaPLR analysis. For all imputation methods except mean substitution, function ‘mice’ of package “mice” was used (v3.8.0; van Buuren & Groothuis-Oudshoorn, 2011). Following the advice of van Buuren (2018), single Bayesian regression imputation was performed with one iteration to impute missing values. For multiple Bayesian regression and predictive mean matching, five iterations were performed within each dataset to impute missing values, with a total of five imputed datasets.
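Expressed as mice calls, with X the incomplete data of one condition (a sketch of the settings described above):

```r
library(mice)

si_norm <- mice(X, method = "norm", m = 1, maxit = 1, printFlag = FALSE)  # single Bayesian regression
mi_norm <- mice(X, method = "norm", m = 5, maxit = 5, printFlag = FALSE)  # multiple Bayesian regression
mi_pmm  <- mice(X, method = "pmm",  m = 5, maxit = 5, printFlag = FALSE)  # predictive mean matching
completed <- lapply(1:5, function(i) complete(mi_pmm, i))  # the five imputed datasets
```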


View-level imputations

To impute the missing view-level predictions, the predicted values of the complete cases in the current view must be used. Therefore, the imputation mechanism had to be implemented within the StaPLR function. First, the base-learners were trained on the observed values only. Next, the missing predictions were imputed. Lastly, the meta-learner was trained on these base predictions to compute the final predictions and select the relevant views.
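Conceptually, the view-level procedure looks like the sketch below (illustrative names only; the actual mechanism lives inside the StaPLR function): the base-learners produce a prediction matrix Z with NAs for unobserved views, Z is imputed, and the meta-learner is trained on the completed matrix.

```r
# Step 1: train base-learners on observed values only, yielding Z (n x V)
# with NA where a case's view was missing (assumed done elsewhere).
# Step 2: impute the missing view-level predictions, e.g. by mean substitution:
impute_Z <- function(Z) {
  for (v in seq_len(ncol(Z))) {
    Z[is.na(Z[, v]), v] <- mean(Z[, v], na.rm = TRUE)
  }
  Z
}
# Step 3: train the meta-learner on the completed prediction matrix
# (for the regression-based methods, the other columns of Z act as predictors).
```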

StaPLR analysis

The StaPLR function of the package “multiview” (v0.2.2; Van Loon, 2020) was fit on the training set. StaPLR uses lambda to penalize the coefficient estimates, with higher values of lambda resulting in stronger shrinkage of the coefficients towards zero (James et al., 2013). The variables were standardized, and the ratio of the smallest to the largest lambda was set to 1e-04. An L2 penalty was used for the base-learners and an L1 penalty for the meta-learner. All other default settings of this function were sensible for this study and were used for the analyses. The coefficients of the selected views were checked and stored. The model was then used to make predictions on the test set, with the optimal lambdas determined from the training set.
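These penalty settings correspond to the following glmnet-style calls (a sketch of what happens under the hood, not the multiview package's actual code; X_view, Z and Z_test are illustrative names):

```r
library(glmnet)

base_fit <- cv.glmnet(X_view, y, family = "binomial",
                      alpha = 0,                 # L2 (ridge) on the base-learners
                      standardize = TRUE,
                      lambda.min.ratio = 1e-04)  # smallest/largest lambda ratio
meta_fit <- cv.glmnet(Z, y, family = "binomial",
                      alpha = 1)                 # L1 (lasso) on the meta-learner
# Optimal lambdas from the training set are reused for the test-set predictions
prob <- predict(meta_fit, Z_test, s = "lambda.min", type = "response")
```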

ANOVAs

Four ANOVAs were performed in SPSS (v24; IBM Corp., 2016). A mixed design model was used, both for the ANOVAs on AUC and for the ANOVAs on the number of selected views. Separate analyses were performed for the conditions with features missing randomly and for the conditions with views missing completely. Pairwise comparisons with Bonferroni adjustment were used to compare the main effects of the methods with each other.


Results

Features missing randomly

The results of the ANOVA on AUC are presented in Table 3. The percentage of missing cases has a large effect (Kirk, 1996) on the AUC values of the different missing value methods, ηp² = .219, while the percentage of missing features does not affect the AUC values, ηp² = .004. These (non-)effects are visible in Figure 2. With listwise deletion the AUC values decrease when more cases are missing, with a steep drop between 90% and 98% missing cases. The AUC values of all imputation methods decrease (slightly) between 75% and 90% missing cases, and then moderately increase again between 90% and 98% missing cases. It is unclear whether this is a significant effect; most likely the large interaction between the percentage of missing cases and the methods is explained by the results of LD. Although no significant interaction was found between the percentage of missing features and the methods, the imputation methods seem to diverge when more features are missing. It is possible this drift would increase if the number of missing features increased further. None of the methods exceeded the upper benchmark. MIreg, PMM and MS have the highest average AUC overall, and pairwise comparisons show no significant differences between these methods. LD and SIreg do perform significantly worse than the other methods.

Table 3

ANOVA on AUC for all conditions with features missing randomly

Effect | df | F | p | Partial η²
Method | 1.003 | 41.832 | < .001 | .162
Method × % missing features | 2.006 | 0.471 | .626 | .004
Method × % missing cases | 2.006 | 30.221 | < .001 | .219
Method × % missing features × % missing cases | 4.013 | 0.075 | .990 | .001
Error | 216.685 | | |

The results of the ANOVA on the number of selected views are presented in Table 4. The interaction between the percentage of missing features and the percentage of missing cases has a medium effect (Kirk, 1996) on the number of views selected by each method, ηp² = .092. This effect is visible in Figure 3. With listwise deletion the average number of selected views decreases when more cases are missing, with a steeper drop between 90% and 98% missing cases. This trend is visible regardless of the number of missing features. All imputation methods select the same number of views on average when 25% or 50% of features are missing, regardless of the percentage of missing cases. However, when 75% of features are missing, SIreg, MIreg and PMM select more views on average as more cases are missing, while MS remains fairly stable and close to the upper benchmark. LD differs significantly from all other methods in the average number of selected views. The percentages of missing features and missing cases have the least effect on MS, which performs most similarly to the benchmark and significantly differently from the other methods. No significant differences were found between SIreg, MIreg and PMM.

Table 4

ANOVA on number of selected views for all conditions with features missing randomly

Effect | df | F | p | Partial η²
Method | 2.459 | 384.196 | < .001 | .640
Method × % missing features | 4.918 | 67.930 | < .001 | .386
Method × % missing cases | 4.918 | 55.353 | < .001 | .339
Method × % missing features × % missing cases | 9.836 | 5.473 | < .001 | .092
Error | 531.155 | | |

Figures 2 and 3 show that the performance of LD worsens when the number of missing cases increases. While it is not necessarily bad that fewer views are selected with LD, the associated steep drop in average AUC shows this sparse model performs significantly worse. While the multiple imputation methods perform well in terms of AUC, the steep increase in the number of selected views makes it hard to obtain information about the importance of the individual views when more data are missing.

Table 5 presents the averaged estimated coefficients of the StaPLR analyses, and the number of times each individual view was selected. Only the three conditions with 75% missing features are displayed here. The detailed StaPLR results, including average AUC, accuracy, sensitivity and specificity for all conditions with features missing randomly, are presented in Tables B1 and B2. The upper benchmark clearly favours the depression indicator views (i.e. AD, GDD and GDM), with GDM unselected in only 3 of the 25 splits performed. While all methods also favour these views when the number of missing cases is relatively low, MS keeps this structure intact regardless of the number of missing cases. The structure also stays relatively intact with LD, though fewer views are selected overall when missingness increases. The anxiety views (i.e. AA and GDA) are progressively selected by SIreg, MIreg and PMM when missingness increases. However, a clear difference in coefficient size between the depression and anxiety views can be seen with MIreg and PMM, with the coefficients of the anxiety views on average much closer to zero. This indicates the lasso and/or ridge penalty was higher for these views compared to the depression views. With MS this distinction is also visible, while SIreg and LD lose this apparent difference.


Figure 2

Average AUC for mean substitution (MS), single Bayesian regression (SIreg), multiple Bayesian regression (MIreg), predictive mean matching (PMM) and listwise deletion (LD) for all conditions with features missing randomly

Figure 3

Average number of selected views for mean substitution (MS), single Bayesian regression (SIreg), multiple Bayesian regression (MIreg), predictive mean matching (PMM) and listwise deletion (LD) for all conditions with features missing randomly

Table 5

Averaged coefficients for the benchmarks and each imputation method for all conditions with 75% of features missing randomly

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching

75% missing cases
AD | 3.824 (0.176) | 3.770 (0.578) | 4.210 (0.220) | 3.515 (0.192) | 3.501 (0.142) | 3.507 (0.138)
AA | – | – | – | .457 (0.302) [8] | .166 (0.098) [10] | .149 (0.099) [9]
GDD | 1.553 (0.237) | 1.262 (0.705) [22] | 1.606 (0.313) | 2.025 (0.253) | 1.890 (0.174) | 1.846 (0.162)
GDA | – | .777 (–) [1] | – | .482 (0.248) [8] | .130 (0.094) [7] | .107 (0.099) [10]
GDM | .391 (0.209) [22] | .979 (0.596) [22] | 1.453 (0.290) | 1.891 (0.230) | 1.677 (0.211) | 1.695 (0.178)

90% missing cases
AD | 3.824 (0.176) | 3.037 (1.179) | 3.979 (0.203) | 3.553 (0.126) | 3.533 (0.100) | 3.525 (0.099)
AA | – | .977 (1.347) [3] | .415 (0.525) [2] | 1.084 (0.492) [18] | .315 (0.207) [22] | .326 (0.201) [22]
GDD | 1.553 (0.237) | 2.108 (1.153) [22] | 1.946 (0.357) | 2.355 (0.243) | 2.039 (0.219) | 1.986 (0.174)
GDA | – | .034 (–) [1] | .028 (0.022) [2] | .876 (0.373) [17] | .182 (0.225) [21] | .264 (0.204) [19]
GDM | .391 (0.209) [22] | .854 (0.722) [13] | 1.580 (0.335) | 2.125 (0.242) | 1.765 (0.243) | 1.791 (0.184)

98% missing cases
AD | 3.824 (0.176) | 3.218 (1.118) [19] | 3.640 (0.188) | 3.621 (0.173) | 3.501 (0.118) | 3.468 (0.117)
AA | – | 1.425 (1.132) [2] | .398 (0.317) [3] | 1.145 (0.487) [22] | .415 (0.287) [23] | .386 (0.226) [24]
GDD | 1.553 (0.237) | 1.820 (1.167) [9] | 1.981 (0.354) | 2.596 (0.211) | 2.196 (0.210) | 2.149 (0.201)
GDA | – | 1.955 (–) [1] | .210 (0.181) [3] | 1.009 (0.461) [20] | .347 (0.249) [21] | .359 (0.213) [22]
GDM | .391 (0.209) [22] | .960 (1.217) [4] | 1.648 (0.246) | 2.355 (0.193) | 1.891 (0.159) | 1.889 (0.157)

Note. Values are averaged non-zero coefficients with SDs in parentheses. The number in square brackets is the number of splits (out of 25) in which the view was selected; no bracketed number means the view was selected in all 25 splits.

Views missing completely

The results of the ANOVA on AUC are presented in Table 6. The interaction between the number of missing views and the percentage of missing cases has a small effect (Kirk, 1996) on the AUC values of the different missing value methods, ηp² = .026, while the interaction between the percentage of missing cases and the imputation level has a large effect (Kirk, 1996) on the AUC values, ηp² = .121. These effects are visible in Figure 4. When the missing features are imputed on the feature level, the average AUC value decreases with LD when more cases are missing. A steep drop can be seen between 90% and 98% missing cases, especially when 1 or 2 views are missing. However, when the missing features are imputed on the view level, the average AUC value remains more or less stable for LD between 75% and 90% missing cases. The AUC values drop when 98% of cases are missing, but far less extremely compared to feature-level imputation. When 4 views are missing, the performance of LD even exceeds that of the other imputation methods. While LD performs significantly differently from all other methods, it is not clear whether these AUC values are significantly higher than those of the other methods; the difference could be due solely to the large AUC drops of LD in the feature-level imputation conditions. There are no large visible differences in AUC between the levels of imputation for the other methods. When 2 views are missing, a decrease between 75% and 90% missing cases and an increase between 90% and 98% missing cases is seen for all imputation methods with feature-level imputation. This effect seems to be related to feature-level imputation only, as a similar effect was seen in Figure 2 when features were missing randomly, but it is not visibly present with view-level imputation. The AUC value decreases when more views are missing, regardless of imputation level, and the methods start to diverge. None of the methods exceeded the upper benchmark. MIreg, PMM and MS have the highest average AUC overall, and pairwise comparisons show no significant differences between these methods. SIreg performs significantly worse than the other methods, and LD also differs significantly. However, the performance of LD is very dependent on the imputation level, so while LD performs worse overall, it could potentially perform similarly or better if the effects of the methods were compared with view-level imputation only.

Table 6

ANOVA on AUC for all conditions with views missing completely

Effect | df | F | p | Partial η²
Method | 1.010 | 35.256 | < .001 | .075
Method × missing views | 2.020 | 19.983 | < .001 | .061
Method × % missing cases | 2.020 | 46.985 | < .001 | .179*
Method × imputation level | 1.010 | 36.710 | < .001 | .078
Method × missing views × % missing cases | 4.041 | 2.935 | .020 | .026
Method × missing views × imputation level | 2.020 | 1.546 | .214 | .007
Method × % missing cases × imputation level | 2.020 | 29.735 | < .001 | .121*
Method × missing views × % missing cases × imputation level | 4.041 | 1.503 | .200 | .014
Error | 436.378 | | |

Note. * large effect (Kirk, 1996).

The results of the ANOVA on the number of selected views are presented in Table 7. The interaction between the number of missing views and the percentage of missing cases has a small effect (Kirk, 1996) on the number of views selected by each method, ηp² = .022. The interaction between the percentage of missing cases and the imputation level also has a small effect (Kirk, 1996) on the number of selected views, ηp² = .022. These effects are visible in Figure 5. With LD the average number of selected views decreases when more cases are missing. This trend is visible regardless of the number of missing views. The drop between 90% and 98% missing cases is steep with feature-level imputation and moderate with view-level imputation. When 1 view is missing, all imputation methods select the same number of views, regardless of imputation level. However, a slight divergence of MS is visible when more cases are missing, especially with feature-level imputation. With 2 views missing, the number of selected views increases when more cases are missing. This increase is very steep for MS, and slight for SIreg, MIreg and PMM. When 4 views are missing, all methods select all views (most of the time), regardless of imputation level and percentage of missing cases. LD differs significantly from all other methods in the average number of selected views. The number of missing views and/or missing cases has the most effect on MS, which diverges completely from the other methods when 2 views are missing and performs significantly differently from the other methods overall. No significant differences were found between SIreg, MIreg and PMM.

Table 7

ANOVA on number of selected views for all conditions with views missing completely

Effect | df | F | p | Partial η²
Method | 2.110 | 1386.464 | < .001 | .762*
Method × missing views | 4.220 | 227.962 | < .001 | .513*
Method × % missing cases | 4.220 | 60.387 | < .001 | .218*
Method × imputation level | 2.110 | 13.626 | < .001 | .031
Method × missing views × % missing cases | 8.440 | 2.399 | .013 | .022
Method × missing views × imputation level | 4.220 | 2.002 | .088 | .009
Method × % missing cases × imputation level | 4.220 | 4.887 | .001 | .022
Method × missing views × % missing cases × imputation level | 8.440 | 1.134 | .337 | .010
Error | 911.565 | | |

Note. * large effect (Kirk, 1996).

Figures 4 and 5 show that the performance of LD worsens when the number of missing cases increases. With feature-level imputation the performance of LD on AUC is clearly worse compared to the imputation methods. However, with view-level imputation LD performs similarly, and possibly better when the number of missing views increases. The multiple imputation methods perform well overall in terms of AUC, and the number of selected views only starts to increase rapidly when 4 views are missing. There is no clear best method overall, though it is clear that MS performs significantly worse than the other methods in terms of stability of view selection.

The averaged estimated coefficients of the StaPLR analyses, and the number of times each individual view was selected, are presented in Table 8 for feature-level imputation and in Table 9 for view-level imputation. Only the three conditions with 4 missing views are displayed here. The detailed StaPLR results, including average AUC, accuracy, sensitivity and specificity for all conditions with views missing completely, are presented in Tables C1 and C2 for feature-level imputation, and Tables D1 and D2 for view-level imputation. LD is the only method that still favours the depression views over the anxiety views when 4 views are missing, though fewer views are selected overall when missingness increases. With view-level imputation slightly more views are selected on average for LD compared to feature-level imputation. All other methods select (almost) all views regardless of imputation level. Interestingly, the coefficients of SIreg, MIreg and PMM are overall notably smaller when view-level imputation is performed. This does not seem to be the case for MS. No consistent coefficient size differences are seen between the depression and anxiety views with any method.

Figure 4

Average AUC for mean substitution (MS), single Bayesian regression (SIreg), multiple Bayesian regression (MIreg), predictive mean matching (PMM) and listwise deletion (LD) for all conditions with views missing completely.


Figure 5

Average number of selected views for mean substitution (MS), single Bayesian regression (SIreg), multiple Bayesian regression (MIreg), predictive mean matching (PMM) and listwise deletion (LD) for all conditions with views missing completely.

Table 8

Averaged coefficients for the benchmarks and each imputation method for all feature-level imputation conditions with 4 views missing

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching

75% missing cases
AD | 3.824 (0.176) | 3.733 (0.653) | 3.413 (0.215) | 3.039 (0.296) | 2.734 (0.315) | 2.646 (0.329)
AA | – | 1.137 (–) [1] | 1.110 (0.379) [22] | 1.935 (0.660) | 1.277 (0.428) | 1.049 (0.387)
GDD | 1.553 (0.237) | 1.450 (0.874) [24] | 2.260 (0.367) | 2.489 (0.320) | 1.918 (0.444) | 1.797 (0.475)
GDA | – | – | .897 (0.421) [23] | 1.600 (0.617) | 1.001 (0.379) | .878 (0.434)
GDM | .391 (0.209) [22] | .951 (0.808) [15] | 2.078 (0.294) | 2.168 (0.470) | 1.595 (0.381) | 1.616 (0.413)

90% missing cases
AD | 3.824 (0.176) | 3.676 (0.899) | 4.118 (0.247) | 3.543 (0.417) | 3.126 (0.306) | 2.974 (0.317)
AA | – | 1.936 (0.170) [2] | 2.368 (0.552) | 3.163 (0.533) | 2.153 (0.456) | 1.900 (0.418)
GDD | 1.553 (0.237) | 1.396 (0.881) [20] | 3.389 (0.250) | 3.363 (0.400) | 2.789 (0.358) | 2.775 (0.358)
GDA | – | .972 (0.696) [2] | 2.266 (0.490) | 2.897 (0.504) | 2.098 (0.433) | 1.738 (0.536)
GDM | .391 (0.209) [22] | 1.329 (0.496) [13] | 3.204 (0.247) | 3.059 (0.516) | 2.391 (0.380) | 2.282 (0.327)

98% missing cases
AD | 3.824 (0.176) | 2.948 (1.834) [16] | 4.878 (0.226) | 4.125 (0.455) | 4.069 (0.306) | 3.959 (0.296)
AA | – | 1.246 (–) [1] | 3.934 (0.398) | 3.906 (0.521) | 3.596 (0.408) | 3.355 (0.529)
GDD | 1.553 (0.237) | 2.544 (1.572) [10] | 4.414 (0.235) | 3.583 (0.553) | 3.776 (0.438) | 3.781 (0.354)
GDA | – | .637 (0.479) [2] | 3.942 (0.258) | 3.678 (0.763) | 3.397 (0.486) | 3.269 (0.521)
GDM | .391 (0.209) [22] | 1.619 (1.169) [9] | 4.434 (0.236) | 3.623 (0.648) | 3.608 (0.532) | 3.534 (0.413)

Note. Values are averaged non-zero coefficients with SDs in parentheses. The number in square brackets is the number of splits (out of 25) in which the view was selected; no bracketed number means the view was selected in all 25 splits.

Table 9

Averaged coefficients for the benchmarks and each imputation method for all view-level imputation conditions with 4 views missing

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching

75% missing cases
AD | 3.824 (0.176) | 3.745 (0.522) | 3.390 (0.243) | 1.034 (0.268) | 1.478 (0.163) | 1.479 (0.159)
AA | – | – | .978 (0.503) [21] | .554 (0.235) [17] | .690 (0.235) | .643 (0.245)
GDD | 1.553 (0.237) | 1.430 (0.556) [23] | 2.331 (0.304) | 1.046 (0.335) | 1.058 (0.268) | .881 (0.249)
GDA | – | .389 (0.522) [3] | .889 (0.392) [22] | .655 (0.281) [23] | .640 (0.241) | .661 (0.255)
GDM | .391 (0.209) [22] | 1.108 (0.534) [15] | 2.112 (0.314) | 1.554 (0.299) | 1.006 (0.260) | 1.069 (0.239)

90% missing cases
AD | 3.824 (0.176) | 3.906 (0.879) | 4.171 (0.254) | .817 (0.172) | 1.133 (0.122) | 1.083 (0.152)
AA | – | 2.110 (1.687) [2] | 2.395 (0.416) | .589 (0.295) [23] | .850 (0.204) | .754 (0.173)
GDD | 1.553 (0.237) | 1.318 (0.631) [20] | 3.290 (0.203) | .951 (0.227) | .845 (0.153) | .840 (0.159)
GDA | – | .853 (–) [1] | 2.247 (0.530) | .717 (0.470) [24] | .791 (0.173) | .837 (0.190)
GDM | .391 (0.209) [22] | .717 (0.470) [15] | 3.207 (0.348) | 1.241 (0.233) | 1.139 (0.179) | 1.107 (0.128)

98% missing cases
AD | 3.824 (0.176) | 3.309 (1.870) [23] | 5.014 (0.166) | .789 (0.191) | .900 (0.107) | .878 (0.106)
AA | – | 4.568 (–) [1] | 3.997 (0.350) | .747 (0.302) | .883 (0.148) | .778 (0.131)
GDD | 1.553 (0.237) | 1.702 (1.155) [14] | 4.452 (0.204) | .827 (0.175) | .872 (0.079) | .910 (0.108)
GDA | – | 1.739 (–) [1] | 3.957 (0.230) | .882 (0.296) | .838 (0.156) | .861 (0.150)
GDM | .391 (0.209) [22] | 1.014 (0.710) [9] | 4.584 (0.196) | 1.021 (0.168) | .963 (0.122) | .988 (0.087)

Note. Values are averaged non-zero coefficients with SDs in parentheses. The number in square brackets is the number of splits (out of 25) in which the view was selected; no bracketed number means the view was selected in all 25 splits.


Discussion

In this thesis, mean substitution (MS), single Bayesian regression (SIreg), multiple Bayesian regression (MIreg) and predictive mean matching (PMM) were compared to an upper and a lower benchmark in terms of predictive accuracy and sparsity in different missing data scenarios. For the conditions with features missing randomly, listwise deletion (LD) and SIreg performed the worst overall. MS and the multiple imputation methods performed best on average AUC, and MS performed best in terms of stability of view selection. For the conditions with views missing completely and feature-level imputation, LD and SIreg again performed worst in terms of AUC, and MS performed worst in terms of stability of view selection. There is no clear best method overall: MS and the multiple imputation methods performed best on average AUC, and SIreg, MIreg and PMM on view selection up until 2 views were missing. With 4 views missing, LD was the only method that still discriminated between different views. For the conditions with views missing completely and view-level imputation, LD performed similarly to the multiple imputation methods, and possibly better when missingness increased. Here, too, there is no clear best method overall: SIreg, MIreg and PMM performed best on view selection up until 2 views were missing, and with 4 views missing LD was again the only method that still discriminated between different views. Performance on predictive accuracy and sparsity thus depends on the missingness condition; no method was clearly superior across all missing data scenarios.

With increasing percentages of features missing randomly, all imputation methods select more views, except MS, which remains stable. Even though more views are selected by these methods, the average AUC drops. An explanation could be that all imputation methods except MS add noise to their imputed values to simulate the uncertainty about the imputed value. It seems plausible that when the amount of missingness increases, the imputation methods can no longer distinguish between the added noise and the observed signal, and the view selection ability is diminished. This would also explain the view selection stability of MS, as the observed signal is strengthened by the repeated use of the mean and no noise is added. The views selected with MS show no inflation of the coefficients (see Table 5), which can be a problem with mean substitution (Kang, 2013; van Buuren, 2018). In addition, no significant differences were found between MS, MIreg and PMM in terms of predictive accuracy. These results, together with the fact that MS is computationally much faster than any multiple imputation method, seem to favour MS as the method to deal with missing data when features are missing randomly. These results do depend heavily on the design of this study, and it is entirely possible that the results would change with different levels of missingness. However, the performance of MS in these conditions was better than expected, and it would be interesting to see whether this result can be replicated under different circumstances.

When views are completely missing, MS does not perform as well, regardless of the level of imputation. In these conditions the number of selected views increases with MS before the other imputation methods stop favouring the depression views. With MS, all cases within a missing view receive exactly the same imputed (predicted) value(s). For example, when 90% of the cases have 2 missing views, 90% of the dataset has exactly the same (predicted) values for those 2 views, and only 10% of the data has variability on these views. It seems plausible that when the majority of the dataset is identical on multiple views, the view selection ability suffers accordingly. This effect is not visible with randomly missing features, because the number of cases that have missing values on exactly the same features is much smaller, and these missing features are randomly distributed over multiple views instead of concentrated within one view.

Overall, LD is not influenced by the number of missing features, because all cases with missing features are removed regardless of quantity. However, when the missingness is applied to more cases, less signal is available due to the smaller sample size, and fewer views are selected. This results in a (steep) drop in AUC when the number of missing cases increases, which is visible in each condition, though much less so when view-level imputation is used. In the view-level imputation conditions, LD uses all observed datapoints for the base-learners and only afterwards removes the incomplete cases, whereas in the feature-level imputation conditions the incomplete cases are removed before the base-learners are trained. Therefore more information is used for training the base-learners with view-level imputation, even though the same views were made missing. This result, though unexpected, could be an elegant way to reduce information loss without the added computational strain of multiple imputation. The average AUC with LD is (slightly) lower than with the other methods when 1 or 2 views are missing for 98% of the cases, which can be explained by the overall lower number of selected views in these conditions. However, when 4 views are missing, the AUC values of LD are higher with only around 2 views selected, while the other methods have lower AUC values with all 5 views selected. This decrease in computational time, both from avoiding the (multiple) imputation burden and from the sparsity of the model, may be worth the slightly lower predictive accuracy under certain circumstances.

In this study, MIreg and PMM both performed well in terms of AUC values, regardless of the missing data scenario. However, the computational strain of multiple imputation poses a real problem for large datasets and/or when multiple analyses are performed. The results of this study show that MIreg and PMM do not perform well with large amounts of missing data when model sparsity is required. In addition, the computational burden of analysing all the different missing data scenarios was increased substantially by these methods. Part of this computational burden can easily be lifted by using view-level imputation when views are missing completely. The prediction accuracy and sparsity of the model did not differ between feature- and view-level imputation for any of the imputation methods used, while LD performed notably better with view-level imputation. In addition, this study shows that for certain conditions where features are missing randomly, and for certain conditions where views are missing completely, MS and LD with view-level imputation could be acceptable alternatives to multiple imputation.

The focus of this study has been the predictive accuracy, measured with the AUC, and the view selection ability of the StaPLR method with different missing data methods. Due to the abundance of results, it was not possible to compare the methods on all criteria, or on the absolute values of the coefficients of the selected views. The results of this study are specific to this dataset and the StaPLR method, and cannot be directly generalized to other datasets or view-selection methods. The effects of the different missing data methods were only found when the missingness patterns were applied to 75% or more of the cases. When fewer cases were missing, no differences were found between LD and any imputation method on either average AUC or view selection ability. It is not entirely clear whether this was a result of the percentage of missingness or of the sample size of this dataset. However, it is likely that a similar study on a smaller dataset would yield different results, possibly at smaller amounts of missingness.

The design of this study made it possible to impute missing features based only on the features within the same view, when features were missing randomly. This was obviously not possible for SIreg, MIreg and PMM when complete views were missing. For these imputation methods the (prediction) values from other views were used to impute the missing value(s). When comparing the effects of the different imputation levels for these methods, there seems to be no difference in the number of views selected, nor in the specific views that were selected. This could be explained by the high correlations between the subscales in this dataset (Table 1), so it is likely that there would have been a difference if the views did not correlate this highly. An alternative strategy would be to look at the correlations between the individual features, and impute missing values based on the features that correlate most highly.

Because of the design of this study, the dataset was first split into a training and a test set, and the missingness pattern was applied to the training set only. While the imputation methods retained the ratio of 70% training and 30% test data, with LD the training set became (much) smaller than the test set. While this could contribute to the steep drop in average AUC between 90% and 98% missing cases in every condition, the average AUC values with only 2% of the data available are still remarkably high.

The next step in determining the best way to handle missing values in multi-view data would be to repeat this study on a dataset where the different views are collected from different sources, as is common in biomedical research. Most likely the correlations between views from different sources would be lower, and it would be informative to see what effect that has on the StaPLR method. The StaPLR method is currently being extended to discover interactions and/or non-linearities, which could make it more accurate in determining the best views for prediction. It would also be advantageous if the possibility to train the base-learners on all observed values were implemented in the method, as in this study this had to be done separately.

While multiple imputation is a useful technique, the statistical knowledge required and the computational strain may discourage people from using it properly. When views are completely missing, listwise deletion can be an acceptable method when the base-learners are trained on all observed cases. When an imputation method is preferred, it is advisable to use view-level imputation, as computation is much faster and similar results are found. When features are missing randomly, which makes view-level imputation impossible, MS could be an acceptable method under certain conditions.

(31)

References

Clark, L. A., & Watson, D. (1991). Tripartite model of anxiety and depression: psychometric evidence and taxonomic implications. Journal of abnormal psychology, 100(3), 316.

Garcia-Ceja, E., Galván-Tejada, C. E., & Brena, R. (2018). Multi-view stacking for activity recognition with sound and accelerometer data. Information Fusion, 40, 45–56.

https://doi.org/10.1016/j.inffus.2017.06.004

IBM Corp. Released 2016. IBM SPSS Statistics for Windows, Version 24.0. Armonk, NY: IBM Corp.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. New York, United States: Springer Publishing.

John, O. P., & Srivastava, S. (1999). The Big-Five trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin & O. P. John (Eds.), Handbook of personality: Theory and research (Vol. 2, pp. 102–138). New York: Guilford Press.

Josse, J., & Husson, F. (2012). Handling missing values in exploratory multivariate data analysis methods. Journal de la Société Française de Statistique, 153(2), 79-99.

Kang H. (2013). The prevention and handling of the missing data. Korean journal of anesthesiology, 64(5), 402–406. https://doi.org/10.4097/kjae.2013.64.5.402

Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and psychological measurement, 56(5), 746-759.

Li, Y., Wu, F.-X., & Ngom, A. (2016). A review on machine learning principles for multi- view biological data integration. Briefings in Bioinformatics, bbw113.

https://doi.org/10.1093/bib/bbw113

Liu, X., Zhu, X., Li, M., Tang, C., Zhu, E., Yin, J., & Gao, W. (2019). Efficient and Effective Incomplete Multi-View Clustering. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 4392–4399. https://doi.org/10.1609/aaai.v33i01.33014392

R Core Team (2019). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

(32)

Sesmero, M. P., Ledezma, A. I., & Sanchis, A. (2015). Generating ensembles of heterogeneous classifiers using Stacked Generalization. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(1), 21–34.

https://doi.org/10.1002/widm.1143

Sheehan, D. V., Lecrubier, Y., Sheehan, K. H., Amorim, P., Janavs, J., Weiller, E., ... & Dunbar, G. C. (1998). The Mini-International Neuropsychiatric Interview (MINI): the development and validation of a structured diagnostic psychiatric interview for DSM-IV and ICD-10. The Journal of clinical psychiatry.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267-288.

van Buuren, S. (2018). Flexible Imputation of Missing Data, Second Edition (2de editie). Abingdon, Verenigd Koninkrijk: Taylor & Francis.

Van Buuren, S., Brand, J. P. L., Groothuis-Oudshoorn, C. G. M., & Rubin, D. B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12), 1049–1064.

https://doi.org/10.1080/10629360600810434

Van Buuren, S., Groothuis-Oudshoorn, K. (2011). “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software, 45(3), 1- 67.

https://www.jstatsoft.org/v45/i03/.

Van Loon, W. (2020). multiview: Methods for High-Dimensional Multi-View Learning (multiview). R package version 0.2.2.

Van Loon, W., Fokkema, M., Szabo, B., & de Rooij, M. (2020). Stacked penalized logistic regression for selecting views in multi-view learning. Information Fusion, 61, 113- 123. https://doi.org/10.1016/j.inffus.2020.03.007

Van Vliet, I. M., & De Beurs, E. (2007). The MINI-International Neuropsychiatric Interview: A brief structured diagnostic psychiatric interview for DSM-IV and ICD-10 psychiatric disorders. Tijdschrift voor Psychiatrie, 49(6), 393.

Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5(2), 241–259. https://doi.org/10.1016/s0893-6080(05)80023-1

Zhao, J., Xie, X., Xu, X., & Sun, S. (2017). Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38, 43–54.

Appendix A

Figure A1

Appendix B

Table B1

Averaged coefficients for the benchmarks and each imputation method for all conditions with features missing randomly

25 % missing features for 75 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 3.719 (0.515) | 3.819 (0.221) | 3.676 (0.238) | 3.719 (0.176) | 3.698 (0.169)
AA  | - | - | - | - | - | -
GDD | 1.553 (0.237) | 1.700 [23] (0.560) | 1.409 (0.312) | 1.526 (0.288) | 1.468 (0.207) | 1.465 (0.206)
GDA | - | .444 [01] (-) | - | - | - | -
GDM | .391 [22] (0.209) | .579 [16] (0.545) | .746 (0.274) | .738 (0.256) | .751 (0.172) | .761 (0.167)

50 % missing features for 75 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 3.827 (0.823) | 3.899 (0.200) | 3.567 (0.167) | 3.642 (0.137) | 3.623 (0.116)
AA  | - | - | - | - | .145 [01] (-) | .001 [01] (-)
GDD | 1.553 (0.237) | 1.342 [24] (0.667) | 1.345 (0.237) | 1.591 (0.270) | 1.488 (0.174) | 1.485 (0.152)
GDA | - | - | - | - | - | -
GDM | .391 [22] (0.209) | .943 [19] (0.563) | 1.107 (0.246) | 1.200 (0.227) | 1.120 (0.194) | 1.113 (0.139)

75 % missing features for 75 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 3.770 (0.578) | 4.210 (0.220) | 3.515 (0.192) | 3.501 (0.142) | 3.507 (0.138)
AA  | - | - | - | .457 [08] (0.302) | .166 [10] (0.098) | .149 [09] (0.099)
GDD | 1.553 (0.237) | 1.262 [22] (0.705) | 1.606 (0.313) | 2.025 (0.253) | 1.890 (0.174) | 1.846 (0.162)
GDA | - | .777 [01] (-) | - | .482 [08] (0.248) | .130 [07] (0.094) | .107 [10] (0.099)
GDM | .391 [22] (0.209) | .979 [22] (0.596) | 1.453 (0.290) | 1.891 (0.230) | 1.677 (0.211) | 1.695 (0.178)

25 % missing features for 90 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 3.666 (0.971) | 3.909 (0.222) | 3.803 (0.173) | 3.807 (0.165) | 3.802 (0.172)
AA  | - | 1.846 [02] (0.390) | - | - | .089 [01] (-) | -
GDD | 1.553 (0.237) | 1.568 [20] (1.030) | 1.334 (0.294) | 1.324 (0.257) | 1.393 (0.227) | 1.365 (0.224)
GDA | - | - | - | - | - | -
GDM | .391 [22] (0.209) | .995 [14] (0.984) | .723 (0.234) | .839 (0.228) | .742 (0.219) | .766 (0.191)

50 % missing features for 90 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 3.796 (0.859) | 3.867 (0.128) | 3.564 (0.132) | 3.615 (0.132) | 3.623 (0.105)
AA  | - | 1.255 [03] (0.713) | - | .025 [01] (-) | .115 [01] (-) | .073 [02] (0.084)
GDD | 1.553 (0.237) | 1.135 [19] (0.677) | 1.368 (0.263) | 1.607 (0.237) | 1.551 (0.169) | 1.519 (0.190)
GDA | - | 1.133 [02] (0.005) | - | - | - | -
GDM | .391 [22] (0.209) | 1.422 [13] (0.769) | 1.133 (0.264) | 1.349 (0.249) | 1.172 (0.167) | 1.180 (0.141)

75 % missing features for 90 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 3.037 (1.179) | 3.979 (0.203) | 3.553 (0.126) | 3.533 (0.100) | 3.525 (0.099)
AA  | - | .977 [03] (1.347) | .415 [02] (0.525) | 1.084 [18] (0.492) | .315 [22] (0.207) | .326 [22] (0.201)
GDD | 1.553 (0.237) | 2.108 [22] (1.153) | 1.946 (0.357) | 2.355 (0.243) | 2.039 (0.219) | 1.986 (0.174)
GDA | - | .034 [01] (-) | .028 [02] (0.022) | .876 [17] (0.373) | .182 [21] (0.225) | .264 [19] (0.204)
GDM | .391 [22] (0.209) | .854 [13] (0.722) | 1.580 (0.335) | 2.125 (0.242) | 1.765 (0.243) | 1.791 (0.184)

25 % missing features for 98 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 3.813 [15] (1.338) | 3.916 (0.146) | 3.793 (0.171) | 3.769 (0.158) | 3.788 (0.133)
AA  | - | .879 [02] (0.085) | .256 [01] (-) | - | .106 [02] (0.059) | .059 [02] (0.056)
GDD | 1.553 (0.237) | 2.516 [13] (1.040) | 1.214 (0.244) | 1.283 (0.248) | 1.335 (0.167) | 1.276 (0.193)
GDA | - | 1.656 [03] (2.007) | - | - | - | -
GDM | .391 [22] (0.209) | 1.528 [10] (1.526) | .864 (0.242) | .916 (0.255) | .865 (0.197) | .883 (0.234)

50 % missing features for 98 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 2.807 [18] (1.620) | 3.730 (0.138) | 3.532 (0.187) | 3.561 (0.116) | 3.540 (0.138)
AA  | - | 2.070 [01] (-) | - | .341 [01] (-) | .114 [01] (-) | .076 [02] (0.004)
GDD | 1.553 (0.237) | 2.688 [08] (1.612) | 1.397 (0.265) | 1.603 (0.231) | 1.575 (0.228) | 1.537 (0.229)
GDA | - | 2.910 [02] (0.588) | - | - | - | .042 [01] (-)
GDM | .391 [22] (0.209) | 2.767 [06] (1.238) | 1.213 (0.351) | 1.397 (0.337) | 1.232 (0.281) | 1.273 (0.288)

75 % missing features for 98 % of the cases

View | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
AD  | 3.824 (0.176) | 3.218 [19] (1.118) | 3.640 (0.188) | 3.621 (0.173) | 3.501 (0.118) | 3.468 (0.117)
AA  | - | 1.425 [02] (1.132) | .398 [03] (0.317) | 1.145 [22] (0.487) | .415 [23] (0.287) | .386 [24] (0.226)
GDD | 1.553 (0.237) | 1.820 [09] (1.167) | 1.981 (0.354) | 2.596 (0.211) | 2.196 (0.210) | 2.149 (0.201)
GDA | - | 1.955 [01] (-) | .210 [03] (0.181) | 1.009 [20] (0.461) | .347 [21] (0.249) | .359 [22] (0.213)
GDM | .391 [22] (0.209) | .960 [04] (1.217) | 1.648 (0.246) | 2.355 (0.193) | 1.891 (0.159) | 1.889 (0.157)

Note. Values in square brackets give the number of the 25 train/test splits in which the view was selected (e.g., [08] = view selected in 8 splits); when no bracketed value is shown, the view was selected in all 25 splits. Values in parentheses are SDs; (-) indicates that no SD could be computed because the view was selected in only one split, and a single dash indicates that the view was never selected.
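The imputation methods compared above are all available as standard method codes in the mice package (Van Buuren & Groothuis-Oudshoorn, 2011). The R fragment below is a minimal illustrative sketch, not the analysis code used in this thesis: the data frame masq_data is a hypothetical placeholder, and the exact mice settings used in this study may differ.

library(mice)

# Minimal sketch, assuming a data frame `masq_data` with missing values.
# Standard mice method codes for the imputation methods in these tables:
#   "mean" - unconditional mean substitution
#   "norm" - Bayesian linear regression imputation
#   "pmm"  - predictive mean matching
# The single vs. multiple regression columns plausibly correspond to running
# "norm" with m = 1 versus m > 1; this mapping is an assumption.
imp <- mice(masq_data, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Each completed dataset can then be analysed separately (e.g., with StaPLR),
# after which view selection and performance are pooled over the m imputations.
completed_first <- complete(imp, 1)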

Table B2

Averaged results for the benchmarks and each imputation method for all conditions with features missing randomly

25 % missing features for 75 % of the cases

Metric | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
Accuracy    | .759 (.012) | .752 (.011) | .756 (.012) | .754 (.010) | .755 (.011) | .756 (.012)
Sensitivity | .717 (.022) | .718 (.030) | .742 (.017) | .728 (.018) | .728 (.018) | .730 (.019)
Specificity | .801 (.017) | .787 (.027) | .770 (.020) | .780 (.018) | .782 (.019) | .782 (.019)
AUC         | .830 (.011) | .826 (.009) | .827 (.010) | .827 (.010) | .828 (.010) | .828 (.010)

50 % missing features for 75 % of the cases

Metric | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
Accuracy    | .759 (.012) | .751 (.011) | .757 (.010) | .754 (.011) | .754 (.011) | .755 (.011)
Sensitivity | .717 (.022) | .720 (.036) | .747 (.017) | .721 (.022) | .721 (.022) | .722 (.023)
Specificity | .801 (.017) | .782 (.031) | .767 (.020) | .786 (.019) | .787 (.022) | .787 (.022)
AUC         | .830 (.011) | .823 (.011) | .826 (.011) | .825 (.011) | .826 (.011) | .826 (.011)

75 % missing features for 75 % of the cases

Metric | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
Accuracy    | .759 (.012) | .752 (.013) | .757 (.012) | .752 (.012) | .753 (.011) | .752 (.012)
Sensitivity | .717 (.022) | .720 (.029) | .759 (.020) | .726 (.019) | .721 (.022) | .718 (.023)
Specificity | .801 (.017) | .785 (.034) | .756 (.022) | .778 (.022) | .785 (.022) | .787 (.022)
AUC         | .830 (.011) | .825 (.014) | .826 (.013) | .823 (.013) | .825 (.013) | .825 (.013)

25 % missing features for 90 % of the cases

Metric | Upper benchmark | LD / Lower benchmark | Mean substitution | Single regression | Multiple regression | Pred. mean matching
Accuracy    | .759 (.012) | .742 (.012) | .752 (.011) | .751 (.011) | .752 (.010) | .752 (.011)
Sensitivity | .717 (.022) | .704 (.038) | .730 (.017) | .715 (.018) | .716 (.018) | .716 (.018)
Specificity | .801 (.017) | .781 (.033) | .774 (.015) | .787 (.015) | .787 (.016) | .788 (.016)
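For reference, the performance measures reported in these tables can be computed from out-of-sample class labels and predicted probabilities. The fragment below is a minimal sketch rather than the analysis code used in this thesis: the vectors y_test and pred_prob are hypothetical placeholders, and the pROC package (not part of the analyses reported here) is used only to illustrate the AUC computation.

library(pROC)

# Hypothetical test labels and predicted probabilities (placeholder values);
# in this thesis these would come from StaPLR predictions on a held-out split.
y_test    <- c(0, 1, 1, 0, 1, 0, 1, 1)
pred_prob <- c(0.20, 0.85, 0.60, 0.40, 0.90, 0.55, 0.70, 0.30)

pred_class <- as.numeric(pred_prob >= 0.5)  # classify at a 0.5 threshold

accuracy    <- mean(pred_class == y_test)
sensitivity <- sum(pred_class == 1 & y_test == 1) / sum(y_test == 1)  # true positive rate
specificity <- sum(pred_class == 0 & y_test == 0) / sum(y_test == 0)  # true negative rate
auc_value   <- as.numeric(auc(roc(y_test, pred_prob)))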
