
Master’s Thesis Psychology,

Methodology and Statistics Unit, Institute of Psychology, Faculty of Social and Behavioural Sciences, Leiden University
Date: October 6th, 2019

Student number: s2289423

Supervisors: Dr. Elise Dusseldorp, Dr. Marjolein Fokkema

Instability of QUalitative INteraction Trees

Quantifying uncertainty in decision trees


Acknowledgements

First and foremost, I would like to thank my thesis supervisor Dr. Elise Dusseldorp for her guidance and patience throughout the work on my Master's thesis. Her advice helped me improve and made me more confident in my statistical skills.

Secondly, I am very grateful for the feedback and comments from all my proof-readers, who helped to improve this thesis with their critical notes on language, comprehensibility and logical argumentation. Besides that, they were always available for emotional and motivational support.

I would furthermore like to express my gratitude to my parents, who made this additional year of studying in Leiden possible. I knew I could always count on them and I know that they are always there for me, on both good and bad days.

Finally, I want to thank the entire M&S team of Leiden University. I had an exciting and challenging year of which I am very proud. I learned so much and I enjoyed being pushed to my limits. The personal relationship with the instructors, the personal feedback and the well-organized structure of this Master's programme made this year perfect for me.


Abstract

Qualitative Interaction Trees (QUINT) are based on decision trees and explore subject-treatment interactions in patient groups. Their particularity is that, unlike quantitative interaction trees, they focus on qualitative interactions. Like other statistical learning methods that aim to keep a certain level of interpretability, they suffer from a lack of stability, which may result in unstable partitioning and assignment of the patients. In our research, the detection and quantification of this instability are examined in more detail. Two studies were conducted to investigate two different aspects of stability and trustworthiness: the inclusion of bootstrapped confidence intervals in QUINT and the usefulness of two stability coefficients.

Study 1 concerns the implementation of a bootstrapping procedure to obtain confidence intervals around the difference in treatment means, quantifying the uncertainty of the final grouping of QUINT. Study 2 compares two indices that measure the stability of a tree model with regard to their sensitivity to changes in the partitioning and structure of the models. The first index is Cohen's Kappa, which assesses class agreement or semantic stability; the second is the region compatibility (RC) measure. The latter focuses on the assignment of subjects to single leaves and is assumed to also be sensitive to structural similarities and differences between decision trees. The results showed that the bootstrapping approach tended to provide more conservative estimates and wider confidence intervals than the naïve method. Bootstrapped results are usually preferred because they are more honest. The two stability coefficients showed a moderately high correlation. Both reacted to structural differences, but Kappa distinguished better between stable and unstable models.


Table of Contents

Acknowledgements ... 2

Abstract ... 3

1. Introduction ... 6

2. Theoretical Framework of Study 1 ... 10

2.1. The QUINT algorithm ... 10

2.2. Uncertainty measured by confidence intervals ... 11

2.3. Study 1 - The bootstrap-approach from Loh et al. (2015) ... 12

3. Theoretical Framework of Study 2 ... 14

3.1. Semantic vs. Structural Stability ... 14

3.2. Overview of approaches to measure stability ... 14

3.3. Study 2 - Stability Estimation ... 16

3.3.1. Resampling methods and Evaluation overlap ... 16

3.3.2. The region compatibility measure ... 16

3.3.3. Cohen’s Kappa as measure for Class Agreement ... 18

4. Monte Carlo Simulation Study... 20

4.1. Study 1 ... 20

4.2. Study 2 ... 22

5. Results ... 24

5.1. Evaluation of Study 1 ... 24

5.1.1. Effect of method ... 24

5.1.2. Effects of other factors ... 25

5.1.3. The role of the leaves ... 26

5.1.4. Discussion ... 28

5.2. Evaluation of Study 2 ... 29

5.2.1. Missing Data ... 30

5.2.2. Effects of the Resampling Procedure and Learning Overlap ... 30

5.2.3. Structural Stability ... 31

5.2.4. Discussion ... 33

6. General Discussion ... 36

7. Conclusion ... 37

References ... 39

Appendix ... 42


List of Tables and Figures

Figure 1 QUINT Tree Example ... 8

Figure 2a and 2b Results Study 2 Model and Sample Size ... 32

Figure 3 Results Study 2 - Structure and Coefficients ... 33

Table 1 Results Study 1 - Overview ... 25

Table 2a and 2b Results Study 1 – Differences between Leaves ... 26

Table 3 Results Study 1 – Comparing the bootstraps ... 28


1. Introduction

With growing research into individual differences and more advanced analysis methods, such as statistical learning approaches, research into treatment effectiveness for mental disorders has been shifting from finding the overall best treatment for a diagnosis to finding the best treatment for a single patient, depending on his or her characteristics and disease history. Recently, more and more complex machine and deep learning models have been used to predict treatment effectiveness for individuals based on (tracked) behavioural data, gene or fMRI data (Heinsfeld, Franco, Craddock, Buchweitz, & Meneguzzi, 2018; Yang, Liu, Sui, Pearlson, & Calhoun, 2010). However, such models are extremely complex, and it is almost impossible to interpret them and to understand how they arrive at the results they provide. This becomes a problem as soon as these models are used in practice by people who do not have a statistical background, for example doctors or psychotherapists. They have an interest in understanding why they should do what the model tells them, so interpretability is essential for acceptance and application (Molnar, 2018).

A method which performs well but is also easy to interpret is the single decision tree. Commonly, decision trees are based on a recursive partitioning algorithm which can be visualized with a tree structure, and they can be used for both regression and classification problems. Regression trees have a continuous or discrete dependent variable and their performance is commonly evaluated by the squared error between predicted and actual values. Conversely, classification trees have a dependent variable which can take on a value of two or more classes; the prediction error is then quantified as the misclassification rate (Loh, 2011). Such single trees can be easily visualized and the patient assignment retraced, which makes them a popular method for therapists and doctors. Applied to a sample of patients who were treated with two different therapy approaches, the resulting model can be used to assign future patients to the leaves, and the method thereby mimics the decision process of a doctor seeking the best treatment based on patient characteristics.

Besides algorithms which aim to classify objects into two or more groups or to predict the value of the dependent variable, such as CART (Classification and Regression Trees; Breiman, 1998) or C4.5 (Quinlan, 2014), there are tree methods designed for special purposes. One of these purposes is to investigate whether a treatment's effect differs between patients, by detecting patient-treatment interactions in clinical studies.

The most favourable outcome of a therapy is clearly that an overall positive effect can be seen and that the symptoms of the patients assigned to the respective treatment are verifiably reduced. Nevertheless, it can happen that no noticeable effect is visible for a group of patients or, in the worst case, that a therapy results in a significant deterioration of the patients' mental health. As mentioned before, the problem is that the outcome of one therapy does not have to be the same for all patients but can vary based on specific individual characteristics or the history of the diagnosis. Such differences in subject-treatment interaction can be quantitative or qualitative (Byar, 1985; Lubin, 1961). A quantitative or ordinal interaction means that the effect of the therapy differs in size between patients, but it is still possible to make a general statement about which therapy approach is best. Algorithms which were developed for such cases are, amongst others, SIDES (Lipkovich, Dmitrienko, Denne, & Enas, 2011) and GUIDE (Loh, 2002). Unfortunately, this no longer holds when a qualitative or disordinal subject-treatment interaction occurs: in that case one treatment is superior to another for one subgroup of patients and vice versa.

To help applied researchers become aware of potential differences in treatment effectiveness in their datasets, Dusseldorp and van Mechelen (2014) introduced Qualitative Interaction Trees (QUINT), which offer a clearly visualised method to detect subject-treatment interactions. Compared to traditional methods, this statistical learning method is also able to detect higher-order (e.g., three-way) and non-linear interactions. Using the basic approach of a sequential regression tree, QUINT enables practitioners to identify the most important patient characteristics for the assignment of a patient to one or the other treatment. The algorithm aims to find subgroups which are substantially different, meaning that it attempts to find considerably large groups which show a clear interaction effect. As an outcome, QUINT distinguishes three different types of leaves (see Figure 1): for a given patient subgroup, one of the two treatments can be superior in terms of average treatment outcome, or there is no difference and both therapies are equally suitable for that subgroup. The method colours the leaves depending on the interaction effect: if treatment 1 is superior to treatment 2 the leaf is presented in green, if it is the other way around the leaf appears red, and if the two treatment groups do not differ the respective leaf is grey.


Figure 1 QUINT tree – Model 1 including all three possible classes and the respective mean differences

Decision trees are a common choice when the main interest lies in the interpretability of the method. However, they are also known for being rather unstable compared to other methods (Kuhn & Johnson, 2013; Turney, 1995). Even though a couple of studies show that single decision trees are not more unstable than other explainable methods such as OLS or lasso regression (Frick, Strobl, & Zeileis, 2014), they cannot compete with more complex models (e.g., multi-layer, bagged or boosted models). Commonly, instability is defined as large changes in the model coefficients caused by only a small change in the data (Breiman, 1996b). For decision trees, this means that minor changes in the dataset might cause a different choice of the splitting variable and its cutpoint at any step of the partitioning procedure (Philipp, Zeileis, & Strobl, 2016). Stability and, relatedly, reproducibility are important properties of a statistical tool (see e.g., Stodden, 2015; Turney, 1995). Davison and Hinkley (2009, p. 1) even go a step further when they say that "the explicit recognition of uncertainty is central to the statistical sciences." Measuring the stability or uncertainty of a model gives an idea of how trustworthy future predictions will be.

If a tree is fitted to a dataset without stopping criteria, it is likely that the tree grows very large, becomes extremely complex to interpret and shows poor performance on a new dataset, which is known as overfitting. A common way to reduce the complexity and overfitting of a single tree, and thereby increase its stability, is to use a regularization technique. For example, cost-complexity pruning approximates the optimal depth of a tree by using its size as a penalty parameter in the optimization (Breiman, 1998; Quinlan, 1987). Nevertheless, pruned trees still show a substantial amount of instability when they are applied to new datasets, so pruning is not a complete remedy either. Another way to deal with the instability of single trees, and to improve their predictive accuracy at the same time, is to use bagged or boosted procedures instead, where many trees are fitted (Breiman, 1996a; Schapire & Freund, 2012). However, these methods are far more computationally expensive and lack interpretability, which makes them less favourable with regard to practical application.

For these reasons, QUINT is based on a single decision tree approach instead. As mentioned before, the main benefits are simplicity and practical use. Nonetheless, these advantages come with some drawbacks. Compared to other predictive models, single decision trees tend to perform worse precisely because of this simplicity. The way the predictor space is partitioned often prevents trees from finding the optimal solution, and the number of potential predicted outcomes is limited by the number of leaves, which is often smaller than for other types of models (Kuhn & Johnson, 2013). These downsides are commonly known but accepted because of the dominating advantages in some situations. One way to deal with the uncertainty and instability of these methods is to attempt to quantify their extent in order to frame the interpretation of the output.

More specifically, instability in tree algorithms can be caused by extreme leaves with only a small number of subjects assigned to them, by the amount of noise in the data, by (in the case of interaction trees) multiple independent interactions in the data, and by highly correlated explanatory variables. The latter in particular is not easy for decision trees to handle. It weakens one of the strengths of trees, namely that variables which are not used in any split are ignored, so that the tree intrinsically performs a kind of feature selection. High correlations between predictors make the choice of one of them at a specific split point rather arbitrary (Kuhn & Johnson, 2013).

As QUINT is a tree-based algorithm, the same weaknesses apply. QUINT offers the possibility to prune the final decision tree back to obtain a less overfitted model, using the 'bias-corrected bootstrap procedure' (Leblanc & Crowley, 1993) to find the model with the least biased results. However, bias is not the only source of potential error; correlations and noise in the data might also lead to unstable models. In addition, sample size and effect size have a non-negligible influence on the average class agreement: an increase in either leads to better differentiation and higher class agreement (Dusseldorp & van Mechelen, 2014).

The current study examines to what extent uncertainty and instability in QUINT trees can be assessed and evaluated. To this end, two different approaches were chosen, resulting in two research questions; for each of them an independent study was performed. The first study aimed to compare two ways of assessing confidence intervals for the estimated difference in means in the leaves of a tree, in order to quantify the uncertainty of the group effect per leaf. Study 2 focused on assessing the stability of the structure of QUINT trees by comparing two stability coefficients with regard to their strengths and weaknesses. The uncertainty and instability measures are described in more detail in the following sections.

The results of this study may provide valuable insight into how to quantify uncertainty and instability in Qualitative Interaction Trees and give the user information about how trustworthy the model found on a given dataset is. The remainder of this thesis is organized as follows: the next sections give an overview of the methods and measures used for the two studies; Section 4 explains the design of the Monte Carlo simulation studies; Section 5 presents the results, followed by a discussion and conclusion in the last sections.

2. Theoretical Framework of Study 1

2.1. The QUINT algorithm

QUINT is a sequential partitioning algorithm which aims to optimize a combined partitioning criterion: the difference in treatment outcomes for each leaf, which can be specified either as the difference in means or as the standardized effect size Cohen's d, should be maximized, as should the number of subjects assigned to each leaf. The algorithm stops growing the tree as soon as one of the stopping criteria is reached, namely if the maximum of the global partitioning criterion C is found, if there is no disordinal interaction in the data, if the (pre-defined) minimal number of subjects per treatment in the leaves is reached, if a user-specified number of leaves is reached, and if the partition classes 1 and 2 are not empty (Dusseldorp & van Mechelen, 2014). In a second step, QUINT offers the possibility to prune the tree back to avoid overfitting on the training sample. To this end it uses a bias-corrected pruning technique (Efron, 1983; Leblanc & Crowley, 1993), which calculates an averaged value for the overly optimistic model and subtracts it from the actual global partitioning criterion.

For each leaf, QUINT estimates the difference between treatment means (i.e., the unstandardized effect size) and the related standard error, which can be used to calculate naïve confidence intervals. However, such naïve estimates, which result directly from the partitioning algorithm, are often too optimistic: they overestimate the effect size while underestimating the standard error. QUINT is no exception, as it tends to overestimate the effect sizes in the leaves. Dusseldorp and van Mechelen (2014) show that, especially for smaller sample sizes and standardized effect sizes below 0.30, the Type I error (incorrectly flagging a qualitative interaction effect as present) was fairly large, about 0.15.

2.2. Uncertainty measured by confidence intervals

In its current implementation, QUINT 2.0.1 provides for each leaf the difference in treatment means between the two therapy groups or, if desired, the standardized effect size, as well as the respective standard error (SE). The SE of the difference in means is calculated from the pooled standard deviation of both treatment groups and the number of subjects in each group per leaf (see the formula in Section 2.3). As a measure of uncertainty, naïve two-sample intervals can be calculated for the effect size in each leaf. Their validity, however, is questionable due to the complex procedure used to optimize the nodes of a tree (Loh, He, & Man, 2015). Because these subgroups are found without a priori hypotheses about the effects of potential moderator variables, decision trees suffer from the same weaknesses as all post-hoc procedures and can result in overfitted models or even the detection of spurious, non-existent effects (Wang, Lagakos, Ware, Hunter, & Drazen, 2007). In the case of QUINT, the naïve estimates of the standard error of the outcome differences and the related confidence intervals are also assumed to be biased, since they are derived from the partitioning procedure itself (ter Avest et al., 2019).

Common methods for estimating error in rule-based learning algorithms are resampling methods such as cross-validation and bootstrapping (Efron & Tibshirani, 1997). The accuracy of confidence intervals obtained from resampled data depends on the number of samples drawn, but also on how similar the sampled and the original distributions are (Davison & Hinkley, 2009). Compared to cross-validation, the bootstrap is slightly superior with respect to the variability of the prediction error. Furthermore, it directly delivers a measure of the variability of a point estimate (Banjanovic & Osborne, 2016; Efron & Gong, 1983) and of the probability of finding the effect again in a replication study.

Loh et al. (2015) proposed a bootstrapping approach to sample the effect distribution and obtain more precise 95% confidence intervals, which give a better indication of the uncertainty of the effect size estimate. Compared to the naïve estimates, their bootstrap leads to more accurate intervals with a higher coverage of the true values. Such an approach had not yet been implemented for QUINT.

2.3. Study 1 - The bootstrap-approach from Loh et al. (2015)

Loh et al. (2015) proposed a bootstrap approach for estimating confidence intervals for the difference in means in the leaves of interaction trees. The authors compare multiple interaction tree methods but exclude QUINT from their analyses, because their simulated models do not contain any qualitative interactions and because, at that time, QUINT was not yet able to deal with categorical variables. For the remaining models, they drew a number of bootstrap samples and grew a new tree on each sample. The assignment of the subjects was then compared to the original tree built on the entire sample. A detailed description of the procedure is given in Algorithm 1 below.

Algorithm 1: Bootstrapping Confidence Intervals (CIs)
Input: original dataset $\mathcal{L}$
Output: lower and upper boundary of the bootstrapped confidence interval

1) Construct a tree $T$ with $|T|$ leaves $t$ and a naïve estimator of the mean outcome $\hat{\mu}(t, z)$ for each treatment $z \in \{0, 1\}$ and leaf $t$ on the original data $\mathcal{L}$.

2) Draw bootstrap samples $\mathcal{L}_b^*$ ($b = 1, 2, \ldots, B$); for $b$ in $1:B$:

a. Construct a tree $T_b^*$ on this sample, with naïve estimates $\hat{\mu}_b^*(t^*, z)$ for the $|T_b^*|$ leaves $t^*$ in $T_b^*$ and each treatment $z$.

b. For all leaves $t$ of $T$ and $t^*$ of $T_b^*$, create a cross-table with rows $i$ ($i = 1, 2, \ldots, |T_b^*|$) and columns $j$ ($j = 1, 2, \ldots, |T|$), separately for each treatment $z$, so that each cell $C_{ij}$ contains the intersection count $n_z(t_j \cap t_i^*)$, i.e. the number of patients falling both in the particular leaf of the bootstrapped tree and in the particular leaf of the original tree.

c. For each cell $C_{ij}$, multiply the number of patients by the naïve estimate for leaf $t_i^*$: $n_z(t_j \cap t_i^*) \cdot \hat{\mu}_b^*(t_i^*, z)$.

d. For each leaf $t$, add up all product terms per treatment group $z$ and divide by the corresponding number of patients to obtain the estimate
$$\bar{\mu}_b^*(t, z) = \frac{\sum_{t^*} n_z(t \cap t^*)\, \hat{\mu}_b^*(t^*, z)}{\sum_{t^*} n_z(t \cap t^*)}.$$

e. For each leaf $t$, calculate the difference between the means of both treatments: $d_b^*(t) = \bar{\mu}_b^*(t, 1) - \bar{\mu}_b^*(t, 0)$.

3) Compute the sample variance $s_d^2(t)$ of $\{d_1^*(t), d_2^*(t), \ldots, d_B^*(t)\}$ and compute the 95% confidence interval of $d(t)$ as $\hat{d}(t) \pm 1.96\, s_d(t)$, where $\hat{d}(t)$ is the estimated difference of means $\hat{\mu}(t, 1) - \hat{\mu}(t, 0)$ for each leaf.

Because of the bootstrapping procedure, in which r · N observations are drawn with replacement, the number of patients counted in the cross-table for the leaves of T does not have to equal the number of patients originally assigned to leaf $t_j$: one person can appear multiple times in the bootstrap sample. The size of r is one of the manipulated factors (see Section 4.1). Note that the last step was slightly changed. Loh et al. (2015) calculate their confidence intervals using $2 \cdot s_d(t)$ instead of $1.96 \cdot s_d(t)$, but no explanation is given for this choice. We therefore decided to use the traditional value, which also ensures that the bootstrap and the naïve confidence intervals are comparable.
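To make the procedure concrete, the following is a condensed R sketch of steps 2 and 3, written against a generic, user-supplied fitting function rather than QUINT itself. The helper `assign_leaf(train, newdata)`, the column names `Y` and `T`, and the overall structure are assumptions for illustration only; the actual implementation used in this thesis is the one in Appendix A.

```r
# Condensed sketch of Algorithm 1 (bootstrap CIs for leaf-wise mean differences).
# `assign_leaf(train, newdata)` is a hypothetical helper that fits a tree on
# `train` and returns the leaf label of every row of `newdata`.
boot_leaf_sd <- function(data, assign_leaf, B = 100, r = 1, y = "Y", trt = "T") {
  leaf_orig <- assign_leaf(data, data)            # leaves of the original tree T
  leaves <- sort(unique(leaf_orig))
  d_star <- matrix(NA_real_, B, length(leaves))
  n <- nrow(data)
  for (b in seq_len(B)) {
    boot <- data[sample(n, round(r * n), replace = TRUE), ]  # bootstrap sample L_b*
    leaf_b_boot <- assign_leaf(boot, boot)        # leaves of T_b* for L_b*
    leaf_b_orig <- assign_leaf(boot, data)        # leaves of T_b* for L
    for (j in seq_along(leaves)) {
      mu_bar <- numeric(2)
      for (z in 0:1) {
        # naive leaf means of T_b*, per bootstrap leaf, for treatment z
        mu_hat <- tapply(boot[[y]][boot[[trt]] == z],
                         leaf_b_boot[boot[[trt]] == z], mean)
        # overlap counts n_z(t_j ∩ t_i*) between original leaf j and T_b* leaves
        w <- table(leaf_b_orig[leaf_orig == leaves[j] & data[[trt]] == z])
        w <- w[names(w) %in% names(mu_hat)]
        mu_bar[z + 1] <- sum(w * mu_hat[names(w)]) / sum(w)
      }
      d_star[b, j] <- mu_bar[2] - mu_bar[1]       # d_b*(t_j) = mean(z=1) - mean(z=0)
    }
  }
  # bootstrap SD per leaf; the 95% CI is then d_hat(t) +/- 1.96 * s_d(t)
  apply(d_star, 2, sd, na.rm = TRUE)
}
```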

The confidence interval of the naïve estimator for each leaf $t$ is calculated as $\hat{d}_{naive}(t) \pm 1.96\, s_{diff}(t)$, where $\hat{d}_{naive}(t)$ is the difference in treatment means calculated by QUINT for each leaf and $s_{diff}(t)$ is the corresponding standard error:

$$s_{diff}(t) = \sqrt{sd(t)_{pool}^2 \left(\frac{1}{n_1} + \frac{1}{n_2}\right)},$$

where $sd(t)_{pool}$ is the pooled standard deviation of both treatment groups in leaf $t$ and $n_1$ and $n_2$ are the respective sample sizes. The code for this approach can be found in Appendix A.
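For reference, a minimal R sketch of this naïve interval for a single leaf (the pooled SD is the standard two-sample formula; the argument names are placeholders):

```r
# Naive 95% CI for the difference in treatment means within one leaf;
# `y` are the outcomes and `trt` the 0/1 treatment labels of the leaf's subjects.
naive_ci <- function(y, trt) {
  n1 <- sum(trt == 1); n0 <- sum(trt == 0)
  d_hat   <- mean(y[trt == 1]) - mean(y[trt == 0])
  sd_pool <- sqrt(((n1 - 1) * var(y[trt == 1]) + (n0 - 1) * var(y[trt == 0])) /
                    (n1 + n0 - 2))
  s_diff  <- sd_pool * sqrt(1 / n1 + 1 / n0)
  c(lower = d_hat - 1.96 * s_diff, upper = d_hat + 1.96 * s_diff)
}
```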


Hypothesis Study 1

The bootstrap approach of Loh et al. (2015) was implemented for QUINT. It is expected that the confidence intervals obtained with this method show a higher coverage of the true value than the naïve intervals do.

3. Theoretical Framework of Study 2

3.1. Semantic vs. Structural Stability

Since it calculates the means of each patient subgroup, QUINT is clearly a regression tree approach. However, the assignment of patients to three different types of groups (P1, P2 and P3) could also be considered a classification task. It may then be interesting whether a patient is always assigned to the same class when multiple trees are built on the data. Moreover, the stability of the chosen variables and their split points can be of interest. These two stability considerations are usually referred to as the semantic and the structural stability of a decision tree (Briand, Ducharme, Parache, & Mercat-Rommens, 2009; Turney, 1995). For decision trees, semantic stability means that trees result in the same partitioning groups regardless of their structure; for example, a patient is always assigned to the same therapy concept even though the splitting variables or the split points might differ. In contrast, structurally stable trees also show the same splits, so structural stability is a sufficient criterion for semantic stability but not the other way around (Wang, Li, Yu, & Liu, 2018). However, it should be kept in mind that two trees with a different structure can still lead to an equivalent partitioning and the same interpretation (Turney, 1995).

3.2. Overview of approaches to measure stability

In recent years, multiple approaches have been proposed to measure the semantic and structural stability of tree models. A classical way to measure semantic stability in a classification task is to calculate the proportion of cases assigned to the same class when two trees are run on the same dataset (Dwyer & Holte, 2007; Turney, 1995). Slightly differently, Paul, Verleysen and Dupont (2012) introduced class prediction stability to estimate the extent to which a number of resampled datasets assign subjects to the same class. Philipp, Rusch, Hornik and Strobl (2018) propose a number of more general distance measures for quantifying the semantic stability of different predictive learning methods, independently of the resampling and evaluation method: the Euclidean distance or the Gaussian radial basis function for regression problems, and the average class agreement (ACA), the Kappa statistic and a probabilistic similarity estimation for classification algorithms.

One method which focuses only on the structure of a tree is the tree edit distance, which determines the number of steps necessary to transform one tree into another (Bille, 2005). However, such a calculation is computationally very expensive and, as mentioned before, does not take into account that trees which look different can still result in the same partitioning of the variable space. Further approaches split the tree into decision regions, consisting of single nodes or leaves, in order to determine split-point and variable-selection stability (Briand et al., 2009; Dwyer & Holte, 2007). Wang et al. (2018) introduced a fairly new measure which they call region compatibility (RC) and which is based on evidence theory. This theory was originally introduced by Dempster (1968) and Shafer (1976) and combines information from different sources into one common probability estimation, whereby the converse of the probability that an event occurs is not treated as its complement but only as a state in which we do not know whether the event occurs or not. It is therefore also known as the "theory of belief functions" (Shafer, 1990). In relation to decision trees, this means that two trees which seem to differ structurally might still have patterns in common with other trees generated from the same dataset (Wang et al., 2018). For this reason, the authors combine a structural perspective with the idea of semantic stability by assessing the patterns that the decision regions of two trees have in common, which they assume is more sensitive to instability than only testing for the equivalence of the decision regions. Comparing their RC measure to class agreement and region stability, they show that RC is superior in distinguishing between decision trees. The RC approach seems promising but has not yet been evaluated in any other study.


3.3. Study 2 - Stability Estimation

In the following sections, the resampling methods used for Study 2 are described in detail, and the two measures used for the stability estimation are outlined.

3.3.1. Resampling methods and Evaluation overlap

With regard to stability measurement for supervised statistical learning methods, Philipp et al. (2018) compared multiple resampling and evaluation methods. They recommend using the bootstrap as the resampling method because it captures stability rather well if the bootstrap sample size is chosen as around 0.9 times the original sample size; a sample with replacement of this size had a learning overlap of around 35%. In contrast, subsampling, a resampling approach without replacement, had a higher learning overlap and often led to overestimation. Therefore, bootstrapping was chosen as the resampling method instead of resampling without replacement. Furthermore, the authors showed that the larger the learning sample size, the less unstable the estimations are. Out-of-bag (OOB) and all-in (ALL) evaluation led to the least biased and least variable results. For ALL, two trees were built on two samples drawn from the original dataset and the entire dataset was used to evaluate both trees. For OOB, in each resampling run the samples were drawn from only part of the dataset and the remaining cases were held back as the evaluation set on which the two trees constructed from the two samples were compared.
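As an illustration of the two evaluation schemes, one resampling run could be set up as sketched below. The fraction r, the function name and the exact construction of the OOB set (cases that appear in neither bootstrap sample) are assumptions for illustration, not the code used for this thesis.

```r
# One resampling run for the ALL and OOB evaluation schemes (sketch).
# r is the bootstrap fraction (e.g. 1 or 0.9 times the original sample size).
one_resampling_run <- function(data, r = 0.9) {
  n <- nrow(data)
  idxA <- sample(n, round(r * n), replace = TRUE)   # bootstrap sample D_bA*
  idxB <- sample(n, round(r * n), replace = TRUE)   # bootstrap sample D_bB*
  list(
    trainA   = data[idxA, ],
    trainB   = data[idxB, ],
    eval_all = data,                                       # ALL: evaluate on the full dataset
    eval_oob = data[setdiff(seq_len(n), c(idxA, idxB)), ]  # OOB: held-back cases
  )
}
```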

3.3.2. The region compatibility measure

Wang et al. (2018) proposed two region compatibility methods as a combined measure of structural and semantic stability. Both of them, RC_ID and RC_Jac, are based on distance measures from evidence theory. The second measure (RC_Jac, simply called the region compatibility measure in the following sections), which was the more sensitive one in their analysis and had the more interesting properties, makes use of the Jaccard index as the distance measure between two trees and was therefore used for this study. Apart from the choice of similarity measure, Wang's region compatibility differs from the classical average class agreement (ACA) in that it compares the assignment of subjects not to the classes but to the single leaves. A detailed description of the implementation of the region compatibility measure can be found in Algorithm 2; the respective R code is attached in Appendix B.


Algorithm 2 – Region compatibility measure
Input: original dataset $\mathcal{L}$
Output: $RC_{Jac}$

1) Pre-define a number of bootstrap repetitions $B$ ($b = 1, 2, \ldots, B$); for $b$ in $1:B$:

a. Draw two samples $D_{bA}^*$ and $D_{bB}^*$ from the original dataset $\mathcal{L}$ and build the respective trees $T_{bA}$ and $T_{bB}$.

b. Each of the two trees yields a set of decision regions $F$. For each decision region (leaf) $R \in F$, determine the basic probability assignment $m$ by
$$m(R)_{T_b} = \begin{cases} |R| / |D| & \text{if } R \in D \\ 0 & \text{otherwise,} \end{cases}$$
where $D$ is the dataset and $|D|$ represents the sample size $N$.

c. The non-zero basic probability assignments are sorted by size in the form of a column vector $\mathbf{V}^{(b)} = [m_b(R_1), m_b(R_2), \ldots, m_b(R_{|T|})]^{tr}$, where $|T|$ is the number of leaves of the respective tree, so that $0 < m_b(R_1) < m_b(R_2) < \ldots < m_b(R_{|T|})$. All entries are compared between the two vectors $\mathbf{m}_A$ and $\mathbf{m}_B$ of, respectively, $T_{bA}$ and $T_{bB}$; entries whose regions include exactly the same subjects are excluded. The vector $\mathbf{m}_A - \mathbf{m}_B$ is created by concatenating $\mathbf{m}_A$ and $-\mathbf{m}_B$.

d. An $n \times p$ Jaccard matrix $\mathbf{W}$ is created, where the number of rows ($n$) and the number of columns ($p$) equal the length of the vector $\mathbf{m}_A - \mathbf{m}_B$, and each cell contains the Jaccard index $W_{ij}$ of the two decision regions, calculated by
$$W_{i,j} = \frac{|Elements(i) \cap Elements(j)|}{|Elements(i) \cup Elements(j)|},$$
where $i$ and $j$ denote the positions of the decision regions in the concatenated vector of $F_{T_{bA}}$ and $F_{T_{bB}}$ (identical decision regions are excluded). Correspondingly, all entries on the diagonal of $\mathbf{W}$ are 1.

e. The region compatibility index for repetition $b$ is calculated by
$$RC_{Jac,b}^* = \sqrt{(\mathbf{m}_A - \mathbf{m}_B)^{tr}\, \mathbf{W}\, (\mathbf{m}_A - \mathbf{m}_B)}.$$

2) The final estimator is
$$RC_{Jac} = \frac{\sum_{b=1}^{B} RC_{Jac,b}^*}{B}.$$

Note: In order to avoid confusion, a transposed vector or matrix is marked by 'tr', because T as well as t are already used for labelling trees and their leaves.
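A sketch of the per-pair computation in R might look as follows; the inputs `leafA` and `leafB` (the leaf labels that the two trees assign to the same evaluation cases) and all function names are assumptions for illustration, not the code of Appendix B.

```r
# RC*_Jac for one pair of trees, following Algorithm 2 (sketch).
rc_jac_pair <- function(leafA, leafB) {
  n <- length(leafA)
  regA <- split(seq_len(n), leafA)   # decision regions (leaves) of tree A
  regB <- split(seq_len(n), leafB)   # decision regions (leaves) of tree B
  # exclude regions that contain exactly the same subjects in both trees
  same <- outer(regA, regB, Vectorize(function(a, b) setequal(a, b)))
  regA <- regA[rowSums(same) == 0]
  regB <- regB[colSums(same) == 0]
  if (length(regA) + length(regB) == 0) return(0)   # identical partitions
  m <- c(lengths(regA), -lengths(regB)) / n         # concatenated m_A and -m_B
  regions <- c(regA, regB)
  # Jaccard matrix W over all remaining decision regions
  W <- outer(seq_along(regions), seq_along(regions),
             Vectorize(function(i, j)
               length(intersect(regions[[i]], regions[[j]])) /
                 length(union(regions[[i]], regions[[j]]))))
  sqrt(max(0, drop(t(m) %*% W %*% m)))              # RC*_Jac for this pair
}
```

Averaging the resulting values over the B pairs of bootstrap trees then gives the final RC_Jac of step 2.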

3.3.3. Cohen’s Kappa as measure for Class Agreement

Cohen's Kappa (Cohen, 1960) is a coefficient that measures agreement between categorical ratings. It was originally developed for the field of interrater reliability, but it can also be used for agreement between instances in general. In the context of stability measurement, Philipp et al. (2018) recommend Kappa instead of the ACA for measuring the stability of class assignment in decision trees when classes are imbalanced. Since imbalance occurs fairly often in the typical clinical dataset analysed with QUINT, we decided to use Cohen's Kappa as the measure of semantic stability. Algorithm 3 gives an overview of how Kappa was used to assess class agreement for interaction tree models.

Algorithm 3 – Cohen's Kappa
Input: original dataset $\mathcal{L}$
Output: $\kappa$

1) Pre-define a number of bootstrap repetitions $B$ ($b = 1, 2, \ldots, B$); for $b$ in $1:B$:

a. Draw two bootstrap samples $D_{bA}^*$ and $D_{bB}^*$ from the original dataset $\mathcal{L}$ and build the respective trees $T_{bA}$ and $T_{bB}$.

b. Create a congruence matrix $\mathbf{C}$ containing the intersections of the assignments of the subjects to the classes ($P_1$, $P_2$, $P_3$) by the two trees when they are applied to the same evaluation dataset. The number of rows and columns equals the number of classes $|P|$.

c. $p_0$ is the proportion of observed agreement and is determined by adding all cells on the diagonal, which contain the concordant cases, and dividing by the total number of cases $N$:
$$p_0 = \frac{\sum diag(\mathbf{C})}{N}$$

d. The proportion of chance agreement is calculated in two sub-steps: for each class, the row and column margins are multiplied, and these products are summed over the classes and divided by $N^2$:
$$p_e = \frac{1}{N^2}\sum_{i=1}^{|P|}\left(\sum_{j} C_{ij} \cdot \sum_{j} C_{ji}\right)$$

e. Calculate the coefficient Kappa for this pair of trees:
$$\kappa_b^* = \frac{p_0 - p_e}{1 - p_e}$$

2) Average across all pairs of bootstrap samples to obtain the final coefficient:
$$\kappa = \frac{\sum_{b=1}^{B} \kappa_b^*}{B}$$
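A compact R sketch of the per-pair Kappa computation; the inputs `classA` and `classB` (the class labels the two trees assign to the same evaluation cases) are assumed:

```r
# Cohen's Kappa for one pair of trees, following Algorithm 3 (sketch).
kappa_pair <- function(classA, classB) {
  lev <- union(unique(classA), unique(classB))      # classes P1, P2, P3
  C   <- table(factor(classA, levels = lev), factor(classB, levels = lev))
  N   <- sum(C)
  p0  <- sum(diag(C)) / N                           # observed agreement
  pe  <- sum(rowSums(C) * colSums(C)) / N^2         # chance agreement
  (p0 - pe) / (1 - pe)
}
```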

Dusseldorp and van Mechelen (2015) also evaluated the quality of their models by the true assignment of subjects to the three partitioning classes, using Cohen's Kappa. The size of the Kappa coefficient depended on the effect size, the sample size and the complexity of the model. For a very large effect, the other two factors hardly mattered; for a medium-sized effect and a tree model with more than one split, Kappa varied between .13 and .67 depending on the sample size.

The categorisation of Kappa values as good or not good is somewhat ambiguous. Landis and Koch (1977) propose a guideline in which values between 0 and 0.20 indicate slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement and 0.81–1.00 almost perfect agreement. In contrast, Fleiss (1973) classifies values below 0.40 as poor, values between 0.40 and 0.75 as fair to good and all values above 0.75 as excellent. Nevertheless, Gwet (2010) criticises these guidelines as arbitrary and not very helpful. The classification approaches were therefore used only as a rough orientation for our values.


Hypotheses Study 2

Cohen's Kappa and the region compatibility measure of Wang et al. (2018) are both calculated for models with different levels of stability, manipulated via noise and sample size. The assumption is that the region compatibility measure of Wang et al. (2018) is more sensitive to instability than Kappa and therefore better reflects the reliability of a tree model. It is expected that the RC measure captures structural instability, such as a different number of leaves, different leaf classes, different splitting variables and different split points, better than Kappa does, and thus reacts more clearly to these structural characteristics of the tree models.

4. Monte Carlo Simulation Study

A number of Monte Carlo simulations were conducted to answer the research questions. Two separate and independent simulations were run: one for Study 1, whose main objective was to implement a new bootstrapping procedure for QUINT, and one for Study 2, which aimed to assess the stability of QUINT trees.

4.1. Study 1

In the first simulation study, the implemented methods were compared for two true tree models. True model 1 was based on model b from the paper of Dusseldorp and van Mechelen (2014); the other was taken from Sies and van Mechelen (2017) and slightly adapted by Chen (2019) so that it included all three, instead of only two, partitioning classes of QUINT.

The structure of true model 1 can be seen in Figure 1. It contains two splitting variables, X2 and X4, with one split point each, resulting in three leaves, one of each possible class. The artificial datasets contained N simulated observations of a continuous outcome variable Y and a binary treatment variable T. Four of the five covariates, X1, X2, X4 and X5, were sampled from a multivariate normal distribution with the means fixed at (10, 30, -40, 70). For X3, the mean was drawn from a discrete uniform distribution on the interval [-70, 70]. The standard deviation of all covariates was fixed at 10. The distribution of the outcome variable Y was based on the true binary tree structure of the model: for each leaf and treatment condition it was normally distributed with an SD of 5. The choice of the split points (see Figure 1) leads to an imbalanced assignment of the subjects to the leaves.
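A data-generation sketch along these lines is shown below; the split points and the leaf- and treatment-specific means of Y are taken from Figure 1 in the thesis and are therefore represented here only by hypothetical placeholder values.

```r
# Sketch of the data generation for true model 1.
library(MASS)  # mvrnorm

simulate_model1 <- function(N, leaf_means, split_x2, split_x4) {
  # covariates X1, X2, X4, X5: multivariate normal, SD = 10, given means
  X <- mvrnorm(N, mu = c(10, 30, -40, 70), Sigma = diag(10^2, 4))
  colnames(X) <- c("X1", "X2", "X4", "X5")
  # X3: mean drawn once from a discrete uniform on [-70, 70], SD = 10
  X3 <- rnorm(N, mean = sample(-70:70, 1), sd = 10)
  Tr <- rbinom(N, 1, 0.5)                       # binary treatment
  dat <- data.frame(X, X3 = X3, T = Tr)
  # true leaf membership from the splits on X2 and X4 (placeholder split points)
  leaf <- ifelse(dat$X2 <= split_x2, "P1",
                 ifelse(dat$X4 <= split_x4, "P2", "P3"))
  # outcome: leaf- and treatment-specific means (placeholders), noise SD = 5
  dat$Y <- rnorm(N, mean = leaf_means[cbind(leaf, Tr + 1)], sd = 5)
  dat
}

# Example call with purely hypothetical leaf means and split points:
lm_mat <- matrix(c(3, 0, -3, -3, 0, 3), nrow = 3,
                 dimnames = list(c("P1", "P2", "P3"), c("1", "2")))
dat <- simulate_model1(N = 500, leaf_means = lm_mat, split_x2 = 30, split_x4 = -40)
```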

The second model was constructed based on the original model from Sies and van Mechelen (2017):

$$\mu(A, X) = 1.0 + 0.25X_1 + 0.25X_2 - 0.25X_5 - c[A - g^{opt}(X)]^2,$$

where $c$ is the effect size and the true optimal treatment regime is $g^{opt}(X) = I(X_1 > -0.433)\, I(X_2 < 0.219)$. Chen (2019) adjusted this function slightly to introduce a leaf with the indifferent treatment group of QUINT by adding $g^{P3}$ to the formula:

$$\mu(A, X) = 1.0 + 0.25X_1 + 0.25X_2 - 0.25X_5 - c[1 - g^{P3}(X)][A - g^{opt}(X)]^2,$$

where $g^{P3}(X) = I(X_1 > -0.433)\, I(X_2 \geq 0.219)$.

In contrast to model 1, this model aims to distribute the same number of patients to each of the three leaves representing the classes P1, P2 and P3.
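Written out in R, the adapted mean function is a direct transcription of the formulas above (A is the 0/1 treatment indicator and c the effect size):

```r
# Indicator functions for the optimal regime and the indifferent (P3) leaf
g_opt <- function(X1, X2) as.numeric(X1 > -0.433 & X2 <  0.219)
g_P3  <- function(X1, X2) as.numeric(X1 > -0.433 & X2 >= 0.219)

# Adapted mean function of model 2 (Chen, 2019)
mu_model2 <- function(A, X1, X2, X5, c) {
  1.0 + 0.25 * X1 + 0.25 * X2 - 0.25 * X5 -
    c * (1 - g_P3(X1, X2)) * (A - g_opt(X1, X2))^2
}
```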

For the original tree T, two parameters of the QUINT function were adjusted. Instead of the default partitioning criterion (Cohen's d), crit = 'dm' (difference in means) was chosen to obtain the unstandardized difference in treatment means. Furthermore, the control function was used to limit the number of leaves to three. This ensures that the calculated values can be compared with the true model.
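In the quint R package, these settings correspond to a call along the following lines; the formula interface (outcome ~ treatment | covariates) and the control arguments crit and maxl follow the package documentation, while the data object and variable names refer to the simulated data sketched above.

```r
library(quint)

# Difference in means as partitioning criterion, at most three leaves
ctrl <- quint.control(crit = "dm", maxl = 3)
fit  <- quint(Y ~ T | X1 + X2 + X3 + X4 + X5, data = dat, control = ctrl)
summary(fit)
```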

Furthermore, the following factors were fully crossed:

True model (M): the two models described above.

Effect size in standardized mean differences (es): d = 0.5 (medium), d = 0.8 (large), d = 1.0 (very large).

Overall sample size (N): in line with realistic datasets, the overall sample size was chosen as small, medium or large: N = 200, N = 500 or N = 1000 (cf. Table 1).


In addition, the bootstrap sample size r and the related overlap between the original sample and the training sample were varied to examine their influence on the reliability of the results. If the bootstrap sample is as large as the original sample (r = 1), the average overlap is 40% (Philipp, Rusch, Hornik, & Strobl, 2018). The variation of r resulted in three different methods: r = 1 for the first, r = 0.9 for the second and r = 0.8 for the third method. The number of replications for each design combination was k = 50. In total, 2 x 3 x 3 = 18 different design combinations were examined per method.

Following Loh et al. (2015), 100 bootstrap samples were drawn for each scenario in order to obtain a stable distribution. Inspired by Dusseldorp and van Mechelen (2014), no small effect size was included because QUINT has difficulty detecting these. Regarding the sample size, a rather small size was compared with larger ones; it is assumed that the small sample size causes some instability, whereas a size of 1000 should lead to an almost perfectly stable model (Dusseldorp & van Mechelen, 2014).

The results were evaluated using the width of the confidence intervals and the coverage of the true effect size, comparing the naïve intervals with the confidence interval boundaries obtained from the bootstrapping procedure. Additionally, it was assessed whether subjects were assigned to the same partition class, for instance whether subject 1 was assigned to a red leaf in both trees.

4.2. Study 2

To examine the stability of the tree models, model 1 from Study 1 was chosen as the relatively stable base model. In addition, two more models were used which are assumed to be more unstable due to manipulations of the data-generating process. For both of them, noise was added to the data, with a different noise-to-signal proportion to make one more unstable than the other. That is, randomly generated noise was added to a particular portion of the data, using the signal-to-noise ratio of Hastie, Tibshirani, and Friedman (2017),

$$\frac{Var(f(X))}{Var(\varepsilon)},$$

where $Y = f(X) + \varepsilon$ is the given model. Accordingly, normally distributed random error was simulated for all dataset columns, its variance being used to compute the signal-to-noise ratio shown above; finally, the error terms were added to the original data. Again, the partitioning criterion was set to crit = 'dm'.
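One way to implement this is sketched below, assuming that the quoted ratios of 0.1 and 0.2 refer to the proportion Var(ε)/Var(f(X)) per column; this interpretation is an assumption, since the text leaves the direction of the ratio ambiguous.

```r
# Add column-wise Gaussian noise whose variance is `ratio` times the variance
# of the original column (ratios 0.1 and 0.2 correspond to models 2 and 3).
add_noise <- function(data, cols, ratio) {
  for (v in cols) {
    eps <- rnorm(nrow(data), mean = 0, sd = sqrt(ratio * var(data[[v]])))
    data[[v]] <- data[[v]] + eps
  }
  data
}
```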


Additionally, the following two design factors were fully crossed:

True model: the stable model 1, model 2 with a signal-to-noise ratio of 0.1 (slightly unstable), and model 3 with a signal-to-noise ratio of 0.2 (unstable). The correlation between the datasets for models 1 and 2 was around .95 and for models 1 and 3 around .91.

Overall sample size: small, medium and large samples: N = 200 (100 per group), N = 500 (250 per group), N = 1000 (500 per group).

Since Study 1 had shown that the effect size (es) only had a minor influence on the results, the standardized difference of means was fixed at 0.8. Again, the bootstrap sample size was varied using two values of r (1 and 0.9). Additionally, two different evaluation procedures were used: out-of-bag (OOB) and all in-sample (ALL). In the case of OOB, the evaluation set consisted of the cases which were not part of the bootstrap samples. Overall, this resulted in four different methods and 3 x 3 = 9 factor combinations. The number of replications for each design cell was k = 90. For Study 2, the overall sample size might also be a cause of instability, regardless of the manipulation of the data. In determining the amount of noise we followed Dietterich (2000), who examined the effect of noisy data on bagged and boosted ensemble decision trees.

To evaluate the results, we checked how well the instability of the tree is identified by each stability measure, that is, whether the measure changes depending on the data manipulation and the sample size, which are assumed to cause instability. Furthermore, the correlation between the two measures was investigated. In addition, we related the size of the coefficients to structural indicators such as the number of leaves, the partitioning class of the leaves, the chosen split variables and the split points. We expected higher structural congruence to be reflected at least by the region compatibility measure.


5. Results

5.1. Evaluation of Study 1

To evaluate the results of Study 1, three measures were taken as dependent variables: firstly, the width of the confidence intervals for each of the four methods; secondly, the coverage of the true model values for each leaf; and thirdly, whether the true group effect was found, i.e., whether the related colour of the leaf would have been chosen correctly based on the confidence intervals. The latter means that a leaf was correctly specified as green if the lower boundary of the confidence interval was greater than zero, so that treatment group 1 responded better; as grey if zero was included, so that there was no difference between the two treatment groups; and as red if the upper boundary was below zero, so that a negative effect was seen. For each of the three dependent variables, a mixed-design ANOVA was run with two repeated-measures factors, method and leaf, and with sample size, effect size and model as between factors. Leaf was added as a factor to examine whether there were differences depending on the type of expected group difference. A Helmert contrast for the within factor method was specified to test the naïve method against the bootstrap methods. The factors and interactions with a large partial eta squared (> 0.06) are explored in more detail. To compare the data of both models, the results of model 1 were divided by its SD of 5 so that they are on the same scale as model 2. Additionally, the leaves were ordered in the same way, so that potential differences are not caused by a different direction of the effect.
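The colour rule for the third dependent variable can be written as a simple check per leaf; a minimal sketch:

```r
# Classify a leaf by its confidence interval for the difference in means
classify_leaf <- function(lower, upper) {
  if (lower > 0)      "green"   # treatment 1 superior
  else if (upper < 0) "red"     # treatment 2 superior
  else                "grey"    # zero included: no group difference
}
```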

5.1.1. Effect of method

Overall, clear discrepancies could be seen between the four methods. Considering only the main effect of method, the naïve intervals (mean width = 0.58) were narrower than the bootstrapped intervals (mean width = 1.45), and the factor method explained most of the variance (partial η² = 0.94). Somewhat smaller but still substantial was the influence of method on the coverage of the true value (partial η² = 0.58): the naïve intervals covered the true model value on average approximately 60% of the time, compared to 88% when a bootstrap approach was used. In contrast, the correct detection of the group effect was only marginally influenced by method (partial η² = 0.012), with the naïve method having a slightly higher value (75% vs. 73%).


The Helmert contrast which tested the naïve method against all three bootstrapped methods was clearly significant. If the naïve approach was excluded from the analysis, the importance of the factor method shrank considerably for width (partial η² = 0.32) and coverage (partial η² = 0.02) but not for the group differences (partial η² = 0.008). This indicates that the three bootstrap methods, which differ only in sample size and in the related learning overlap with the original sample, led to similar results.

5.1.2. Effects of other factors

Part of the variation of the interval width was explained by the sample size (partial η² = 0.84) and the model (partial η² = 0.68), as well as by their interaction (partial η² = 0.16), whereas the effect size did not have a substantial influence. The model × sample size interaction did not disappear when the naïve approach was excluded from the analysis (partial η² = 0.23). The coverage of the true value was slightly influenced by the sample size (partial η² = 0.08) and the effect size (partial η² = 0.07), regardless of the inclusion or exclusion of the naïve approach. The same applied to the specification of the group effect, which showed a moderate influence of all three between factors with and without the naïve method. An overview is given in Table 1.

Table 1 Overview of the influence of model choice and sample size on the dependent variables

| Model | N | Naive width | Naive cov | Naive group | Boot 1 width | Boot 1 cov | Boot 1 group | Boot 0.9 width | Boot 0.9 cov | Boot 0.9 group | Boot 0.8 width | Boot 0.8 cov | Boot 0.8 group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 200 | 0.63 | 0.29 | 0.58 | 2.54 | 0.82 | 0.51 | 2.62 | 0.83 | 0.51 | 2.73 | 0.85 | 0.50 |
| 1 | 500 | 0.38 | 0.42 | 0.69 | 1.66 | 0.89 | 0.63 | 1.70 | 0.88 | 0.62 | 1.77 | 0.90 | 0.61 |
| 1 | 1000 | 0.26 | 0.51 | 0.81 | 1.12 | 0.94 | 0.84 | 1.18 | 0.94 | 0.83 | 1.24 | 0.95 | 0.82 |
| 2 | 200 | 1.09 | 0.66 | 0.66 | 1.48 | 0.76 | 0.64 | 1.52 | 0.78 | 0.63 | 1.59 | 0.80 | 0.62 |
| 2 | 500 | 0.66 | 0.76 | 0.81 | 0.92 | 0.85 | 0.83 | 0.95 | 0.86 | 0.83 | 1.00 | 0.87 | 0.82 |
| 2 | 1000 | 0.45 | 0.88 | 0.93 | 0.63 | 0.94 | 0.94 | 0.66 | 0.94 | 0.95 | 0.71 | 0.95 | 0.94 |

Note: 'width' = mean CI width, 'cov' = coverage of the true value, 'group' = correct identification of the group effect (the latter two as proportions). For model 1 the mean width was divided by the SD of 5 so that it is on the same scale as model 2.

Not surprisingly, an increase in sample size led to smaller confidence intervals, a higher coverage and a more reliable identification of the true group effect. For the interval width, sample size also showed a considerable three-way interaction with model and method (partial η² = 0.38). For the naïve method, the differences between the models were striking: model 2 showed larger intervals and therefore higher coverage and more correctly specified group differences. These differences diminished, or rather reversed, for the bootstrapped methods: there, the confidence intervals were larger for model 1, resulting in a slightly higher coverage of the true values for smaller samples, with the models becoming more alike as the sample size grew. The correct identification of the group effect was still higher in model 2. The interaction of method and model was substantial for the width of the CIs (partial η² = 0.86) and for the coverage (partial η² = 0.39); both interactions disappeared when the naïve method was excluded.

5.1.3. The role of the leaves

In addition, the width of the intervals differed between the leaves (partial η² = 0.24) and, more importantly, the interaction between the two within factors seemed to play a primary role (partial η² = 0.24). If the naïve method was left out, the partial η² of leaf even increased slightly to 0.26, but the interaction between method and leaf disappeared. With respect to coverage, the factor leaf only had a minor influence (partial η² = 0.05, or 0.06 if the naïve method was excluded), but it had a substantial influence on the correct identification of the group effect (partial η² = 0.09). Also, the interaction between method and leaf showed a partial η² of 0.31. Taking only the three bootstrapping methods, the partial η² of leaf increased to 0.17, but again the interaction was no longer substantial.

A more detailed overview is given in Tables 2a and 2b. For the interval width and the detection rate of the true group effect, the three-way interaction of leaf, method and model showed a partial η² of 0.17 and 0.21, respectively, but only when the naïve method was part of the analysis. If not, the interaction between leaf and model still had a partial η² of 0.20 (for the interval width) and 0.13 (for the detection rate).

Table 2a Comparison of the leaves for model 1 between the different methods

| Method | Leaf 1 width | Leaf 1 cov | Leaf 1 group | Leaf 2 width | Leaf 2 cov | Leaf 2 group | Leaf 3 width | Leaf 3 cov | Leaf 3 group |
|---|---|---|---|---|---|---|---|---|---|
| naive | 0.32 | 0.42 | 0.85 | 0.41 | 0.38 | 0.38 | 0.53 | 0.43 | 0.86 |
| boot 1 | 1.14 | 0.93 | 0.78 | 1.93 | 0.79 | 0.79 | 2.25 | 0.93 | 0.40 |
| boot 0.9 | 1.22 | 0.93 | 0.78 | 1.98 | 0.80 | 0.80 | 2.30 | 0.93 | 0.39 |
| boot 0.8 | 1.25 | 0.94 | 0.76 | 2.05 | 0.82 | 0.82 | 2.44 | 0.94 | 0.34 |


Table 2b Comparison of the leaves for model 2 between the different methods

| Method | Leaf 1 width | Leaf 1 cov | Leaf 1 group | Leaf 3 width | Leaf 3 cov | Leaf 3 group | Leaf 2 width | Leaf 2 cov | Leaf 2 group |
|---|---|---|---|---|---|---|---|---|---|
| naive | 0.73 | 0.85 | 0.85 | 0.73 | 0.74 | 0.74 | 0.75 | 0.71 | 0.81 |
| boot 1 | 0.95 | 0.93 | 0.81 | 1.01 | 0.81 | 0.81 | 1.06 | 0.81 | 0.79 |
| boot 0.9 | 0.99 | 0.94 | 0.80 | 1.05 | 0.82 | 0.82 | 1.10 | 0.82 | 0.78 |
| boot 0.8 | 1.04 | 0.95 | 0.78 | 1.11 | 0.84 | 0.84 | 1.15 | 0.84 | 0.77 |

Note: The leaf order was changed to make both models comparable.

As can be seen, for model 1 there were obvious differences between the naïve method and the bootstrapped methods on all three dependent variables, which were consistent for width and coverage but not for the identification of the true group effect. Whereas few discrepancies can be seen for the first (green) leaf, the methods differed substantially for the leaf where no group effect existed (grey) and for the leaf where treatment 2 was superior to treatment 1, which, in the imbalanced model 1, contained only a few of the patients. It seems that the naïve method overestimated the non-existent effect in leaf 2, whilst the bootstrapped methods did not perform well in identifying the true group effect in leaf 3. Interestingly, these differences disappeared for a balanced model such as model 2. The naïve method was still slightly superior in identifying the true group effects for the green and the red leaf, whereas the bootstrapped methods were marginally more conservative, as indicated by a higher identification rate of the non-existent effect in the grey leaf.

As mentioned before, the main effect of method decreased when the naïve method was excluded but did not disappear completely, indicating that the three bootstrapping methods were fairly similar. To examine whether one of them would be preferable to the others depending on sample characteristics or expected effect size, Table 3 gives a brief overview.


Table 3 Comparison of the three bootstrapping approaches in relation to effect and sample size

| Effect size | N | Boot 1 width | Boot 1 cov | Boot 1 group | Boot 0.9 width | Boot 0.9 cov | Boot 0.9 group | Boot 0.8 width | Boot 0.8 cov | Boot 0.8 group |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.5 | 200 | 1.98 | 0.73 | 0.45 | 2.05 | 0.75 | 0.44 | 2.14 | 0.79 | 0.47 |
| 0.8 | 200 | 2.03 | 0.80 | 0.58 | 2.10 | 0.81 | 0.58 | 2.18 | 0.83 | 0.56 |
| 1.0 | 200 | 2.01 | 0.85 | 0.69 | 2.07 | 0.85 | 0.68 | 2.16 | 0.86 | 0.65 |
| 0.5 | 500 | 1.30 | 0.77 | 0.54 | 1.34 | 0.79 | 0.55 | 1.39 | 0.80 | 0.53 |
| 0.8 | 500 | 1.31 | 0.90 | 0.78 | 1.34 | 0.89 | 0.78 | 1.40 | 0.91 | 0.78 |
| 1.0 | 500 | 1.25 | 0.94 | 0.86 | 1.29 | 0.94 | 0.86 | 1.37 | 0.94 | 0.84 |
| 0.5 | 1000 | 0.92 | 0.87 | 0.75 | 0.97 | 0.88 | 0.75 | 1.00 | 0.89 | 0.73 |
| 0.8 | 1000 | 0.87 | 0.96 | 0.94 | 0.92 | 0.97 | 0.94 | 0.96 | 0.98 | 0.94 |
| 1.0 | 1000 | 0.84 | 0.98 | 0.99 | 0.88 | 0.98 | 0.99 | 0.96 | 0.98 | 0.97 |

The ANOVA without the naïve method revealed that the interaction effects with method disappeared. Bootstrap 0.8, which draws the smallest sample from the original dataset, leads to slightly wider confidence intervals and higher coverage. However, this comes with the disadvantage of a marginally lower detection rate of the true group effect. Overall, all three bootstrap methods reacted in the same way to an increase in sample size and effect size and resulted in extremely similar values.

5.1.4. Discussion

With regard to the results of Study 1, we conclude that the choice of method influenced the results considerably. The hypothesis was that the bootstrapped methods lead to a higher coverage of the true model value and are therefore preferable to the currently implemented naïve method. Indeed, all three bootstrapping methods performed better than the naïve method. However, a higher coverage comes with wider confidence intervals, which might cause a slightly lower overall detection of the true group effect, especially for smaller sample sizes, which are assumed to produce a more unstable model. Additionally, an imbalanced grouping also decreases the detection rate, resulting in an underestimation of the truly existing effect for the bootstrapping approaches and in an overestimation of a non-existent effect for the naïve intervals. For all results it should be noted that the number of replications was only k = 50; the precision of the numbers given in the tables should therefore be interpreted with caution, especially for very small differences. To explore potential differences between the three bootstrap methods in detail, the study would have to be repeated with an increased number of replications. Nevertheless, it could clearly be seen that the bootstrapped confidence intervals can add valuable additional information to the model output of QUINT and can increase the trust in the results. At the same time, they also make the user aware of potential weaknesses of the dataset and point to effects that should be interpreted rather conservatively. Loh et al. (2015), who originally introduced the algorithm that was also used for this study, compared different interaction tree algorithms, but not QUINT. They found coverages between 0.82 and 0.92 for naïve intervals and between 0.93 and 0.96 for their bootstrapped intervals. This is around the magnitude we also found for our models: even a bit higher under 'perfect' conditions (a large sample size and difference in means) and slightly lower under worse conditions. The results also correspond to the paper of Dusseldorp and van Mechelen (2014): a larger sample size generally leads to more stable results and smaller Type I and Type II errors, although in the original paper this refers to the detection of an overall present disordinal interaction and not to individual leaves, as was done in this study.

5.2. Evaluation of Study 2

For study 2 the two stability indices were examined with respect to how they react to changes in the resampling procedure as well as to the amount of noise in the data and the sample size. Four different methods were used. Method 1 used a bootstrap sample size equal to the original sample size, and Method 2 a bootstrap sample size of only 0.9 times the original sample size. Both were evaluated with the ALL resampling evaluation method (section 3.3.1). Methods 3 and 4 were evaluated with the OOB approach, with the same bootstrap sizes as Methods 1 and 2.

The overall correlation between Kappa and the region compatibility index varied between r = -.47 and r = -.49 depending on the method, indicating that both are related but still measure different properties of the QUINT trees. For both indices a 2 x 2 mixed-design ANOVA was conducted, with the resampling procedure (ALL or OOB) and the learning overlap with the original sample (i.e., the size of the bootstrap sample) as within factors, and sample size and model as between factors; these two within factors define the four methods. The baseline model was model 1 from study 1, which was further manipulated by adding either 0.1% or 0.2% noise to the data.
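The four methods thus differ only in two choices: the fraction of the original sample size drawn into the bootstrap learning sample and whether the stability indices are evaluated on all original observations (ALL) or only on the out-of-bag observations (OOB). A minimal sketch of how such learning and evaluation indices could be drawn is given below; the function name and return structure are illustrative assumptions and are not taken from an existing package.

# Sketch: draw a bootstrap learning sample of size frac * n and return the
# indices used for evaluating stability under the ALL or OOB scheme.
draw_resample <- function(n, frac = 1, eval = c("ALL", "OOB")) {
  eval  <- match.arg(eval)
  learn <- sample(n, size = round(frac * n), replace = TRUE)
  test  <- if (eval == "ALL") seq_len(n) else setdiff(seq_len(n), learn)
  list(learn = learn, test = test)
}

# The four methods of study 2 as combinations of the two choices:
method1 <- draw_resample(1000, frac = 1.0, eval = "ALL")
method2 <- draw_resample(1000, frac = 0.9, eval = "ALL")
method3 <- draw_resample(1000, frac = 1.0, eval = "OOB")
method4 <- draw_resample(1000, frac = 0.9, eval = "OOB")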


5.2.1. Missing Data

Missing data arose from two sources. First, each row (condition) of the design matrix was supposed to be replicated 90 times. However, results from 68 replications were missing, most for row 9 (k = 24) and fewest for rows 1 and 3 (k = 3). No pattern was detected in the missing replications.

Second, a few missing values occurred for the region compatibility measure, but only for Methods 3 and 4. On average, 0.60 values were missing per 50 bootstrap samples for Method 3 and 0.57 for Method 4, i.e., less than one replication each. A closer look revealed that the missing values only occurred for the smallest sample size. Presumably the reason was an implementation error, which is discussed in more detail in section 5.2.4. Since the proportion of missing values was very small, it was decided to run all analyses on the remaining data.

5.2.2. Effects of the Resampling Procedure and Learning Overlap

For both stability measures a slight effect of the within factors was visible in the ANOVA. For the region compatibility, the effect size partial η² was 0.11 for the learning overlap and 0.10 for the resampling procedure. For Kappa the effect of the size of the bootstrap sample was likewise somewhat larger (partial η² = 0.14) than that of the evaluation method (0.13). The averaged measures ranged between 0.49 and 0.51 for the region compatibility, with minimally lower values for Methods 3 and 4. Kappa varied between 0.22 and 0.25, decreasing from Method 1 to Method 4. However, the differences between the mean coefficients were so small that a large influence of the method seems unlikely. None of the between factors showed any considerable influence on the RC index. In contrast, both model (partial η² = 0.16) and sample size (partial η² = 0.41) influenced the size of the Kappa coefficient significantly, and the interaction of these two factors also contributed slightly (partial η² = 0.11). Table 4 shows the described effects in more detail.
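The partial η² values reported above are ratios of sums of squares, SS_effect / (SS_effect + SS_error). As a simplified illustration, the sketch below computes such values for a purely between-subjects ANOVA on artificial data; it does not reproduce the mixed design that was actually used, and all variable names are placeholders.

# Sketch: partial eta squared from the sums of squares of an ANOVA table.
set.seed(3)
dat <- data.frame(kappa  = rnorm(180),
                  sample = factor(rep(c(200, 500, 1000), each = 60)),
                  model  = factor(rep(1:3, times = 60)))
tab <- summary(aov(kappa ~ sample * model, data = dat))[[1]]
ss  <- tab[["Sum Sq"]]
partial_eta2 <- ss[1:3] / (ss[1:3] + ss[length(ss)])  # each effect against the residual SS
names(partial_eta2) <- rownames(tab)[1:3]
partial_eta2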


Table 4 Overview of the two stability measures region compatibility (RC) and Cohen's Kappa, dependent on the resampling method

               Method 1       Method 2       Method 3       Method 4
Model  Sample  RC    Kappa    RC    Kappa    RC    Kappa    RC    Kappa
1      200     0.51  0.18     0.51  0.16     0.51  0.17     0.51  0.16
1      500     0.50  0.27     0.51  0.26     0.50  0.25     0.50  0.24
1      1000    0.46  0.46     0.47  0.43     0.46  0.42     0.47  0.39
2      200     0.50  0.17     0.51  0.16     0.50  0.16     0.50  0.16
2      500     0.51  0.20     0.51  0.20     0.50  0.20     0.50  0.19
2      1000    0.50  0.35     0.50  0.33     0.49  0.33     0.50  0.31
3      200     0.50  0.16     0.51  0.15     0.50  0.16     0.50  0.15
3      500     0.52  0.19     0.52  0.18     0.50  0.17     0.51  0.17
3      1000    0.51  0.25     0.52  0.24     0.50  0.24     0.50  0.23
Note: Method 1: ALL + bootstrap sample size = 1; Method 2: ALL + bootstrap sample size = 0.9; Method 3: OOB + bootstrap sample size = 1; Method 4: OOB + bootstrap sample size = 0.9

It can be seen that Kappa reaches its highest values for the most stable model (model 1) and the largest sample size. It furthermore tends to be slightly higher for Method 1 than for the other methods. Methods 2 and 4 both draw bootstrap samples of only 0.9 times the original sample size and were therefore expected to yield more diverse trees; this is not confirmed by the actual data. RC, on the other hand, only shows a slight reaction to the sample size for model 1. No differences can be seen for any other factor combination.
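For reference, the Kappa reported in Table 4 is the chance-corrected agreement between the partition classes that two trees assign to the same observations. A self-contained sketch of this computation is given below; the class vectors are made-up examples and not output of the actual analyses.

# Sketch: Cohen's Kappa for the agreement between the partition classes
# assigned to the same observations by two (bootstrap) trees.
cohens_kappa <- function(class1, class2) {
  lev <- union(unique(class1), unique(class2))
  tab <- table(factor(class1, lev), factor(class2, lev))
  p   <- tab / sum(tab)
  po  <- sum(diag(p))                   # observed agreement
  pe  <- sum(rowSums(p) * colSums(p))   # agreement expected by chance
  (po - pe) / (1 - pe)
}

# Hypothetical class assignments of two trees for the same ten subjects:
t1 <- c(1, 1, 2, 3, 2, 2, 1, 3, 3, 1)
t2 <- c(1, 2, 2, 3, 2, 1, 1, 3, 2, 1)
cohens_kappa(t1, t2)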

5.2.3. Structural Stability

To get an idea of how the indices reacted to structural similarities between the trees and how stable the models were overall, four structural measures were extracted. For each bootstrap run it was recorded whether the two trees had the same number of leaves after pruning (cond1) and, if so, whether these leaves also had the same class (cond2); cond2 is therefore a stricter subset of cond1. Additionally, it was checked whether the two trees shared the same first two split variables (cond3) and, if so, whether they also had the same split points (cond4); again, cond4 is a stricter subset of cond3. On average, 13.4 out of 50 pairs of models fulfilled cond1, but only 3.1 fulfilled cond2. Around 6 pairs shared the same first two splitting variables, and only about 1% of all pairs also shared the same split points. Assuming that recovering the first two split variables is a fairly good indicator of whether the true model was found, even if more splits follow than in the original model, both the overall detection rate and the structural stability appear to be fairly low. Moreover, the inspection of the structural indices confirmed the findings regarding the influence of the between factors: both sample size and model choice influenced the probability of drawing two datasets that lead to structurally similar trees. The details can be seen in Figures 2a and 2b.

Figure 2a and 2b Overview of the influence of the between factors on the structural indices

As can be seen, both factors influenced the structural measures. The number of leaves itself was not a good indicator of model stability because it says nothing about the assignment of subjects; accordingly, no differences could be seen for cond1. The equality of the leaf classes, in contrast, was influenced by both factors, occurring more often for more stable models, that is, a larger sample size and a noise-free model. The same applies to cond3 and cond4, even though the overall occurrence of cond4 was so small that the differences can barely be seen.
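To illustrate how cond1 to cond4 can be derived for a single pair of pruned trees, the sketch below uses a small summary of each tree (number and class of the leaves, first two split variables and split points). This summary structure and the variable names in it are hypothetical stand-ins for the information extracted from the fitted QUINT objects, not the objects themselves.

# Sketch: checking cond1 to cond4 for one pair of pruned trees.
tree_a <- list(leaf_classes = c(1, 2, 3),
               split_vars   = c("duration", "age"),
               split_points = c(12.5, 46))
tree_b <- list(leaf_classes = c(1, 2, 2),
               split_vars   = c("duration", "age"),
               split_points = c(12.0, 46))

cond1 <- length(tree_a$leaf_classes) == length(tree_b$leaf_classes)          # same number of leaves
cond2 <- cond1 && all(tree_a$leaf_classes == tree_b$leaf_classes)            # ... and same leaf classes
cond3 <- all(tree_a$split_vars[1:2] == tree_b$split_vars[1:2])               # same first two split variables
cond4 <- cond3 && all(tree_a$split_points[1:2] == tree_b$split_points[1:2])  # ... and same split points
c(cond1 = cond1, cond2 = cond2, cond3 = cond3, cond4 = cond4)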

For each of these subsets the average RC and Kappa coefficients were also calculated; the standard deviations are given in the form of error bars. Figure 3 displays the results in more detail. Both coefficients react to the structural differences, Kappa more clearly than the RC measure if the difference in absolute values is taken as a criterion. The error bars are relatively large, which might be due to the small number of cases. The somewhat smaller error bars for the RC measure could hint that this coefficient is limited in its general range (especially upwards). Moreover, the results give an idea of which threshold values can be expected for RC (minimum value) and for Kappa (maximum value) in the case of almost perfectly stable QUINT trees.

Figure 3 Reaction of the two stability indices to structural differences

5.2.4. Discussion

Study 2 aimed to examine two different stability indices: Kappa, which only measures semantic stability, and the region compatibility coefficient from Wang et al. (2018), which is, according to the authors, sensitive to structural differences between trees. The averaged Kappa reached its maximum for the stable model and the large sample size and was smaller for the more unstable models and smaller sample sizes. In contrast, the region compatibility index was essentially the same across all conditions and did not show much variance. In a second step, these coefficients were examined in relation to some structural features of the trees. Both coefficients were higher (Kappa) or lower (RC) for the stricter conditions, confirming our hypotheses. Taking the absolute size of the coefficients into account, Kappa was more sensitive to changes in stability than RC. Additionally, the resampling approaches suggested by Philipp et al. (2018) were used to examine the influence of the resampling strategy and of the overlap between learning and original sample on the stability of decision trees. Our results showed that these two factors do not have a large influence on the stability estimates.

Overall, the small coefficient sizes and the small number of stable models were surprising compared to other studies (Dusseldorp & van Mechelen, 2014; Philipp et al., 2018). One potential reason is that the chosen true model itself was already not very stable, as could be seen in study 1. Because of the imbalanced design, QUINT has difficulty detecting the true group effects, which might introduce a certain amount of randomness into the model fitting process. Additionally, in each run the model was not compared to the true model, as is often done in other publications, but to another bootstrap sample. This means that if QUINT did not succeed in finding the true model in at least one of the two samples, the probability that the same wrong model was found twice was very low, so the stability of QUINT in general is probably underestimated. Dusseldorp and van Mechelen (2015), for instance, also report Kappa values for their models. Their model b is based on the same model structure as model 1 in study 2, and they report a mean Kappa of around 0.67 for a sample size of 1000. They obtained this value by comparing the true assignment to the partition classes with the assignment of each repetition, meaning that one of the two models being compared was already fixed. It is therefore not surprising that the averaged Kappa in our study is smaller, only 0.44 for the same true model. Following some of the controversial attempts to classify Kappa values, this size would be categorised as acceptable or moderate (Fleiss, 1973; Landis & Koch, 1977), but the smaller Kappas for all other conditions rather as not trustworthy or only fair, depending on the source.

The region compatibility measure is equal to zero if two trees have exactly the same decision regions (Wang et al., 2018); accordingly, a smaller value indicates higher stability. Furthermore, the authors show that a pair of trees without any identical decision regions has an RC coefficient of around 0.5. In our results, the averaged coefficients are all of about this size. A notable difference can only be seen for the very strict structural condition of two trees sharing the same first two split variables and their split points; in this case the average RC dropped to 0.25. However, the evaluation of the index by Wang et al. (2018) themselves also showed that the range of the coefficient is not expected to be very large; they reported coefficients between 0.3 and 0.5. Consequently, it must be considered to what extent a coefficient that is limited to a rather narrow range is capable of reacting to changes in the data. The authors show that it reacts well to the perturbation ratio of different data sets, increasing with
