
What (not) to expect when classifying rare events

Rok Blagus¹, Jelle J. Goeman²

¹Corresponding author: Rok Blagus, Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Vrazov trg 2, 1000 Ljubljana, Tel: +38615437776, Fax: +38615437771, e-mail: rok.blagus@mf.uni-lj.si

²Leiden University Medical Center, Department of Medical Statistics and Bioinformatics, Leiden, The Netherlands.

Keywords: prediction models, rare events, optimization, overestimation, balanced sensitivity and specificity, g-means



Abstract

When building classifiers, it is natural to require that the classifier correctly estimates the event probability (constraint 1), that it has equal sensitivity and specificity (constraint 2), or that it has equal positive and negative predictive values (constraint 3). We prove that in the balanced case, where there is an equal proportion of events and non-events, any classifier that satisfies one of these constraints will always satisfy all three. Such unbiasedness between events and non-events is much more difficult to achieve in the case of rare events, i.e. the situation in which the proportion of events is (much) smaller than 0.5. Here, we prove that it is impossible to meet all three constraints unless the classifier achieves perfect predictions. Any non-perfect classifier can satisfy at most one constraint, and satisfying one constraint implies violating the other two constraints in a specific direction.

Our results have implications for classifiers optimized using g-means or the F1-measure, which tend to satisfy constraints 2 and 1, respectively. Our results are derived from basic probability theory and illustrated with simulations based on some frequently used classifiers.


1 Introduction

The goal of clinical research is often to estimate the probability of one of two levels of a binary outcome variable (the event). Examples of events in clinical studies include the presence of a disease, its recurrence, or the response to a treatment. For this purpose, several characteristics of the subjects are measured, and the goal is to develop a prediction model (classifier) using a group of subjects with known event status (the training set) to later predict the occurrence of the event for subjects for whom it is not yet known whether the event will occur (prognostic models) [1].

Prediction models are also used for estimating a class status which is already present at the time of estimating the class membership (diagnostic models) [2]. The results presented in this paper apply to both prognostic and diagnostic models.

Prediction models are extensively used in bioinformatics (see [3, 4] and the references therein) with the aim of developing personalized treatments or individualized drug selection [5-9], and they represent a valuable tool in the decision-making process of clinicians and health policy makers. For example, the use of mass spectrometry to develop profiles of patient serum proteins could lead to early detection of ovarian cancer, which has the potential to reduce mortality [10, 11]. Some of the classifiers most frequently used in bioinformatics include nearest neighbor (k-NN, [12]) and nearest centroid classifiers [13], classification trees [14], random forests (RF, [15]) and support vector machines (SVM, [16]) (see [17] or [18] for an introduction to these methods).

When building prediction models it is common that the prevalence of the events is (much) smaller than the prevalence of the non-events (rare events). It has been shown that predictive models developed on data with rare events tend to be overwhelmed by the non-events and to ignore the events [19]. The same issue also occurs in diagnostic models, where the classifiers favor the majority class samples over the minority class samples (the class-imbalance problem) [20]. This is especially important in bioinformatics, where the problem is exacerbated by the large number of variables in comparison with the number of samples (high-dimensionality) [21-27], and it appears in various subfields. For example, prediction of protein families [28], prediction of protein domain types (see [29] and the references therein) and prediction of proteolytic cleavage (see [30] and the references therein) are typical class-imbalance problems, where the number of events may be a few dozen while the number of non-events can reach thousands or even millions. For instance, the Human Protein Reference Database [31], which can be used for the prediction of protein-protein interactions (PPIs) [32], contains 5,000 positive and 2,000,000 negative PPIs, so the prevalence of the positives is only 0.25%. In this paper, however, we speak of rare events as soon as the number of events is smaller than the number of non-events.

A crucial issue when evaluating predictive models developed with rare events is the choice of an appropriate performance measure. Overall predictive accuracy (the proportion of correctly classified samples) and the overall error rate (the proportion of misclassified samples) are misleading measures in this setting, as they favor classifiers which accurately predict the non-events. As an example, consider a situation where the proportion of events is 5%. The classifier which classifies all samples as non-events then achieves a predictive accuracy of 0.95 and an error rate of only 0.05, but it has zero predictive accuracy for the events. In a scenario where the numbers of events and non-events are equal, the same classifier would achieve a predictive accuracy and error rate of 0.5 and would be considered a poor classifier.
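As a quick numeric check of this example (a minimal sketch, not the authors' code):

```python
# Accuracy and error rate of the trivial "always predict non-event" classifier
# at two event prevalences; it looks excellent at 5% events despite missing
# every single event.
for p_event in (0.05, 0.5):
    accuracy = 1.0 - p_event   # all non-events are correct, all events are missed
    error_rate = p_event
    sensitivity = 0.0          # no event is ever predicted
    print(f"P(E)={p_event}: accuracy={accuracy:.2f}, "
          f"error rate={error_rate:.2f}, sensitivity={sensitivity:.1f}")
```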

Several metrics have been proposed that balance the performance of the classifier for the events and the non-events. The most commonly used of these are the F1-measure and g-means.

The F1-measure is the harmonic mean of precision (the proportion of true events among the samples predicted as events, also referred to as positive predictive value) and the accuracy for the events (the proportion of true events that are correctly classified, also referred to as sensitivity). It puts equal emphasis on precision and sensitivity. One issue when using the F1-measure with rare events occurs when all samples are predicted as non-events: the calculation of precision then requires division by zero, so precision is mathematically undefined. The g-means is the geometric mean of sensitivity and the proportion of non-events that are correctly classified (also referred to as specificity). It puts equal emphasis on the accuracies for the events and the non-events. Both measures are commonly used in bioinformatics [27, 33, 34].
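These definitions translate directly into code. The following sketch (illustrative, not from the paper; the function name is ours) computes both measures from confusion-matrix counts and makes the division-by-zero issue explicit:

```python
import math

def f1_and_gmeans(tp, fn, fp, tn):
    """F1-measure and g-means from confusion-matrix counts.

    F1 is the harmonic mean of precision and sensitivity; g-means is the
    geometric mean of sensitivity and specificity (definitions as in the text).
    """
    sensitivity = tp / (tp + fn) if tp + fn > 0 else 0.0
    specificity = tn / (tn + fp) if tn + fp > 0 else 0.0
    # Precision is undefined when no sample is predicted as an event (tp + fp == 0).
    precision = tp / (tp + fp) if tp + fp > 0 else float("nan")
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity > 0 else float("nan"))
    gmeans = math.sqrt(sensitivity * specificity)
    return f1, gmeans

# Trivial classifier on 5% events (n=1000): everything predicted as non-event.
print(f1_and_gmeans(tp=0, fn=50, fp=0, tn=950))   # F1 undefined (nan), g-means 0
```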

Sampling techniques are often used with rare events to improve the performance of predictive models [35-38]. Their aim is to obtain a balanced distribution of events and non-events prior to building the prediction model. Undersampling techniques remove some of the non-events, while oversampling methods generate additional events based on the observed data [39].

The purpose of using sampling techniques is to balance the accuracies for the events and the non-events [24, 27, 40]. Sampling techniques can be seen as an example of separate sampling, where the numbers of events and non-events are predetermined and are not random. It was shown, however, that with separate sampling the data cannot be used to estimate the event probabilities, and that using predetermined numbers of events and non-events may degrade the performance of the classifier in terms of overall accuracy [41, 42].

When the goal of the prediction model is to develop personalized treatments or individualized drug selection, besides achieving high predictive accuracy, several goals may be formulated, which we express as constraints below. It is natural to desire that the prediction model accurately estimates the event probability (constraint 1). We demonstrate that in the setting where the numbers of events and non-events are equal this implies that the probabilities of correctly classifying events and non-events are equal (constraint 2) and that the positive predictive value and the negative predictive value (the proportion of true non-events among the samples predicted as non-events) are equal as well (constraint 3). Satisfying these latter constraints can be desirable in its own right from the perspective of balanced treatment of events and non-events. However, we demonstrate that with rare events meeting these constraints is much more difficult. We show that in this setting non-perfect classifiers can meet at most one constraint, and that meeting any one constraint implies violating the other two constraints in a specific direction. We present simple rules, based on a review of basic probability theory, which are highly relevant when classifiers are applied to rare events. Implications are explored for the situation when the constraints are satisfied as inequalities, e.g. when sensitivity is required to be at least as high as specificity. It is also shown, under some assumptions, which constraints are imposed on the classifier when optimizing its performance based on either g-means or the F1-measure. The theoretical results are illustrated on a high-dimensional simulated data example.

The remainder of the paper is organized as follows. In the next section we introduce notation and present our theoretical results. The subsequent section will then illustrate our propositions by a simulated data example. The paper concludes with a discussion where we summarize and evaluate our main findings.

2 Results

In this section we present our theoretical results, for which the proofs are given in the Appendix. The results are general and apply to any classifier. The section is divided into three parts. In the first part we introduce the notation and the formal constraints. In the second part we present the consequences of satisfying these constraints in a series of propositions. In the third part we show, under some assumptions, which of these constraints are imposed on the classifier when its performance is optimized based on some performance metrics.

2.1 Notation

Let E and N denote the class status of the population members for the events and non-events, respectively, and let P(E) and P(N) denote the proportions of events and non-events in the population under study. Throughout the paper we assume P(E) ≤ P(N). Let Ê and N̂ denote the predicted event status and non-event status, respectively, and let P(Ê) denote the probability of predicting an event status. Note that P(E) = 1 − P(N) and P(Ê) = 1 − P(N̂). A representation of the classifier's performance can be formulated by a confusion matrix (contingency table), as given in Table 1.

Table 1: about here.
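Table 1 itself is not reproduced in this extract; the following is a minimal sketch, in the notation just introduced, of the standard 2×2 confusion matrix in population probabilities, which we assume is the layout Table 1 contains:

```latex
% Assumed layout of Table 1: a 2x2 confusion matrix in population probabilities.
\begin{tabular}{l|cc|c}
          & $E$ (event)         & $N$ (non-event)     & total        \\ \hline
$\hat{E}$ & $P(\hat{E} \cap E)$ & $P(\hat{E} \cap N)$ & $P(\hat{E})$ \\
$\hat{N}$ & $P(\hat{N} \cap E)$ & $P(\hat{N} \cap N)$ & $P(\hat{N})$ \\ \hline
total     & $P(E)$              & $P(N)$              & $1$
\end{tabular}
```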

Assume that one or more of the following constraints can be imposed on a classifier (a small numeric check of these constraints is sketched after the list):

constraint 1: P(Ê) = P(E), (correctly estimated event probability)
constraint 2: P(Ê|E) = P(N̂|N), (equal sensitivity and specificity)
constraint 3: P(E|Ê) = P(N|N̂). (equal positive and negative predictive value)
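As a minimal sketch (not part of the paper; the helper name and tolerance are illustrative), the function below evaluates the three constraints for a classifier summarized by P(E), sensitivity and specificity, deriving P(Ê) and the predictive values from the law of total probability and Bayes' rule:

```python
def constraint_report(p_event, sensitivity, specificity, tol=1e-9):
    """Check constraints 1-3 for a classifier given P(E), P(Ê|E) and P(N̂|N)."""
    # Law of total probability: P(Ê) = P(Ê|E)P(E) + (1 - P(N̂|N))P(N).
    p_pred_event = sensitivity * p_event + (1 - specificity) * (1 - p_event)
    # Bayes' rule for the predictive values.
    ppv = sensitivity * p_event / p_pred_event if p_pred_event > 0 else 0.0
    npv = (specificity * (1 - p_event) / (1 - p_pred_event)
           if p_pred_event < 1 else 0.0)
    return {
        "constraint 1 (P(Ê)=P(E))": abs(p_pred_event - p_event) < tol,
        "constraint 2 (sens=spec)": abs(sensitivity - specificity) < tol,
        "constraint 3 (PPV=NPV)": abs(ppv - npv) < tol,
    }

# Balanced case with equal sensitivity and specificity: all three hold.
print(constraint_report(p_event=0.5, sensitivity=0.8, specificity=0.8))
```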

The three constraints can be of interest in themselves, e.g. because they can be imposed by certain performance measures, as we shall see in Section 2.3. In other situations they can be of interest as central cases reflecting a desire for equal treatment of events and non-events. In the literature, constraints 1 and 2 have so far received more attention than constraint 3.

Throughout, we assume that 0 < P(E) ≤ 1/2. We allow predictors with P(Ê) = 0 or P(Ê) = 1, for which we define P(E|Ê) = 0 and P(N|N̂) = 0, respectively. Note that with this definition constraint 3 excludes the possibility that P(Ê) = 0 or P(Ê) = 1, since those imply both P(E|Ê) = 0 and P(N|N̂) = 0, and therefore that either P(N) = 0 or P(E) = 0.

2.2 Class-prediction as an optimization process

For the class-balanced case, P(E) = P(N) = 1/2, we have the following proposition.

Proposition 1. In the class-balanced case, P(E) = 1/2, constraints 1 and 2 each imply the other two constraints. If Ê and E are not independent, constraint 3 implies the other two constraints.

According to Proposition 1, if a classifier is better than random guessing and meets any one of the constraints, then in the setting with equal number of events and non-events, the other two constraints will also be met. It is reasonable, therefore, to demand that classifiers in this setting meet all three constraints.

In general, however, allowing the class-imbalanced case P(E) < 1/2, we have the following propositions that link the three constraints.

Proposition 2. Constraint 1 is met if and only if P(E|Ê) = P(Ê|E), and if and only if P(N|N̂) = P(N̂|N).

(9)

By this proposition, if we have any classifier that correctly estimates the event probability, it can only have high sensitivity if it has a good positive predictive value, and it can only have high specificity if it has a good negative predictive value. In fact, we can only have a discrepancy between sensitivity and positive predictive value, or between specificity and negative predictive value, if we have a classifier that over- or underestimates the event probability.

Proposition 3. Any combination of two constraints implies the third one.

If we want two of the constraints, we get the third one for free. They are intimately linked.

Proposition 4. Any combination of two constraints implies P(E) = 1/2 or P(Ê|E) = P(N̂|N) = P(E|Ê) = P(N|N̂) = 1.

Propositions 3 and 4 are important as they show that in the setting with P(E) < 1/2 the classifiers cannot meet more than one constraint unless they achieve perfect predictions. In bioinformatics, predictive models almost never achieve perfect predictions, and therefore bioinformaticians must carefully decide which constraint (if any) the model should achieve. However, meeting one constraint will lead to failing both of the other constraints. In fact, they will fail in specific directions, as we shall see below.

For the classifiers which fail to perfectly discriminate between the events and the non-events, the following propositions show what happens in the class-imbalanced case if each one of the individual constraints is met.

Proposition 5. If constraint 1 is met, P(E) < 1/2, and the classifier does not achieve perfect predictions, then P(Ê|E) < P(N̂|N) and P(E|Ê) < P(N|N̂).


According to this proposition, any imperfect classifier that correctly estimates the event probability in the rare event case will always perform worse for the events than for the non-events. This holds both in terms of sensitivity, which is worse than specificity, and positive predictive value, which is worse than negative predictive value. Unbiased estimation of the event probability creates bias in favor of the non-events. It is simply easier to do well for the majority.

The extent to which sensitivity is smaller than specificity and positive predictive value is smaller than negative predictive value depends on the degree of imbalance between events and non-events, and on the quality of the classifier. Poor classifiers have a larger discrepancy between positive and negative predictive value and between sensitivity and specificity than good ones. For example, a classifier with P(Ê|E) + P(N̂|N) = 1.5 can be calculated to have P(N̂|N) = 0.8 and P(Ê|E) = 0.7 if P(E) = 0.4, while it would have P(N̂|N) = 0.95 and P(Ê|E) = 0.55 if P(E) = 0.1. A better classifier with P(Ê|E) + P(N̂|N) = 1.7 would be less imbalanced, with P(N̂|N) = 0.97 and P(Ê|E) = 0.73 if P(E) = 0.1.
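These numbers can be verified directly: under constraint 1 with a fixed sum c = P(Ê|E) + P(N̂|N), the law of total probability gives P(Ê|E) = P(E) − (1 − c)(1 − P(E)). A small check (illustrative code, not the authors'):

```python
# Under constraint 1, P(Ê) = P(E); writing spec = c - sens with c = sens + spec
# fixed and solving P(E) = sens*P(E) + (1 - spec)*(1 - P(E)) gives
#   sens = P(E) - (1 - c) * (1 - P(E)).
def sens_spec_under_constraint1(c, p_event):
    sens = p_event - (1 - c) * (1 - p_event)
    return sens, c - sens

for c, p in [(1.5, 0.4), (1.5, 0.1), (1.7, 0.1)]:
    sens, spec = sens_spec_under_constraint1(c, p)
    print(f"sens+spec={c}, P(E)={p}: sensitivity={sens:.2f}, specificity={spec:.2f}")
# -> (0.70, 0.80), (0.55, 0.95), (0.73, 0.97), matching the text
```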

Proposition 6. If constraint 2 is met, P(E) < 1/2, and the classifier does not achieve perfect predictions, then P(Ê) > P(E) and P(E|Ê) < P(N|N̂).

By this proposition, if in the rare event case we want equal predictive accuracy for the events and non-events, we must favor the events by overestimating the event probability. Classifiers which achieve this will nevertheless have a positive predictive value that is smaller than the negative predictive value, so in that sense they still perform worse for the events than for the non-events.

The magnitude of overestimation of the event probability by classifiers achieving constraint 2 depends on P(Ê|E) = P(N̂|N): the bias is larger when P(Ê|E) = P(N̂|N) is smaller, i.e. when the classifier is further away from the perfect one. For example, in the setting where P(E) = 0.1, a classifier which attains P(Ê|E) = P(N̂|N) = 0.9 will overestimate the event probability by P(Ê) − P(E) = 0.18 − 0.1 = 0.08 (80%), while a classifier which attains P(Ê|E) = P(N̂|N) = 0.6 will overestimate the event probability by 0.32 (320%).
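A quick check of these figures, using P(Ê) = sP(E) + (1 − s)(1 − P(E)) for s = P(Ê|E) = P(N̂|N) (illustrative code, not the authors'):

```python
# Overestimation of P(E) under constraint 2 (sens = spec = s):
# P(Ê) = s*P(E) + (1 - s)*(1 - P(E)).
p_event = 0.1
for s in (0.9, 0.6):
    p_pred = s * p_event + (1 - s) * (1 - p_event)
    print(f"s={s}: P(Ê)={p_pred:.2f}, overestimation={p_pred - p_event:.2f}")
# -> P(Ê)=0.18 (bias 0.08) and P(Ê)=0.42 (bias 0.32), as in the text
```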

While constraint 2 is not often required to hold exactly in bioinformatics, it is very common to require that the events are predicted more accurately than the non-events, i.e. P(Ê|E) > P(N̂|N). This can only be achieved by increasing the estimated event probability even more, and such classifiers will also have P(Ê) > P(E). Positive predictive value will typically suffer from this. As an example, consider again the situation where P(E) = 0.1. Assume that the classifier correctly predicts 90% of the events and 80% of the non-events, i.e. P(Ê|E) = 0.9 and P(N̂|N) = 0.8. While the predictive accuracy for the events is high, the predictive value for the events, P(E|Ê) = 0.33, is poor.
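The value P(E|Ê) = 0.33 follows from Bayes' rule, as the short check below illustrates (code is illustrative, not the authors'):

```python
# Positive predictive value by Bayes' rule for the example above.
p_event, sens, spec = 0.1, 0.9, 0.8
p_pred = sens * p_event + (1 - spec) * (1 - p_event)   # P(Ê) = 0.09 + 0.18 = 0.27
ppv = sens * p_event / p_pred                          # P(E|Ê) = 0.09 / 0.27
print(f"P(Ê)={p_pred:.2f}, PPV={ppv:.2f}")             # -> 0.27, 0.33
```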

Constraint 3 is slightly more complicated. We must distinguish three cases in which that constraint holds.

Proposition 7. Constraint 3 implies that exactly one of the following statements is true:

1. P(E|Ê) = P(N|N̂) < P(E);
2. P(E|Ê) = P(N|N̂) = P(E) = 1 − P(E) = 1/2;
3. P(E|Ê) = P(N|N̂) > 1 − P(E).

We call case 1 from Proposition 7 'inverted classifiers' as they tend to predict the wrong class. Such inverted classifiers can always be transformed into classifiers with superior performance by reversing their classification rule. They are not of real interest. Case 2 is restricted to the balanced case, where it represents the case where prediction and event are independent. This case was excluded in Proposition 1. We restrict to informative classifiers (case 3) for the next proposition.

Proposition 8. If constraint 3 is met, P(E) < 1/2, and P(E|Ê) > 1 − P(E), but the classifier does not achieve perfect predictions, then P(Ê) < P(E) and P(Ê|E) < P(N̂|N).

We see that to achieve constraint 3, if the event is rare, informative classifiers must underestimate the event probability, and as a result have lower predictive accuracy for the events than for the non-events.

Propositions 5, 6, and 8 are especially illuminating when considered together. Consider an informative but not perfect classifier in the rare event situation. If we let the classifier correctly estimate the event probability we do worse for the events than for the non-events in terms of both sensitivity versus specificity and positive versus negative predictive value (Proposition 5). To favor sensitivity over specificity we must overestimate the event probability (Proposition 6), and to favor positive over negative predictive value we must underestimate it (Proposition 8). Solving one imbalance makes the other imbalance worse. This is the catch-22 situation of prediction of rare events: at least one of predictive value or predictive accuracy will be worse for the events than for the non-events.

2.3 Optimization of a classifier's performance based on g-means and the F1-measure

In this section we present which of the constraints from the previous section are imposed on a classifier when its performance is optimized based on g-means or the F1-measure.

The results are valid under the assumptions:


A1: when optimizing the performance of the classifier the sum of sensitivity and specificity is fixed

A2: when optimizing the performance of the classifier the sum of sensitivity and precision is fixed

Proposition 9. g-means is maximized under assumption A1 when P(Ê|E) = P(N̂|N). F1 is maximized under assumption A2 when P(Ê) = P(E).

In bioinformatics it is common to optimize the classifier's performance based on some performance criterion. Loosely speaking, even when assumptions A1 or A2 do not hold, optimizing g-means tends to favor classifiers for which constraint 2 holds, whereas optimizing the F1-measure tends to favor classifiers for which constraint 1 holds. This will be illustrated in Section 3. As a consequence, by Proposition 6, optimizing g-means will lead to overestimated event probabilities, while, by Proposition 5, optimizing the F1-measure will lead to smaller predictive accuracy for the events than for the non-events. In both cases we can expect the positive predictive value to be lower than the negative predictive value, especially if g-means was used.
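As an illustration of how such optimization is typically carried out, the sketch below tunes a classification threshold on synthetic scores to maximize g-means or the F1-measure; the data, score model and names are illustrative assumptions, not the authors' simulation code:

```python
# Hedged sketch of threshold tuning: given predicted event probabilities
# `scores` and labels `y` (1 = event), pick the threshold maximizing g-means
# or F1 on a tuning set.
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.1).astype(int)                         # rare events, P(E)=0.1
scores = np.clip(0.5 * y + rng.normal(0.2, 0.15, y.size), 0, 1)  # toy scores

def metrics(threshold):
    pred = scores >= threshold
    tp, fp = np.sum(pred & (y == 1)), np.sum(pred & (y == 0))
    fn, tn = np.sum(~pred & (y == 1)), np.sum(~pred & (y == 0))
    sens = tp / max(tp + fn, 1)
    spec = tn / max(tn + fp, 1)
    prec = tp / max(tp + fp, 1)
    gmeans = np.sqrt(sens * spec)
    f1 = 2 * prec * sens / max(prec + sens, 1e-12)
    return gmeans, f1, np.mean(pred)                             # mean(pred) estimates P(Ê)

grid = np.linspace(0.01, 0.99, 99)
best_g = max(grid, key=lambda t: metrics(t)[0])
best_f1 = max(grid, key=lambda t: metrics(t)[1])
# As the propositions suggest: the g-means threshold tends to give P(Ê) > P(E),
# while the F1 threshold tends to keep P(Ê) close to P(E).
print("g-means threshold:", best_g, "-> P(Ê) =", round(metrics(best_g)[2], 3))
print("F1 threshold:     ", best_f1, "-> P(Ê) =", round(metrics(best_f1)[2], 3))
```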

3 Examples

In this section we illustrate our theoretical results using a simulated data example with 500 samples and 1000 correlated variables using a block exchangeable correlation structure [43, 44]. Different proportions of events are considered (P(E) = 1/2 and P(E) = 0.1). Four classifiers, commonly used in bioinformatics, are used: nearest neighbor based on Euclidean distance (1-NN [45]), diagonal linear discriminant analysis (DLDA [46]), support vector machine (SVM [47], using radial (SVMR) and linear (SVML) kernels) and random forest (RF [48]). See [24] for a more detailed description of the classifiers. The performance of the classifiers was evaluated with a large independent test set and was summarized with seven measures as given in Table 2, where PA1, PA0, PV1, PV0 and P1 are the respective estimates of P(Ê|E), P(N̂|N), P(E|Ê), P(N|N̂), and P(Ê).

Table 2: About here.

In the setting with P(E) = 0.1, some techniques which were proposed to solve the class-imbalance problem, e.g. changing the classification threshold and randomly undersampling the majority class, have also been considered [19]. More details on the classifiers and the data simulation can be found in the Supplementary information.

This section is divided into two parts. In the first part we show the performance of the classifiers under two settings, i.e. the setting with P(E) = 1/2 and the setting with P(E) = 0.1. In the second part we illustrate the performance of SVMR, SVML and RF in the setting with P(E) = 0.1, where the performance of the classifiers was optimized based on g-means or the F1-measure.
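The sketch below shows one way to generate data of this general type; the block size, within-block correlation, number of informative variables and effect size are illustrative assumptions, since the exact settings are given in the Supplementary information rather than here:

```python
# Hedged sketch of class-imbalanced, high-dimensional data with a
# block-exchangeable correlation structure (parameters are illustrative).
import numpy as np

def simulate(n=500, p=1000, p_event=0.1, block=50, rho=0.8, shift=0.5, seed=1):
    rng = np.random.default_rng(seed)
    y = (rng.random(n) < p_event).astype(int)
    # Equicorrelated blocks: X = sqrt(rho)*Z_block + sqrt(1-rho)*noise gives
    # correlation rho within each block of `block` variables, 0 across blocks.
    z = np.repeat(rng.normal(size=(n, p // block)), block, axis=1)
    x = np.sqrt(rho) * z + np.sqrt(1 - rho) * rng.normal(size=(n, p))
    x[y == 1, :10] += shift          # a few informative variables for the events
    return x, y

X, y = simulate()
print(X.shape, y.mean())             # (500, 1000), roughly 0.1 events
```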

3.1 Performance of the classifiers

In the setting with P(E) = 1/2 all the classifiers met all three constraints, i.e. the class-specific accuracies and predictive values were equal (PA1 = PA0 and PV1 = PV0) and the proportion of predicted events was the same as the true proportion of events, P1 = P(E) (Table 3, where the results averaged over 1000 simulation runs and the standard deviations (in brackets) are reported). However, as suggested by our theoretical results, in the setting with P(E) = 0.1 (Table 4), none of the classifiers met all three constraints.

Table 3. About here.

Table 4. About here.

The classifiers trained on undersampled training sets met constraint 2. This can be explained by Propositions 1 and 4: as the undersampled data were balanced, it is natural to meet all three constraints in those data. Going back from the undersampled to the original data, sensitivity and specificity are stable, so constraint 2 is retained. By Proposition 4 at most one constraint can be retained because the classifier is not perfect, so constraints 1 and 3 are lost. Indeed, the proportion of predicted events was too large (P1 > P(E)) and PV1 < PV0. As expected, the proportion of predicted events was closer to the proportion of events when sensitivity and specificity were larger. SVML-CO also met constraint 2. The other classifiers did not meet any constraint. SVML was the closest to meeting constraint 1, while none of the classifiers met constraint 3. Some classifiers (SVMR and RF) predicted all new samples as non-events, hence PV1 could not be estimated and was, for convenience, set to 0.

3.2 Optimizing the performance of the classifiers in the class-imbalanced scenario

Here we illustrate, for the setting with P(E) = 0.1, the consequences of optimizing the performance of the classifier based on g-means or the F1-measure. The performance of SVMR, SVML and RF was optimized by changing the classification threshold. The classification threshold leading to the largest g-means or F1-measure was used for classifying the test set samples. The results are reported in Table 5. As expected from the propositions, when the performance was optimized based on g-means, the accuracies for the events and non-events were equal (PA1 = PA0) and the proportion of predicted events was too large (P1 > P(E)). When the F1-measure was used to optimize the performance of the classifiers, the proportion of predicted events was the same as in the studied population, P1 = P(E) = 0.1, and the accuracy for the events and the predictive value for the events were also similar, PA1 ≈ PV1. Note the large variability of these estimates.

Table 5. About here.

4 Discussion

When predictive models are developed it is important to specify the goals that the predictive model should achieve. Besides achieving high predictive accuracy, prediction models could also be required to accurately estimate the event probability (constraint 1). We showed that in the setting where the numbers of events and non-events are equal, classifiers which correctly estimate the event probability also achieve equal accuracy for the events (sensitivity) and for the non-events (specificity), i.e. constraint 2, as well as equal predictive values for the events (positive predictive value) and for the non-events (negative predictive value), i.e. constraint 3. However, when the classifiers are developed on data with rare events, it is impossible to achieve any pair of constraints unless the classifier achieves perfect predictions. We demonstrated with a simulated example with rare events that classifiers which meet one constraint are closer to meeting the other constraints if their predictions are more accurate.

In order to improve the performance of predictive models in the presence of rare events, classifiers are often developed on training data where a balanced distribution was obtained prior to building the prediction model. Such models tend to achieve equal sensitivity and specificity and will consequently overestimate the event probability when applied in the rare events situation. This was demonstrated in our simulated data example for classifiers which were trained on undersampled training data. Classifiers with lower sensitivity and specificity overestimated the event probability by more than the classifiers with higher sensitivity and specificity. These results complement the results presented in [41], where it is shown that the error rate of the classifier is larger if the proportion of events in the training set is different from that in the population.

We emphasize the importance of the choice of the overall performance measure for the resulting classifier. Focusing only on the overall error rate will inevitably favor classifiers which perform poorly for the rare events, and it gives researchers no guarantee that the probabilities will be correctly estimated, even with random sampling, as can clearly be seen in our example with rare events, where none of the classifiers met constraint 1.

Our results also show that, with rare events, the classifiers which accurately estimate the event probability cannot classify events and non-events with equal precision. When predicting rare events it is common that misclassifying events is more costly than misclassifying non-events [49-51]. In other words, predictive models which achieve higher sensitivity than specificity are favored over predictive models where specificity is larger than sensitivity. Such classifiers must overestimate the event probability. An example of such a classifier in our simulated data example was the random forest with adjusted threshold (RF-CO), which achieved a sensitivity and specificity of 0.9 and 0.6, respectively, and severely overestimated the event probability (0.44 instead of 0.1).


In our simulated example none of the considered classifiers met constraint 3. According to our theoretical results this is not surprising, as constraint 3 can only be met by classifiers for which the positive and negative predictive values both exceed the proportion of the non-events, in our simulated data example 0.9. Meeting constraint 3 becomes more difficult as the proportion of events gets smaller. Also, the classifiers which meet constraint 3 were shown to underestimate the event probability.

Under certain assumptions, optimizing the predictive performance of the classifier based on the F1-measure imposes the constraint that sensitivity and positive predictive value are equal. We showed that this can hold only when the proportion of predicted events equals the proportion of events in the studied population, so this imposes constraint 1. Similarly, we showed that optimizing the classifier's performance based on g-means imposes the constraint that the predictive accuracies for the events and non-events are equal (constraint 2). This result assumes a diagonal line in the ROC space [52, 53]. Most ROC curves are of course more curved, which can make the point of equality even the point that maximizes the sum of sensitivity and specificity itself. Still, it is not guaranteed that a classifier maximizes the g-means when sensitivity and specificity are equal. It can be argued from the proposition and the general shapes of ROC curves that optimizing g-means typically achieves near-equality, which is also seen in our simulated data example.

In conclusion, ideally a classifier would have high predictive ability for new data, would accurately estimate the event probability, and would accurately classify both the events and the non-events. Unfortunately, if one is not blessed with a perfect classifier, i.e. a classifier which correctly classifies any new sample, simultaneously achieving these goals in the presence of rare events is not possible. Therefore researchers have to decide which goal is the most important for the task at hand; achieving that goal will, however, lead to failing the other goals. Our results, besides being relevant for the event-prediction framework, are equally valid in the class-prediction framework, for example in diagnostic testing [54], when one class occurs much less frequently than the other class.

Key points

• In the setting with equal number of events and non-events the classifiers can correctly estimate the event probability and classify events and non-events with equal precision.

• With rare events, the classifiers cannot simultaneously correctly estimate the event probability and classify events and non-events with equal precision unless they can correctly classify every new sample.

• With rare events, the classifiers which correctly estimate the event probability will classify events with smaller precision than non-events.

• With rare events, the classifiers which classify events and non-events with equal precision will overestimate the event probability.

• With rare events the informative classifiers can achieve equal positive and negative predictive values only when they both exceed the proportion of the non-events. Such classifiers will underestimate the event probability.

Biographical note

Rok Blagus is an assistant professor at the Institute for Biostatistics and Medical Informatics at the Faculty of Medicine, University of Ljubljana. His research interests involve developing accurate prediction models for bioinformatics.


Jelle Goeman is professor of biostatistics at the Department of Medical Statistics and Bioinformatics at Leiden University Medical Center. His research interests are in hypothesis testing and predictive modeling in high-dimensional data.

Appendix: Proofs of the propositions

Proof of Proposition 1. To prove the proposition we prove that constraints 1 and 2 imply each other, that constraint 1 implies constraint 3, and, if Ê and E are not independent, that constraint 3 implies constraint 1.

By the law of total probability, we have

P(Ê) = P(Ê|E)P(E) + (1 − P(N̂|N))P(N) = (1/2)P(Ê|E) + (1/2)(1 − P(N̂|N)).

From this, we see that constraint 1 implies that

1/2 = (1/2)P(Ê|E) + (1/2)(1 − P(N̂|N)),

from which constraint 2 follows. Conversely, constraint 2 implies that P(Ê) = 1/2, which implies constraint 1. This shows that constraints 1 and 2 are equivalent.

Similarly, by the law of total probability we have

P(E) = P(E|Ê)P(Ê) + (1 − P(N|N̂))P(N̂).

If constraint 1 holds, we have

1/2 = (1/2)P(E|Ê) + (1/2)(1 − P(N|N̂)),

which implies constraint 3. Conversely, if constraint 3 holds, by the law of total probability, we have

P(E) = P(E|Ê)P(Ê) + (1 − P(E|Ê))(1 − P(Ê)) = 1 − P(E|Ê) + P(Ê)(2P(E|Ê) − 1).   (1)

Since Ê and E were assumed not to be independent, we have P(E|Ê) ≠ P(E) = 1/2, so that

P(Ê) = (P(E) + P(E|Ê) − 1) / (2P(E|Ê) − 1) = (P(E|Ê) − 1/2) / (2P(E|Ê) − 1) = 1/2,

so that constraint 1 follows. □

Proof of Proposition 2. This follows immediately from Bayes' rule:

P(E|Ê) = P(Ê|E)P(E) / P(Ê);  P(N|N̂) = P(N̂|N)P(N) / P(N̂).   (2)

Proof of Proposition 3. By Proposition 2, if constraint 1 is met, constraints 2 and 3 imply each other. If constraints 2 and 3 hold, we obtain from (2) that P(E)/P(Ê) = P(N)/P(N̂), from which constraint 1 follows. □

Proof of Proposition 4. By Proposition 3 we can assume that all three constraints apply. By the law of total probability, we have

P(Ê) = P(Ê|E)P(E) + (1 − P(N̂|N))(1 − P(E)).

Substituting P(Ê) for P(E) and P(Ê|E) for P(N̂|N), and rearranging terms, we get

2P(Ê) − 1 = P(Ê|E)(2P(Ê) − 1),

which holds if and only if P(Ê|E) = P(N̂|N) = 1 or P(E) = P(Ê) = 1/2. □

Proof of Proposition 5. By constraint 1 and the law of total probability we have

P(E) = P(Ê) = P(Ê|E)P(E) + (1 − P(N̂|N))(1 − P(E)).

Rearranging terms, we obtain

1 − P(N̂|N) = (P(E) / (1 − P(E))) (1 − P(Ê|E)).

Since P(E) < 1/2, it follows that 1 − P(N̂|N) < 1 − P(Ê|E), i.e. P(Ê|E) < P(N̂|N). By Proposition 2 the same holds for P(E|Ê) and P(N|N̂). □

Proof of Proposition 6. By the law of total probability and constraint 2,

P(Ê) = P(Ê|E)P(E) + (1 − P(Ê|E))P(N) = (1 − 2P(N))P(Ê|E) + P(N),

whence, since P(Ê|E) < 1 and 1 − 2P(N) < 0, it follows that P(Ê) > 1 − P(N) = P(E). Since P(Ê) > P(E) and P(N̂) < P(N), and using constraint 2, we have

P(E|Ê) = P(Ê|E)P(E) / P(Ê) < P(Ê|E) = P(N̂|N) < P(N̂|N)P(N) / P(N̂) = P(N|N̂). □


Proof of Proposition 7. Remember that constraint 3 implies that 0 < P(Ê) < 1. By the law of total probability, and using constraint 3, we have (1). If P(E|Ê) = 1/2, this implies P(E) = 1/2, so that Statement 2 is true. If P(E|Ê) > 1/2, we have P(E) > 1 − P(E|Ê), which implies Statement 3. If P(E|Ê) < 1/2, since P(Ê) < 1, we have P(E) > P(E|Ê), which implies Statement 1. As P(E) ≤ 1/2, the three statements are mutually exclusive. □

Proof of Proposition 8. From (1) we have, since P(E) < 1/2 and P(E|Ê) > 1/2, that

P(Ê) = (P(E) + P(E|Ê) − 1) / (2P(E|Ê) − 1) < (P(E|Ê) − 1/2) / (2(P(E|Ê) − 1/2)) = 1/2.

By the law of total probability and constraint 3,

P(E) = P(E|Ê)P(Ê) + (1 − P(E|Ê))P(N̂) = (1 − 2P(N̂))P(E|Ê) + P(N̂),

whence, since P(E|Ê) < 1 and, because P(Ê) < 1/2 implies P(N̂) > 1/2, 1 − 2P(N̂) < 0, it follows that P(E) > 1 − P(N̂) = P(Ê). Since P(Ê) < P(E) and P(N̂) > P(N), and using constraint 3, we have

P(Ê|E) = P(E|Ê)P(Ê) / P(E) < P(E|Ê) = P(N|N̂) < P(N|N̂)P(N̂) / P(N) = P(N̂|N). □

Proof of Proposition 9. In general, for any x, y with x + y = c, let f(x, y) = xy. For fixed c, maximizing f is the same as maximizing √f. This maximum is at x = c/2, where x = y.

Now the result for g-means follows when we take x = P(Ê|E) and y = P(N̂|N). The result for F1 follows by taking x = P(Ê|E) and y = P(E|Ê), after remarking that maximizing 2xy/c for fixed c is equivalent to maximizing f. By Proposition 2, P(Ê|E) = P(E|Ê) is equivalent to P(Ê) = P(E). □
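The step from maximizing f to the maximum at x = y can also be seen via the arithmetic-geometric mean inequality; a short LaTeX rendering of this standard argument (our addition, not part of the original proof):

```latex
% AM-GM: for x, y >= 0 with x + y = c fixed,
\sqrt{xy} \le \frac{x+y}{2} = \frac{c}{2},
\quad \text{with equality iff } x = y = \tfrac{c}{2},
% so the g-means \sqrt{xy} (and hence xy) is maximized exactly at x = y.
```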

References

[1] Bishop CM. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st edition 2007.

[2] Zhou X, Obuchowski NA, McClish DK. Statistical Methods in Diagnostic Medicine. Wiley, 2nd edition 2011.

[3] Ma S, Huang J: Penalized feature selection and classification in bioinformatics. Briefings in Bioinformatics 2008, 9(5): 392–403.

[4] Baek S, Tsai CA, Chen JJ. Development of biomarker classifiers from high-dimensional data. Briefings in Bioinformatics 2009, 10(5):537–546.

[5] Massague J. Sorting Out Breast-Cancer Gene Signatures. New England Journal of Medicine 2007, 356(3):294–297.

[6] Collins G, Mallett S, Omar O, et al. Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting. BMC Medicine 2011, 9:103.


[7] Majewski IJ, Bernards R. Taming the dragon: genomic biomarkers to individualize the treatment of cancer. Nature medicine 2011: 304–312.

[8] Simon R, Roychowdhury S. Implementing personalized cancer genomics in clinical trials. Nature Reviews Drug Discovery 2013, 12(5):358–369.

[9] Kleftogiannis D, Kalnis P, Bajic VB. Progress and challenges in bioinformatics approaches for enhancer identification. Briefings in Bioinformatics 2015, http://bib.oxfordjournals.org/content/early/2015/12/03/bib.bbv101.abstract.

[10] Sorace JM, Zhan M. A data review and re-assessment of ovarian cancer serum proteomic profiling. BMC Bioinformatics 2003, 4:24.

[11] Leung F, Musrap N, Diamandis EP, et al. Advances in mass spectrometry-based technologies to direct personalized medicine in ovarian cancer. Advances in Integrative Medicine 2013, 1:74 – 86.

[12] Li L, Darden TA, Weinberg CR, et al. Gene Assessment and Sample Classification for Gene Expression Data Using a Genetic Algorithm/k-nearest Neighbor Method. Combinatorial Chemistry 2001, 4(8):727–739.


[13] Oberthuer A, Berthold F, Warnat P, et al. Customized oligonucleotide microarray gene expression-based classification of neuroblastoma patients outperforms current clinical risk stratification. Journal of Clinical Oncology 2006, 24(31):5070–5078.

[14] Tan PJ, Dowe DL, Dix TI. Building Classification Models from Microarray Data with Tree-Based Classification Algorithms. In Australian Conference on Artificial Intelligence 2007:589–598.

[15] Wu J, Liu H, Duan X, et al. Prediction of DNA-binding residues in proteins from amino acid sequences using a random forest model with a hybrid feature. Bioinformatics 2008, 25:30–35.

[16] Brown MP, Grundy WN, Lin D, et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences USA 2000, 97:262–267.

[17] Speed TP. Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC 2003.

[18] Simon RM, Korn EL, McShane LM, et al. Design and Analysis of DNA Microarray Investigations. New York: Springer 2004.


[19] He H, Garcia EA: Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 2009, 21(9):1263–1284.

[20] Japkowicz N, Stephen S. The class imbalance problem: A systematic study. Intelligent Data Analysis 2002, 6(5):429–449.

[21] MacIsaac KD, Gordon DB, Nekludova L, et al. A hypothesis-based approach for identifying the binding specificity of regulatory proteins from chromatin immunoprecipitation data. Bioinformatics 2006, 22(4):423–429.

[22] Wang J, Xu M, Wang H, et al. Classification of Imbalanced Data by Using the SMOTE Algorithm and Locally Linear Embedding. In International Conference on Signal Processing, Volume 3 2006.

[23] Batuwita R, Palade V. microPred: effective classification of pre-miRNAs for human miRNA gene prediction. Bioinformatics 2009, 25(8):989–995.

[24] Blagus R, Lusa L. Class prediction for high-dimensional class-imbalanced data. BMC Bioinformatics 2010, 11:523.

[25] Xiao J, Tang X, Li Y, et al. Identification of microRNA precursors based on random forest with network-level representation method of stem-loop structure. BMC Bioinformatics 2011, 12:1–8.


[26] Doyle S, Monaco J, Feldman M, et al. An active learning based classification strategy for the minority class problem: application to histopathology annotation. BMC Bioinformatics 2011, 12:1–14.

[27] Lin WJ, Chen JJ: Class-imbalanced classifiers for high-dimensional data. Briefings in Bioinformatics 2013, 14(1): 13-26.

[28] Lin J, Adjeroh D, Jiang BH: Probabilistic suffix array: efficient modeling and prediction of protein families. Bioinformatics 2012, 28(10):1314-1323.

[29] Chakraborty A, Chakrabarti S: A survey on prediction of specificity-determining sites in proteins. Briefings in Bioinformatics 2015, 16:71-88.

[30] duVerle DA, Mamitsuka H: A review of statistical methods for prediction of proteolytic cleavage. Briefings in Bioinformatics 2012, 13(3):337-349.

[31] Keshava Prasad TS, Goel R, Kandasamy K, et al. Human Protein Reference Database – 2009 update. Nucleic Acids Res. 2009, 37: D767-D772.

[32] Park Y, Marcotte EM: Revisiting the negative example sampling problem for predicting protein-protein interactions. Bioinformatics 2011, 27(21): 3024-3028.


[33] Daskalaki S, Kopanas I, Avouris N. Evaluation of classifiers for an uneven class distribution problem. Applied Artificial Intelligence 2006, 20:1–37.

[34] Blagus R, Lusa L. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinformatics 2013, 14:1–13.

[35] Radivojac P, Chawla NV, Dunker AK, et al. Classification and knowledge discovery in protein databases. Journal of Biomedical Informatics 2004, 37(4):224 – 239.

[36] Taft L, Evans R, Shyu C, et al. Countering imbalanced datasets to improve adverse drug event predictive models in labor and delivery. Journal of Biomedical Informatics 2009, 42(2):356 – 364.

[37] Kim S, Choi J. An SVM-based high-quality article classifier for systematic reviews. Journal of Biomedical Informatics 2014, 47:153–159.

[38] Li J, Li C, Han J, et al. The detection of risk pathways, regulated by miRNAs, via the integration of sample-matched miRNA-mRNA profiles and pathway structure. Journal of Biomedical Informatics 2014, 49:187 – 197.

[39] Chawla NV, Bowyer KW, Hall LO, et al. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 2002, 16:341–378.


[40] Ahn H, Moon H, Fazzari MJ, et al. Classification by ensembles from random partitions of high-dimensional data. Computational Statistics and Data Analysis 2007, 51(12):6166–6179.

[41] Esfahani MS, Dougherty ER: Effect of separate sampling on classification accuracy. Bioinformatics 2014, 30(2):242–250.

[42] Braga-Neto UM, Zollanvari A, Dougherty ER: Cross-validation under separate sampling: strong bias and how to correct it. Bioinformatics 2014, 30(32):3349–3355.

[43] Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics 2007, 8:86–100.

[44] Pang H, Tong T, Zhao H. Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data. Biometrics 2009, 65(4):1021–1029.

[45] Fix E, Hodges JL Jr. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Tech. Rep. 4, 1951.

[46] Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002, 97(457):77–87.


[47] Cortes C, Vapnik V. Support-vector networks. Machine Learning 1995, 20(3):273–297.

[48] Breiman L. Random forests. Machine Learning 2001, 45:5–32.

[49] Hess KR, Anderson K, Symmans WF, et al. Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J. Clin. Oncol. 2006, 24(26):4236–4244.

[50] Karlsson E, Delle U, Danielsson A, et al. Gene expression variation to predict 10-year survival in lymph-node-negative breast cancer. BMC Cancer 2008, 8:254.

[51] Garman KS, Acharya CR, Edelman E, et al. A genomic approach to colon cancer risk stratification yields biologic insights into therapeutic opportunities. Proc. Natl. Acad. Sci. U.S.A. 2008, 105(49):19432–19437.

[52] Sonego P, Kocsor A, Pongor S. ROC analysis: applications to the classification of biological sequences and 3D structures. Briefings in Bioinformatics 2008, 9(3):198–209.

[53] Berrar D, Flach P. Caveats and pitfalls of ROC analysis in clinical microarray research (and how to avoid them). Briefings in Bioinformatics 2012, 13:83–97.

[54] Pepe MS. The Statistical Evaluation of Medical Tests for Classification and Prediction. New York: Oxford University Press 2003.
