
Random Forests

Sören Mindermann

September 21, 2016

Bachelor project

Supervision: dr. B. Kleijn

Korteweg-de Vries Instituut voor Wiskunde


Abstract

Tree-based classification methods are widely applied tools in machine learning that create a taxonomy of the space of objects to classify. Nonetheless, their workings are not documented with a high degree of formalism. The present paper places them into the mathematical context of decision theory, examines them from the perspective of Bayesian classification, where the probability of misclassification is minimized, and covers the example of handwritten character recognition. Tree classifiers exhibit low bias but high variance. Pruning of trees reduces variance by decreasing the complexity of the classification model. The random forest algorithm further reduces variance by combining multiple trees and aggregating their votes. To make use of the Law of Large Numbers, a sample from the space of tree classifiers is taken. Both the tree construction and the selection of training data are randomized for this purpose. Given a measure of the strength of individual trees and the dependence between them, an upper bound for the probability of misclassification of a random forest with an infinite number of trees can be derived.

Title: Random Forests

Author: Sören Mindermann, soeren.mindermann@gmail.com, 10457003
Supervision: dr. B. Kleijn

Date: September 21, 2016

Korteweg-de Vries Instituut voor Wiskunde Universiteit van Amsterdam

Science Park 904, 1098 XH Amsterdam http://www.science.uva.nl/math


Contents

1 Introduction to Decision Theory
1.1 Decision Theory
1.1.1 Frequentist Decision Theory
1.1.2 Bayesian Decision Theory
1.2 Classification

2 Classification with Decision Trees
2.1 Supervised learning
2.2 Tree classifiers
2.2.1 Construction of tree classifiers
2.2.2 Bias and Variance
2.2.3 Model overfitting
2.2.4 Pruning

3 Random Trees and Forests
3.1 Bootstrapping
3.2 Randomized trees for character recognition
3.2.1 Introduction
3.2.2 Feature extraction
3.2.3 Exploring shape space
3.2.4 Feature randomization
3.3 Random Forests
3.3.1 The random forest algorithm
3.3.2 Convergence of random forests

4 Conclusion

5 Popular summary


1 Introduction to Decision Theory

Many situations require us to make decisions under uncertainty. The first chapter builds a mathematical framework for such 'decision-theoretic' problems, first from a frequentist and then from a Bayesian point of view, which lays the foundation for the theory of classification. In chapter 2, we introduce tree classifiers, an applied algorithm used to classify objects. Chapter 3 then goes into an application of tree classifiers and examines random forests, the combination of randomized tree classifiers that improves the accuracy of classification.

1.1 Decision Theory

Decisions can be rated according to certain optimality criteria. In medical diagnosis, for example, a false positive diagnosis of a severe illness such as cancer can be much more harmful than a false negative, which need not be the case for a less severe illness. Such differences are not taken into account in the classical framework of statistical inference. In that framework, we work with concepts such as Type-I and Type-II errors as well as confidence intervals, but these do not allow us to directly assign a value to how unfavorable certain errors are. Statistical decision theory solves this by introducing a loss-function, which quantifies the impact of an incorrect decision.

The mathematical set-up for decision-theoretic problems is similar to that of a regular problem of statistical inference, with mostly semantic differences. Instead of a model we speak of a state-space Θ and an unknown state θ ∈ Θ. The observation Z takes a value in the sample space Z, which is measurable with σ-algebra B. Z is a random variable with distribution Pθ : B → [0, 1] for a given state θ ∈ Θ. We take a decision a from the decision-space A based on the realization z of Z. There is an optimal decision for each state θ of the system, but since θ is unknown the decision made can be suboptimal. Statistical decision theory attempts to decide in an optimal way given only the observation Z. The decision procedure can be seen as a way of estimating the optimal decision.

The main feature that distinguishes decision theory from statistical inference is the loss-function.

Definition 1.1. A loss-function can be any lower-bounded function L : Θ ×A → R. The function −L : Θ ×A → R is called the utility-function.

Given a particular state θ of the system, a loss L(θ, a) is incurred for an action a. The loss can be positive or negative. In the latter case it is a profit, which usually signifies that the decision was optimal or near-optimal. Since we do not have θ but only the data Z to form the decision, a decison rule that maps observations to decisions is needed.


Definition 1.2. LetA be a measurable space with σ-algebra H . A measurable function δ :Z → A is called a decision rule.

The space of decision rules to consider is denoted by ∆. Clearly, the decision theorist’s goal is to find a δ that is optimal at minimizing the loss according to some principle. Again, the decision rule can be compared to the estimator in the case of statistical inference, which maps observations to estimates of a parameter in the frequentist case or beliefs about a parameter in the Bayesian case. Having set up the common foundation for decision theoretic problems we now examine both cases separately, starting with frequentist decision theory.

1.1.1 Frequentist Decision Theory

The frequentist assumes that Z is distributed according to a 'true' distribution Pθ0 for some state θ0 ∈ Θ. A key concept for analyzing the optimality of decision rules is the expected loss.

Definition 1.3. The risk-function R : Θ × ∆ → R is defined as the expected loss under Pθ when δ is used as the decision rule,

R(θ, δ) = ∫_Z L(θ, δ(z)) dPθ(z). (1.1)

Of course, only the expected loss under the true distribution is of interest to the frequentist. Still, we need a way to compare the risks of two decision rules for any parameter θ. This motivates the following definition.

Definition 1.4. Given are the state-space Θ, the states Pθ (θ ∈ Θ), the decision space A and the loss L. For δ1, δ2 ∈ ∆, δ1 is called R-better than δ2 if for all θ ∈ Θ

R(θ, δ1) < R(θ, δ2). (1.2)

A decision rule δ is admissible if there exists no δ′ which is R-better than δ (and inadmissible otherwise).

The ordering induced on ∆ by this definition is merely partial: it is possible that for two decision rules δ1, δ2 neither is R-better than the other. Therefore we define the minimax decision principle, which picks the decision rule that is optimal in the worst possible case.

Definition 1.5. Let Θ be the state-space, Pθ (θ ∈ Θ) the states, A the decision space and L the loss. The minimax risk is given by the function

∆ → R : δ ↦ sup_{θ∈Θ} R(θ, δ). (1.3)

Let δ1, δ2 ∈ ∆ be given. Then δ1 is minimax-preferred to δ2 if

sup_{θ∈Θ} R(θ, δ1) < sup_{θ∈Θ} R(θ, δ2). (1.4)

If δM ∈ ∆ minimizes δ ↦ sup_{θ∈Θ} R(θ, δ), it is called a minimax decision rule.


The existence of such a δM is guaranteed by the minimax theorem under very general conditions which can be found in [1].

1.1.2 Bayesian Decision Theory

In Bayesian decision theory we incorporate the prior by integrating the risk function over Θ against it. This approach is more balanced than minimizing the maximum of the risk function over Θ: by taking the prior into account, it does not focus only on the worst possible state, making it less pessimistic than the minimax decision rule.

Definition 1.6. Let Θ be the state-space, Pθ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. The Bayesian risk function is given by

r(Π, δ) = ∫_Θ R(θ, δ) dΠ(θ). (1.5)

Furthermore, let δ1, δ2 ∈ ∆ be given. Then δ1 is Bayes-preferred to δ2 if

r(Π, δ1) < r(Π, δ2). (1.6)

If δΠ minimizes δ ↦ r(Π, δ), then δΠ is called a Bayes rule, and r(Π, δΠ) is called the Bayes risk.

The Bayes risk is always upper bounded by the minimax risk because the minimax risk equals the Bayes risk for a pessimistic prior where all probability mass is on a state θ ∈ Θ that maximizes the risk function.
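A short derivation of this comparison, using only (1.3) and (1.5): for any prior Π and any decision rule δ,

r(Π, δ) = ∫_Θ R(θ, δ) dΠ(θ) ≤ sup_{θ∈Θ} R(θ, δ),

and taking the infimum over δ ∈ ∆ on both sides shows that the Bayes risk r(Π, δΠ) is at most the minimax risk sup_{θ∈Θ} R(θ, δM).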

Definition 1.7. Let Θ be the state-space, Pθ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. We define the decision rule δ∗ : Z → A to be such that for all z ∈ Z,

∫_Θ L(θ, δ∗(z)) dΠ(θ|Z = z) = inf_{a∈A} ∫_Θ L(θ, a) dΠ(θ|Z = z). (1.7)

In other words, δ∗(z) minimizes the posterior expected loss point-wise for every z.

Given that δ∗ is defined as a point-wise minimizer, the question arises whether it exists and is unique. Neither can be proven in the general case. However, given the existence of δ∗, the following theorem establishes that it is indeed a Bayes rule.

Theorem 1.8. Let Θ be the state-space, Pθ (θ ∈ Θ) the states, A the decision space and L the loss. Assume that Θ is measurable with σ-algebra G and a prior Π : G → [0, 1]. Assume there is a σ-finite measure µ : B → [0, ∞] such that Pθ is dominated by µ for all θ ∈ Θ. If the decision rule δ∗ is well-defined, then δ∗ is a Bayes rule.

The proof uses the Radon-Nikodym theorem as well as Fubini's theorem and is given in Kleijn [1].
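As an illustration of the point-wise minimization in Definition 1.7, the following sketch computes δ∗(z) for a finite state-space; the loss matrix and posterior probabilities are hypothetical values chosen only for the example.

import numpy as np

# Hypothetical finite problem: 3 states, 2 possible decisions.
# loss[i, j] = L(theta_i, a_j); posterior[i] = Pi(theta_i | Z = z).
loss = np.array([[0.0, 5.0],
                 [1.0, 0.0],
                 [3.0, 0.5]])
posterior = np.array([0.2, 0.5, 0.3])

# Posterior expected loss of each decision a: sum over theta of L(theta, a) Pi(theta | z).
expected_loss = posterior @ loss

# delta*(z) is the decision with minimal posterior expected loss.
best_decision = int(np.argmin(expected_loss))
print(expected_loss, best_decision)   # [1.4  1.15] 1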


1.2 Classification

A problem of classification arises in the context where there is a probability space (Ω, F, P) containing objects to classify, a set C = {1, 2, . . . , c} of classes and a random variable Y : Ω → C that maps each object to a class. Furthermore, each object has observable features, which gives rise to a random variable V : Ω → V, where V = {0, 1}^M is the feature space and M the total number of features. In what follows, the feature space can be ignored in most cases. In the language of decision theory, C is the state-space. The problem to solve is that the class Y(x) is not known for each object x ∈ Ω.

Therefore, a decision rule Ŷ : Ω → C, normally called a classifier, is required. Ŷ can be seen as an estimate for the true class function Y; it can also be written as a function V → C.

A loss function L : C × C → R is given (for classification, the decision space coincides with the class space C). When not specified, we define it by default as

L(k, l) = 1_{k≠l}, (1.8)

i.e. there is a loss of 1 for each misclassification. According to the minimax decision principle we look for a classifier δM : Z → C that minimizes

δ ↦ sup_{k∈C} ∫_Z L(k, δ(z)) dP(z|Y = k) = sup_{k∈C} P(δ(Z) ≠ k | Y = k). (1.9)

In other words, the principle demands that the probability of misclassification is minimized uniformly over all classes.

The Bayesian analogue of the prior on the state-space is a prior on the class space C. The prior is given by the marginal distribution of Y. For practical purposes, it can often be approximated by the relative frequencies of the classes {1, 2, . . . , c} in a sample; otherwise, equal probability is assigned to each class. Given the probabilities P(Y = k), the prior is thus defined with respect to the counting measure as

π(k) = P(Y = k). (1.10)

For classification problems, the Bayes rule is the minimizer of

δ ↦ Σ_{k∈C} L(k, δ(z)) Π(k|Z = z) = Π(δ(z) ≠ Y | Z = z) (1.11)

for every z ∈ Z. By Theorem 1.8, δ∗ minimizes the Bayes risk, which in this case is the probability of misclassification:

r(Π, δ) = Σ_{k∈C} R(k, δ) π(k)
= Σ_{k∈C} ∫_Z L(k, δ(z)) dP(z|Y = k) π(k)
= Σ_{k∈C} P(δ(Z) ≠ k | Y = k) P(Y = k)
= P(δ(Z) ≠ Y).


Thus, in Bayesian classification, as opposed to the minimax classification, we minimize the overall probability of misclassification without referring to the class of the object.


2 Classification with Decision Trees

Classification trees are classifiers, i.e. they map an object to a class based on the features of that object. In the context of this chapter, a (rooted) tree is built from a set of labelled training data l = {x1, x2, . . . , xn} (where xk ∈ Ω for k ≤ n) with known class labels Y(x1), Y(x2), . . . , Y(xn). This allows it to classify new data from the general population Ω from which the training data stem. A tree classifies objects by creating a taxonomy, i.e. a partition of the feature space V. At each node, the training data are split into a subset that does and a subset that does not have a certain feature.

We work in a Bayesian set-up, i.e. we try to minimize the probability of misclassification. However, no Bayesian updating procedure will be considered. Note that the number of possible decision rules is usually unmanageably large; since there is no general procedure for finding the minimizer of the misclassification probability in a reasonable amount of time, the Bayes classifier is usually intractable.

This chapter describes the construction of tree classifiers as an example of supervised learning, introduces the trade-off of bias and variance as well as overfitting and then examines methods of reducing variance.

2.1 Supervised learning

In supervised learning, a set of training data l = {x1, . . . , xn}, where x1, . . . , xn are assumed to be an i.i.d. sample from Ω, is given alongside the class labels Y(x1), . . . , Y(xn). A function T then analyzes l and maps it to a classifier t : Ω → C such that t can generalize to different input, i.e. such that it can classify new examples. Many techniques exist to construct such classifiers. In what follows, we will focus on decision trees.

2.2 Tree classifiers

A tree classifier, or 'decision tree', t, built from training data l, classifies an object x by asking a series of queries about the object's feature vector. In the case at hand, each object is assumed to be fully defined by a vector of binary features; thus Ω = {0, 1}^M, where M is the total number of features. The classification function Y : Ω → C is deterministic, contrary to the default set-up in section 1.2. Each node in the tree represents a query, and depending on the answer the object moves to one of the (usually two) children, where another query is posed. The leaves thus represent a partition of the feature space. This recursive partitioning is also called a taxonomy. An object x ends up


at a leaf as it descends down the tree t. Since each leaf is assigned a class label, the tree can then be seen as a random variable t : Ω → C.

2.2.1 Construction of tree classifiers

As we will see later on, the number of possible trees grows exponentially with the number of features in Ω. Therefore, it is normally infeasible to find the optimal tree given a set of training data l. Nonetheless, algorithms have been developed to find reasonably accurate trees. Hunt's algorithm, a general scheme from which many tree-induction algorithms derive, works according to the following recursive definition for a given node m and the subset lm of l which arrives at that node:

• Step 1: If all objects in lm have the same class, m is a leaf.

• Step 2: If not all objects in lm belong to the same class, a query is selected which

further splits lm into child nodes, based on the answer to the query.

It is possible that one of the child nodes created in step 2 is empty, i.e. none of the training data are associated with it. In this case, the node becomes a leaf.
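A minimal sketch of this recursion in Python; the split-selection rule and any extra stopping criterion are left as placeholders here, since they are exactly the design choices discussed next.

# Sketch of Hunt's recursion. `data` is a list of (feature_vector, label) pairs,
# `choose_split` maps the data at a node to a function feature_vector -> child key,
# and `should_stop` is an optional additional stopping criterion.
def build_tree(data, choose_split, should_stop=lambda data: False):
    labels = [label for _, label in data]
    majority = max(set(labels), key=labels.count)   # used for leaves and empty children
    node = {"label": majority, "children": None}
    # Step 1 (or the extra stopping criterion): the node becomes a leaf.
    if len(set(labels)) == 1 or should_stop(data):
        return node
    # Step 2: select a query and split the data among the child nodes.
    split = choose_split(data)
    groups = {}
    for x, y in data:
        groups.setdefault(split(x), []).append((x, y))
    node["split"] = split
    node["children"] = {answer: build_tree(subset, choose_split, should_stop)
                        for answer, subset in groups.items()}
    return node

def classify(node, x):
    while node["children"]:
        child = node["children"].get(node["split"](x))
        if child is None:        # empty child node: fall back to this node's class
            break
        node = child
    return node["label"]

For a binary feature j, choose_split(data) could simply return lambda x: x[j]; a concrete, impurity-based way of picking j follows in the next subsection.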

Hunt's algorithm leaves two questions open which are crucial for the design of a tree classifier. Firstly, a splitting criterion has to be selected in step 2. Secondly, a stopping criterion other than the one in step 1 should be defined. It is possible that a large number of splits is needed before the remaining training data belong to the same class; N consecutive splits mean that the tree has at least 2^N − 1 nodes if the process is not stopped at any branch. Depending on the number of training data, the amount of available computing power and the accuracy needed, the tree can therefore quickly become too large to handle computationally.

Splitting criterion

We will therefore set up a framework for analyzing the probability distributions at each node. The joint probability distribution over Ω × C at node m is taken to be known, as it can be estimated by the empirical distribution of the examples at m. In particular, there is a marginal distribution P(·|m) over the classes at m. Let A be a binary feature to split on, which can take values a0 = 0 or a1 = 1. By Bayes' Theorem, the split induces a new distribution over classes at each child node,

P(k|m, A = ai) = P(A = ai|m, k) P(k|m) / P(A = ai|m). (2.1)

All of these probabilities are estimated by the empirical distributions of the training data.

It should be noted that trees can split on binary, nominal, ordinal (discrete) or continuous features. For simplicity we focus on binary features, although the analysis can be extended to other attribute types. Note that nominal features and (finite-dimensional) ordinal and continuous features can be approximated by multiple binary splits.

Let P(k|m) denote the probability that x has class k given that it ends up at node m. To classify x with maximal accuracy, we would like this probability to be as close as possible to 0 or to 1, all else being equal. The least favorable case is a uniform distribution over the classes.

This intuition is formalized via 'impurity' measures of a distribution. A common requirement for a measure of impurity is that it is zero when one class has all probability mass and maximal when the class-distribution is uniform. Three common measures are the Shannon entropy

H(m) = − Σ_{k=1}^{c} P(k|m) log2 P(k|m), (2.2)

where we define 0 log2 0 = 0, the Gini index

G(m) = Σ_{i≠j} P(i|m) P(j|m) = 1 − Σ_{k=1}^{c} [P(k|m)]², (2.3)

and the classification error

PE∗(m) = 1 − max_k P(k|m). (2.4)

The entropy can be understood as the expected amount of information (for some information measure) contained in the realized class of a random object drawn from the population at m. The Gini index can be viewed as the expected error when the class is guessed at random from C with distribution P(·|m).
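The three measures are straightforward to compute; a small sketch in Python (NumPy assumed, p being the vector of class probabilities P(·|m)):

import numpy as np

def entropy(p):
    p = p[p > 0]                      # convention: 0 log2 0 = 0
    return -np.sum(p * np.log2(p))

def gini(p):
    return 1.0 - np.sum(p ** 2)

def classification_error(p):
    return 1.0 - np.max(p)

# The uniform distribution over two classes maximizes all three measures:
p = np.array([0.5, 0.5])
print(entropy(p), gini(p), classification_error(p))   # 1.0 0.5 0.5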

The feature to split on is then selected at each node in a locally optimal way, namely such that it minimizes the remaining impurity. Rather than merely minimizing the impurity of the data at one child node of m, the splitting criterion should minimize the expected impurity across all child nodes. Minimizing the entropy or the Gini index is not the optimal way of splitting just before the leaves: as we have seen in chapter 1, the Bayesian classifier picks the class k with the highest probability P(k|m), so if the child nodes of m will be leaves, it makes sense to minimize the classification error itself. However, the classification error only rewards increasing the maximal probability of one class, whereas the other measures offer a more general reduction of impurity across classes. Entropy and Gini index are therefore more useful for minimizing impurity several splits onward as well.

It is easy to see that all measures are indeed zero when the probability is concentrated on one class. Figure 2.1 shows that they are maximal for a uniform distribution when there are two classes.

Finally, the feature A with possible values a1, . . . , ad to split on is selected as the maximizer of the expected reduction of impurity under the impurity measure i, the 'information gain'

i(P(·|m)) − Σ_{k=1}^{d} P(ak|m) i(P(·|ak, m)). (2.5)

When the classification problem is binary and the training data are split into two children at each split, the impurity measures can also be visualized as a function of p_i, the fraction of objects with class i, as in figure 2.1.

Figure 2.1: Comparison of impurity measures for binary classification problems [6].

Note that the uniform distribution indeed corresponds to the value p1 = 0.5, which maximizes all three measures, and that both p1 = 1 and p2 = 1 indeed minimize all measures, as required.

Empirical studies have demonstrated that different measures of impurity yield similar results. This is due to the fact that they are quite consistent with each other [4], as figure 2.1 suggests. Another feature revealed here is that these functions are concave.

The following proposition assures that the information gain is always non-negative when the impurity i(P ) is strictly concave.

Proposition 2.1. Let P be the space of probability measures on the space C, endowed with the σ-algebra P(C), the power set of C. Let i : P → R be a strictly concave impurity measure. Then a split on any feature A results in a non-negative information gain. The information gain is zero if and only if the distribution over classes P(·|A = ai) is the same in all children.

Proof. We denote by Pm the distribution P(·|m). Jensen's inequality assures that for concave i,

Σ_{i=0,1} P(ai|m) i(P(·|ai)) = E_{A|m} i(P(·|A))
≤ i(E_{A|m} P(·|A))
= i(Σ_{i=0,1} P(ai|m) P(·|ai))
= i(P(·|m)),

with equality if and only if P(k|ai) = P(k|m) for all i and k, i.e. if and only if the distribution over classes is the same in all children (by strict concavity of i). The left-hand side is the expected impurity of the children, so the information gain (2.5) is non-negative.

Impurity can also be viewed at the level of the entire tree as

I(T) = Σ_{m∈M(T)} P(m) i(P(·|m)), (2.7)

where M(T) is the set of leaves of T. If the impurity measure taken is the classification error, then this gives the classification error on the training set, or the general probability of misclassification if P is not estimated by the empirical distribution of the training data at each node.

2.2.2 Bias and Variance

In supervised learning two types of error are distinguished. In the context of classification the training error describes the share of misclassifications made on the training data,

E_T = (1/|l|) Σ_{x∈l} 1_{T(x)≠Y(x)}, (2.8)

where l = {X1, X2, . . . , Xn} are the training data, drawn i.i.d. from Ω. By contrast, the generalization error refers to the expected share of misclassifications on new data.

Definition 2.2. The generalization error of a classifier T : Ω → C, denoted PE∗, is

PE∗ = P(T(X) ≠ Y(X)) = E_X L(Y(X), T(X)), (2.9)

where L is the loss function L(k, l) = 1_{k≠l}.

Bias and variance can be explained in the context of the space of classifiers when there is a norm || · || on this space. Usually the model T, which includes all classifiers under consideration, does not include the true classification function Y. Assuming that T is closed and convex, there exists a T̃ ∈ T which minimizes ||T − Y|| over the model. The model bias is then defined as ||T̃ − Y||. Since the algorithm for finding a classifier does not necessarily find T̃, the chosen classifier T̂ varies around T̃. This allows us to define variance as E||T̂ − T̃||².

A natural consequence of this perspective is that larger models will exhibit a smaller bias, as ||T̃ − Y|| is likely to be smaller. However, they also allow for greater fluctuation of T̂ around T̃, increasing variance. A model is large when, e.g., it has a large number of parameters. As this is the case for deep trees, they tend to have high variance. Since no default norm on T is actually given, bias and variance are hard to quantify. Following the usual definition, the bias would be defined as E_X d(Y(X), T(X)), the expected distance between the classification function and the classifier in some metric d. Neither Amit and Geman nor Breiman specify the metric, and instead bias is taken to be the expectation of the training error, E_{X1,X2,...,Xn} E_T.


The error due to variance is the amount by which the prediction based on one training set differs from the expected prediction over all possible training sets. In cases where T(X) takes a numerical value this can be written as E_X(T(X) − E_L T(X))². However, given that the class label is not generally a numerical variable, neither the expected class nor the difference between classes is clearly defined. Amit and Geman [3] as well as Breiman [5] therefore do not quantify variance, or simply define it as the gap PE∗ − E_T between generalization and training error. Variance can therefore be seen as the component of the generalization error that stems from feeding the classifier previously unknown data.

2.2.3 Model overfitting

When variance increases too much due to a large model, the result is considered overfitting. When grown too large, tree classifiers are prone to overfitting, as they have many parameters, which makes for a large model.

On the one hand, the tree-building procedure can continue until every unique example in the training set ends up in its own leaf. By assigning each leaf the class of its particular example, the tree can classify the whole training set correctly. On the other hand, a tree built this way depends strongly on the training data used, so variance will be high. Thus, when a tree is grown larger, the training error (or bias) decreases while variance tends to increase, as displayed in figure 2.2. From this observation we can conclude that there is an optimal amount of model complexity which minimizes the generalization error. While sufficiently large trees have low bias, the increase in variance cannot be rigorously shown in the general case, and there is no general agreement on the precise reasons for overfitting [3].

2.2.4 Pruning

As variance is a key challenge in decision tree learning, methods to reduce a tree’s complexity while preserving accuracy on the training set are highly sought-after. Pruning is one such method. It works by removing the branches of a fully-grown tree which add little to reducing bias but do increase variance. We will discuss pre-pruning and cost-complexity pruning (the most well-known form of pruning).

Pre-pruning reduces the size of a tree by specifying an additional stopping criterion (for an overview, see [12]). By default, a node is not split further if all examples at the node belong to the same class; with pre-pruning, induction stops at a node as soon as either this or the additional criterion is met. Pre-pruning is computationally parsimonious because it prevents the further induction of a tree and has no post-pruning step. Ideally, pre-pruning methods stop the induction when no further accuracy gain is expected. In practice, however, they filter out both relevant and irrelevant splits. For this reason, they have mostly been abandoned except for large-scale applications where the training set is too large to grow a full tree. Various stopping criteria are possible; e.g. Amit and Geman [3] stop the induction when the number of examples at a node drops below a threshold.

Regular pruning (also called post-pruning) reduces a fully-grown tree T to a smaller tree T′ by removing sub-trees and replacing them with a leaf.

Figure 2.2: Relationship of bias, variance, generalization error and model complexity [6].

Minimal cost-complexity pruning (MCCP), proposed by Breiman et al. in 1984 [9], is an early, interesting and widely used method which we will examine in detail.

Let T be the space of possible trees, let L ⊂ Ω be any set of test examples used to evaluate the performance of a tree, and assume that the class labels of the examples in L are known. Furthermore, let s(T) be the size of a tree, measured by the number of its leaves; s is proportional to the number of nodes, which equals 2s(T) − 1. Then, let R : T → R be a measure of a tree's performance that sums contributions from the leaves,

R(T) = Σ_{m∈M(T)} R(m), (2.10)

where M(T) is the set of leaves of T.

Breiman et al. let R be the number of misclassifications on the training set or on a test set. These measures too can be expressed in the form of (2.10) by letting R(m) be the number of examples that are misclassified at node m. When estimating PE∗, a separate test set of labelled data is used. Such a set may be called either a pruning set or a cross-validation set, depending on whether it is also used to estimate the generalization error of the eventual pruned tree. R(m) is then defined as the number of examples from the test set arriving at node m that are misclassified. Similarly, if R were the entropy, then R(m) is the entropy of the distribution P(k|m), estimated by its empirical counterpart P̂_L(k|m) as usual.

The cost-complexity function is then defined as

Rα(T) = R(T) + α s(T), (2.11)

where α > 0 scales the penalty assigned to more complex trees in terms of the number of leaves s(T). A method to choose α will be discussed below. MCCP chooses the subtree T of the full tree T0 which minimizes Rα.

Naturally, α = 0 should be used if R(T) is the generalization error estimated with a test set, since Rα is meant to approximate PE∗ and the size of a tree need only be penalized because it increases the generalization error. Breiman et al. showed that for the full tree T0 there is a sequence of nested trees Tk which minimize Rα on the intervals [αk, αk+1), for some partition of [−∞, ∞] into such intervals (with {αk} ⊂ [−∞, ∞] increasing). Moreover, as we will show, a simple algorithm can find such a sequence. There may be multiple subtrees of T0 which minimize Rα; if one of these is a sub-tree of all others, call it T(α). We will see an algorithm to find T(α).

Let T be a tree with more than one node and let Tm be the sub-tree of T with root m. We write R(m) for the R-measure of the node m when T is pruned at m; similarly, s(m) is always 1. Define the function g, which compares the increase in R with the reduction in size when pruning at m, as

g(m, T) = (R(m) − R(Tm)) / (s(Tm) − s(m)). (2.12)

Note that when g(m, T ) > α then Rα(m) > Rα(Tm) and vice-versa.

For a rooted tree T, define the set of children of a node m as all the nodes which are connected to the root via m and are one edge further away from the root than m. Then for a rooted tree Tm, the set of branches of Tm is defined as B(Tm) = {Tb : b is a child of m}. From (2.10) it follows that we can write Rα(T) = Σ_{Tb∈B(T)} Rα(Tb). For the proof of the following proposition it is important to note that any node in the original tree T0 either has more than one child or none, by Hunt's algorithm (see 2.2.1). The same is true for a pruned sub-tree of T0, because pruning at a node removes all of its branches. This guarantees that any sub-tree T′ of a tree T with the same root node as T, created by pruning T, has as many branches as T (unless T′ is the trivial tree).

Proposition 2.3. Let α ∈ R and create an enumeration of the nodes of a tree T such that each parent node comes after its child nodes. Consider each node m in this order and prune the current tree T′ at m if Rα(m) ≤ Rα(T′_m). The resulting tree is T(α).

Proof. Firstly, a tree T with root node m will be called optimally pruned if any sub-tree that can be created by pruning T and that is also rooted at m has a strictly larger value of Rα; the optimally pruned version of T is what we denote by T(α).

Now let T be a tree and let m_n, n = 1, . . . , 2s(T) − 1, be an enumeration of its nodes such that each node precedes its parent. We prove the proposition by induction on n. T_{m_1} is optimally pruned since it is a single node. Let n > 1 and assume that all sub-trees T′_{m_k}, k < n, of the current tree T′ are optimally pruned. There are two options. If Rα(m_n) ≤ Rα(T′_{m_n}) we prune at m_n, otherwise T′_{m_n} remains. In the first case the resulting sub-tree is trivial and therefore optimally pruned. In the second case, suppose a strict sub-tree T″_{m_n} of T′_{m_n} could be created by pruning T′_{m_n} such that Rα(T″_{m_n}) ≤ Rα(T′_{m_n}). T″_{m_n} is not trivial, since otherwise Rα(T″_{m_n}) = Rα(m_n) > Rα(T′_{m_n}); hence it has the same number of branches as T′_{m_n}. Note that Rα(T′_{m_n}) = Σ_{b′∈B(T′_{m_n})} Rα(b′), because Rα is a sum over the leaves. But then there must be a branch b′ ∈ B(T′_{m_n}) whose counterpart b″ in T″_{m_n} is a strict sub-tree of b′ with Rα(b″) ≤ Rα(b′). By the assumption that T″_{m_n} can be created by pruning T′_{m_n}, b″ can indeed be created by pruning b′.

This contradicts the assumption that all branches of T′_{m_n} were optimally pruned. It follows that T′_{m_n} is optimally pruned, and by induction the algorithm yields an optimally pruned tree.

Given a value of α, this allows us to find T(α). We now show how to find the sequence αk and the tree sequence Tk. From now on we assume that s is monotonically increasing in the number of nodes of T.

Proposition 2.4. Let T be a tree and let α1 be the minimum of g(m, T) over all nodes m of T that are not leaves. T is optimally pruned whenever α < α1. When pruning every node m with g(m, T) = α1, the result is T1 = T(α1). Furthermore, g(m, T1) > α1 for every non-leaf node m of T1.

Proof. T is automatically optimal for α < α1 by Proposition 2.3: if g(m, T) > α for every non-leaf m, then Rα(m) > Rα(Tm) for every non-leaf m, so no node is pruned and T is already optimal.

Now let α = α1 and prune according to Proposition 2.3. Each time we prune at a node m we have g(m, T) = α1 and hence Rα1(m) = Rα1(Tm), so Rα1 of the sub-tree rooted at any node above m is not changed. Thus Rα1(m) ≤ Rα1(T′_m) for the current tree T′ if and only if Rα1(m) ≤ Rα1(Tm) for the original tree, which is equivalent to g(m, T) ≤ α1. So the algorithm of Proposition 2.3 is equivalent to pruning at each node with g(m, T) ≤ α1, and hence the latter also results in T(α1).

Lastly, let m be a non-leaf node that remains after pruning with α = α1, so that g(m, T) > α1. Since Rα1(Tm) = Rα1((T1)m), we have

R(m) − R((T1)m) = [R(m) − R(Tm)] − α1[s(Tm) − s((T1)m)]
= g(m, T)[s(Tm) − s(m)] − α1[s(Tm) − s((T1)m)]
> α1[s(Tm) − s(m)] − α1[s(Tm) − s((T1)m)]
= α1[s((T1)m) − s(m)],

which is exactly g(m, T1) > α1.

For the following proposition, recall that given an initial tree T0, T(α) is a sub-tree that minimizes Rα and is a sub-tree of all other sub-trees of T0 that also minimize Rα.

Proposition 2.5. Let β > α. Then T(β) is a sub-tree of T(α) and T(β) = T(α)(β).

Proof. Enumerate the nodes of T0 as in the proof of Proposition 2.3, and write Tm(γ) for the optimally pruned version, with parameter γ, of the sub-tree rooted at m. We show by induction that T_{m_n}(β) is (weakly) a sub-tree of T_{m_n}(α) for every n, and therefore that T(β) is a sub-tree of T(α).


For n = 1, T_{m_1} is a leaf, so the claim holds. Let n > 1 and assume that T_{m_k}(β) is a sub-tree of T_{m_k}(α) for k < n. At node m_n we prune T_{m_n}(α) if Rα(m_n) ≤ Rα(T_{m_n}(α)), and analogously for T_{m_n}(β) with β in place of α. We consider two cases. If Rα(m_n) > Rα(T_{m_n}(α)), then T_{m_n}(β) is automatically a sub-tree of T_{m_n}(α), because either T_{m_n}(β) is trivial or all of its branches are sub-trees of the corresponding branches of T_{m_n}(α) by the induction hypothesis. If Rα(m_n) ≤ Rα(T_{m_n}(α)), we need to show that Rβ(m_n) ≤ Rβ(T_{m_n}(β)), so that both T_{m_n}(α) and T_{m_n}(β) are trivial.

Now T_{m_n}(α) minimizes Rα over sub-trees of T_{m_n}, so Rα(T_{m_n}(α)) ≤ Rα(T_{m_n}(β)). Thus we have

Rβ(m_n) = Rα(m_n) + (β − α)s(m_n)
≤ Rα(T_{m_n}(α)) + (β − α)s(m_n)
≤ Rα(T_{m_n}(β)) + (β − α)s(m_n)
= Rβ(T_{m_n}(β)) − (β − α)[s(T_{m_n}(β)) − s(m_n)]
≤ Rβ(T_{m_n}(β)).

Finally, T(β) = T(α)(β): T(β) minimizes Rβ over all rooted sub-trees of T0, a collection which includes the rooted sub-trees of T(α), and since T(β) is itself a rooted sub-tree of T(α), it is the Rβ-minimizing sub-tree of T(α), i.e. T(α)(β).

These propositions suffice to find the sequence αk and the corresponding trees T(αk). Having found T(α1), the algorithm of Proposition 2.3 can be applied to this tree to find T(α2) (because g(m, T(α1)) > α1 for each non-leaf of T(α1) by Proposition 2.4), and so forth. Eventually, T(αp) is the trivial tree for some p. For αi−1 < α < αi, i = 1, . . . , p, T(α) = T(αi−1) by Propositions 2.4 and 2.5. Similarly, Proposition 2.5 shows that T(α) = T(αp) for α > αp.

Now that we can find the correct tree for each value of α, a reasonable value of α has to be chosen. In general, we seek to minimize the generalization error (although in some cases computation time should be minimized simultaneously). If we use a test set which is drawn i.i.d. from Ω, independently of the training set, and let R be the error rate, then R(T) already forms an unbiased estimate of the generalization error; we only penalize size because it contributes to variance, which in turn contributes to the generalization error, so in this case we can leave α = 0. If a test set were i.i.d. drawn from Ω, the expectation of R(T(α)) would be the generalization error of T(α). Absent a test set, we therefore need to estimate this expectation with minimal bias and select α to minimize it. Breiman et al. suggest using cross-validation. The training set is split into a partition J, and for each part j ∈ J a tree T is constructed from the remaining |J| − 1 parts, along with the corresponding sequence αk and the trees T(αk). This gives a piecewise constant function rj(α) = R(j, T(α)), the error of T(α) on part j. They then compute the average function α ↦ Σ_{j∈J} rj(α)/|J| and select α as the minimizer of that function.

Since the training set is disjoint from the test set for each j, rj(α) is expected to be a

reasonably unbiased estimate for the expectation of R(T (α)) (where the training set and test set are i.i.d. drawn from Ω). Averaging |J | cross-validation experiments reduces the variance of this estimate. If |J | is small, the size of the training set is considerably smaller (e.g. to 2/3 of its size for |J | = 3 as used by Breiman et al.) so the resulting trees


may have a higher error-rate on the test set. But this does not mean that the relative estimates of R(T (α)) for different α are seriously biased. Therefore, the minimizing value of α does not necessarily change.

For reference, surveys on decision-tree pruning methods to avoid overfitting are given by Breslow and Aha [12] and by Esposito et al. [13]. Other typical pruning methods include reduced error pruning, pessimistic error pruning, minimum error pruning, critical value pruning, cost-complexity pruning, and error-based pruning. Quinlan and Rivest proposed using the minimum description length principle for decision tree pruning in [7].


3 Random Trees and Forests

The final chapter starts with section 3.1 on bootstrapping, a commonly used technique in applied statistics that illustrates the principle behind the random selection of training data used in section 3.3. We continue with section 3.2, which introduces an application of trees and goes into randomization of the feature selection. Finally, section 3.3 covers the random forest algorithm.

3.1 Bootstrapping

We start this chapter with an introduction to bootstrapping, a technique that will later be applied to the training data of random forests. We illustrate it with an example. Given i.i.d. random variables X1, X2, . . . , Xn ∼ D, we want to estimate the standard deviation of a statistic θ̂(X1, X2, . . . , Xn). In a typical case, θ̂ might for example be the sample standard deviation, which has a distribution that is often hard to derive analytically. Writing the standard deviation of θ̂ as σ(D, n, θ̂) = σ(D) emphasizes that it is merely a function of the underlying distribution D, as n and θ̂ are given parameters. The bootstrap procedure, which we will elaborate on shortly, yields an arbitrarily accurate estimate of σ(·) evaluated at D̂, the empirical distribution of the realizations of X1, X2, . . . , Xn. Since D̂ is the non-parametric maximum likelihood estimate of D according to Efron [11], σ(D̂) serves as an estimate for σ(D). The estimate is not necessarily unbiased, but yields good results in practice [11].

The function σ(·) usually cannot be easily evaluated analytically. However, the bootstrap procedure below can generate arbitrarily close estimates using the Law of Large Numbers.

1. Calculate the empirical distribution D̂ of the given sample, with distribution function

D̂(t) = (1/n) Σ_{i=1}^{n} 1_{Xi≤t}. (3.1)

2. Draw an i.i.d. 'bootstrap' sample

X1∗, X2∗, . . . , Xn∗ ∼ D̂ (3.2)

from D̂ and calculate θ̂∗ = θ̂(X1∗, X2∗, . . . , Xn∗).


3. Repeat step 2 a large number B of times independently. This yields bootstrap replications θ̂1∗, θ̂2∗, . . . , θ̂B∗. Finally, calculate

σ̂(D̂) = ( (1/(B − 1)) Σ_{b=1}^{B} [θ̂b∗ − θ̂M∗]² )^{1/2}, (3.3)

where θ̂M∗ refers to the mean value of the bootstrap replications.

As (3.3) is the sample standard deviation of the random variable θ̂∗ from step 2, the Law of Large Numbers guarantees that it converges D̂-a.s. to the standard deviation σ(D̂) of θ̂∗ as B → ∞. The proof is omitted.
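A minimal sketch of the three steps (Python with NumPy; the sample and the statistic are illustrative choices, not taken from the text):

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_std(sample, statistic, B=2000):
    """Bootstrap estimate of the standard deviation of `statistic`."""
    n = len(sample)
    replications = np.empty(B)
    for b in range(B):
        # Step 2: draw X1*, ..., Xn* i.i.d. from the empirical distribution D-hat,
        # i.e. resample the data with replacement, and evaluate the statistic.
        resample = rng.choice(sample, size=n, replace=True)
        replications[b] = statistic(resample)
    # Step 3: sample standard deviation of the B bootstrap replications.
    return replications.std(ddof=1)

sample = rng.normal(loc=0.0, scale=1.0, size=100)
print(bootstrap_std(sample, np.median))   # approximate standard deviation of the sample median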

3.2 Randomized trees for character recognition

This section examines the classification problem of character recognition and discusses results from Amit and Geman [3], who applied feature randomization to classify images of characters with tree classifiers. Their method has two advantages: it makes tree classification possible when the set of features is unmanageably large, and it reduces the dependence between distinct trees (as measured by the correlation defined in section 3.3.2), making it more useful to combine multiple trees.

3.2.1 Introduction

Images of handwritten or machine-printed characters commonly have to be recognized in order to process text automatically. Examples include the automatic reading of addresses on letters, the scanning of books, and assistive technology for visually impaired users. A classifier program analyzes the features of an image of a character and classifies it as a particular character, so that a digital text is produced. Artificial neural networks, boosting algorithms such as AdaBoost, and tree classifiers are well-known methods for character recognition. Amit and Geman introduced a new approach in 1997 [3] by inducing a tree classifier and randomly selecting the features to consider splitting on during the induction.

Character recognition has various properties that are relevant for choosing the right classification approach. Including special symbols, there may be hundreds of classes and substantial variation of features within a class. Given that the images are binary with M pixels, the space of objects has 2^M elements and the feature space V is correspondingly large. Due to the large feature space, the challenge is to navigate it efficiently when searching for optima.

This chapter starts with a description of the particular features that have to be extracted from binary images in section 3.2.2. In section 3.2.3 we introduce the particular form of selecting splits for character recognition, and 3.2.4 explains how and why the splitting features are randomized.


3.2.2 Feature extraction

One of the main challenges for the classification of character images is the selection of the right kind of features that are used to distinguish the images. Amit and Geman extract particular features from pixel images by assigning a label to each pixel and letting the feature to split on be a particular geometric arrangement of tags.

Each pixel is labelled based on the 4×4 pixel subimage that contains the pixel in its top left corner. Since there are 2^16 possible subimage types, they use a decision tree on a sample of images to narrow the space of subimages down to a set S of 62 tag types. The criterion for splitting at a node t is dividing the 4×4 pixel subimages at the node as equally as possible into two groups. The resulting tags can loosely be described as 'all black' or 'white at the top, black at the bottom', etc.

The eventual classification tree uses splits on geometric arrangements of such tags. These tag arrangements constitute the feature space V, which is constructed as follows. Any feature is a tag arrangement, i.e. a labelled directed graph: each vertex is labelled with a tag s ∈ S and each edge is labelled with a type of relation that represents the relative position of two tags. The eight types of relations correspond to the compass headings 'north', 'northeast', 'east', etc. More formally, two vertices are connected by an edge labelled k ∈ {1, 2, . . . , 8} if the angle of the vector u − v falls into the k-th of eight equal sectors of the circle (each spanning an angle of 2π/8), where u and v denote the two pixel locations in the image. Let A be the set of all possible arrangements with at most 30 tags; then V = {vA : A ∈ A} is the feature space, where vA : Ω → {0, 1} indicates whether a given object contains the arrangement A. This still leaves an unmanageably large number of features to consider splitting on at each node. The next section explains how we can navigate the feature space more efficiently.

3.2.3 Exploring shape space

As the feature space consists of graphs, there is a natural partial ordering on it: a graph precedes any of its extensions, where extensions are made by successively adding vertices or labelled relations. Small graphs produce rough splits of shape space. More complex ones contain more information about the image, but few images contain them. Therefore, Amit and Geman start the tree induction by only considering small graphs, which they call binary arrangements, and search for the best split among their extensions. As we will see later, this makes sure that at each node only the more complex arrangements are considered which are likely to be contained in an image at that node.

A binary arrangement is a graph with only one edge and two connected vertices. The set of binary arrangements will be denoted B ⊂ V. A minimal extension of an arrangement A is any graph created by adding exactly one edge, or one vertex and an edge connecting the vertex to A.

Now the tree is induced following the recipe from section 2.2.1. At the root, the tree is split on the binary arrangement A0 that most reduces the average impurity I.

Amit and Geman use the Shannon entropy as the uncertainty measure as opposed to the classification error or the Gini index. At the child node which answered ‘no’ to the chosen arrangement, B is searched again and so forth. At the other child node,


the minimal extension of A0 which most reduces the uncertainty is chosen as the split.

Continuing this pattern, at each node a minimal extension of the arrangement that last answered ‘yes’ is chosen as the splitting criterion. When a stopping criterion is satisfied, the algorithm stops. Amit and Geman stop the process when the number of examples at a leaf is below a threshold which they do not further specify.

3.2.4 Feature randomization

Both pre-pruning and considering only binary arrangements and minimal extensions at each node restrict the computational resources needed considerably, but not sufficiently: at each node the number of features to consider is still very large, both the number of new binary arrangements and the number of extensions of arrangements. Amit and Geman suggest simply selecting a uniformly drawn random subset of the minimal extensions (or of the binary arrangements if none has been picked yet) to consider at each node.

This randomization also makes it possible to construct multiple different random trees from the same training data l. Let the trees t1, t2, . . . , tN be a random sample from a probability space (T, A, PT) of trees with some probability distribution PT. If there were a norm || · || on T, the Law of Large Numbers would guarantee that the sample mean converges to the population mean of the trees constructed from l, as long as the method of random feature selection generates independent, representative samples from this distribution.

Randomization also explains why the variance of the average generalization error decreases as N increases. Let G(t) denote the generalization error of a classifier t; this is the expected loss as given in Definition 2.2. Assuming that the tk are i.i.d. realizations of a random classifier T, representatively sampled from T,

(1/N) Σ_{k=1}^{N} G(tk) → E_T G(T) (3.4)

as N → ∞ by the Law of Large Numbers. The same holds for any other function H(T) as long as its expected value exists and is finite. Although this does not prove that the combination of multiple trees has a lower expected generalization error than a single tree, the reduction in variance is suggestive, because much of the error for single trees is due to high variance [3].

3.3 Random Forests

As discussed in section 2.2.2, tree classifiers are prone to overfitting, i.e. lowering bias at the expense of high variance. The larger the tree, the more likely overfitting becomes. Breiman [5] offers a solution to this problem by combining multiple randomized trees and having them vote on the classification. He terms this algorithm Random Forest.

Breiman uses two forms of randomization to create multiple trees given a training set. Firstly, the same feature randomization discussed in the previous section and secondly,


a different random subset of the training data is used for each new tree, a form of bootstrapping. The next subsection gives an explanation of the random forest algorithm, followed by its relation to bootstrapping and proofs of its convergence as well as an upper bound for its generalization error.

3.3.1 The random forest algorithm

Random forests work by growing N randomized trees, using a random subset of the training set L. The latter randomization is a form of bootstrapping which Breiman calls bagging. For each tree, his algorithm selects an i.i.d. sample from the training data which is two thirds the size of the training data. The algorithm can be described as follows.

Definition 3.1. A random forest is a classifier consisting of a collection of tree-structured classifiers {T(·, lk) : k = 1, . . . , N}, where the lk are independent, identically distributed random vectors drawn uniformly with replacement from the training data l, and where each of the trees casts a unit vote for the most popular class at input x.

Since the random forest algorithm involves multiple forms of randomization, it is difficult to develop a mathematical notation that does justice to all its aspects. Neither Amit and Geman nor Breiman nor any other paper studied gives a full mathematical account of the algorithm. The following is an attempt to give such an account.

Let (Ω, F, P) be the probability space of objects to classify and let C = {1, 2, . . . , c} be the space of classes. Furthermore, let Y : Ω → C be the classification function. Let L = (X1, . . . , Xn) be the training data, drawn i.i.d. from Ω with known class labels.

Let T be the space of functions Ω → C and suppose there is a σ-algebra F_T on T. Then there is a probability measure PT that assigns probabilities to measurable subsets of T according to a randomization procedure such as the one outlined in 3.2.4. Given the distribution P on Ω and the probabilities over T conditioned on each possible realization of L, PT is uniquely defined. In other words, we can derive a distribution PT over the space of all trees if we know, for each possible set of training data, which trees can be created from that set and with which probability. Hence, the set of functions in T which cannot be created via any training set according to Amit and Geman's tree induction process is a PT-null set.

Let lk be a realization of a random vector Lk of length z < n, drawn uniformly i.i.d. and with replacement from l. The tree T(·, lk) ∈ T, k = 1, . . . , N, is a T-valued random element drawn i.i.d. from T conditioned on a single realization l of L. Lastly, the realizations tk feed into a random forest as follows.

Let (T^N, σ(F_T^N), P_T^N) be the product probability space of vectors of length N with random entries in T. Then the random forest is a random element F : T^N → T. Each realization f = (t1(·, l1), . . . , tN(·, lN)) is defined, for each x ∈ Ω, as

f(x) = argmax_{a∈C} Σ_{k=1}^{N} 1(tk(x, lk) = a).

In other words, each tree casts one vote and the most popular class is selected.

The randomization of training data works on the same principle as bootstrapping: we sample training data from the empirical distribution of an i.i.d. sample from the underlying distribution P of the training data. No precise statements can be made about convergence in the space T. However, we can once again make the assumption that there is a norm || · || on T such that the combination of the trees tk, k ≤ N, into a forest corresponds to taking their average. Furthermore, grant the assumption that there is a way to define an expected tree T̄ ∈ T when the training data L are i.i.d. with distribution P, and an expected tree T̄∗ ∈ T when the training data are i.i.d. samples from the empirical distribution P̃ of a realization l of the training data. Then, following Efron [11], T̄∗ is the nonparametric maximum likelihood estimate for T̄, because P̃ is the nonparametric maximum likelihood estimate for P. Finally, let F = (T1(·, L1), . . . , TN(·, LN)) be a random forest with bootstrapped training data Lk ∼ P̃^n, k = 1, . . . , N. Then the Law of Large Numbers guarantees that F converges P̃-a.s. to T̄∗; in other words, the variance of F, E_{L1,...,LN} ||F − T̄∗||² → 0 as N → ∞.

3.3.2 Convergence of random forests

Breiman derives an upper bound for the generalization error of a random forest, which is proven in this subsection. The bound varies as a function of the strength of the individual trees and the correlation between them. That is, a lower expected generalization error of the trees lowers the bound and a higher correlation, measured by a particular function defined in this section, increases the bound.

The random forest improves accuracy through randomization and averaging of the trees, by decreasing correlation without significantly decreasing strength. When a particular tree overfits on a particular training set, other trees based on other training data are unlikely to overfit in the same way. Similarly, when a randomized tree overfits on a training set, a different realization may not do so on the same training set.

We will prove two properties of the generalization error for random forests based on Breiman’s paper. The first result establishes a limit and the second establishes an upper bound.

For clarity, we write PX for the distribution P. Let X be a random element to classify with class Y(X), drawn from (Ω, F, PX). Let the training set L be random and let lk, k = 1, . . . , N, denote a bootstrapped sample from l. For an ensemble f = (t1(·, l1), . . . , tN(·, lN)) of classifiers the generalization error is given by

PE∗ = PX( (1/N) Σ_{k=1}^{N} 1(tk(X, lk) = Y(X)) < max_{j∈C, j≠Y(X)} (1/N) Σ_{k=1}^{N} 1(tk(X, lk) = j) ). (3.5)

In other words, it is the probability that the true class receives fewer votes than the most popular other class.

The following theorem poses a limit for PE∗. This result explains why random forests do not overfit as the number of trees grows, but limit the generalization error. There is a distribution PX on Ω which generates the training data, and Lk, k ∈ N, as well as L are i.i.d. random variables from that distribution. Note that these conditions are more general than the bootstrap procedure, where the distribution is a uniform i.i.d. draw with replacement from a given realization of L.


Theorem 3.2. As the number of trees increases, for PL-almost all sequences l1, l2, . . ., PE∗ converges to

PX( PL(t(X, L) = Y(X)) < max_{j∈C, j≠Y(X)} PL(t(X, L) = j) ). (3.6)

Proof. Let S be the space of sequences l1, l2, . . . with the product probability measure M derived from the distribution PL of the Lk. Then (3.6) holds if, M-a.e. and for all x ∈ Ω and all j ∈ C,

(1/N) Σ_{k=1}^{N} 1(t(x, lk) = j) → PL(t(x, L) = j) as N → ∞. (3.7)

Given a training set lk, the set {x ∈ Ω : t(x, lk) = j} is a union of hyper-rectangles: each leaf that classifies with class j corresponds to a hyper-rectangle in the feature space for which a certain number of features are fixed and the rest are arbitrary. The number of such unions of hyper-rectangles is finite for all t(·, lk) together. Denote these unions S1, . . . , SK and define φ(lk) = i if {x : t(x, lk) = j} = Si. Generate N random classifiers and let Ni be the number of times that φ(lk) = i. Then

(1/N) Σ_{k=1}^{N} 1(t(x, lk) = j) = (1/N) Σ_i Ni 1(x ∈ Si). (3.8)

Furthermore,

Ni/N = (1/N) Σ_{k=1}^{N} 1(φ(lk) = i) → PL(φ(L) = i) as N → ∞ (3.9)

M-a.s. by the strong Law of Large Numbers. The union of the null sets on which convergence fails for some value of i is again a null set C, and outside C,

(1/N) Σ_{k=1}^{N} 1(t(x, lk) = j) → Σ_i PL(φ(L) = i) 1(x ∈ Si) as N → ∞. (3.10)

The right-hand side equals PL(t(x, L) = j), and thus (3.7) holds.

We conclude with the relationship between the generalization error and the strength of and correlation between the random classifiers. We will derive an upper bound for the generalization error, which culminates in theorem 3.8.

Definition 3.3. J(x), the most popular incorrect classification of x, is the class j ≠ Y(x) that maximizes PL(t(x, L) = j).

Definition 3.4. Given a random forest F made from realizations of the random tree T(·, L), the margin function at input x ∈ Ω is given by

m(x) = PL(t(x, L) = Y(x)) − PL(t(x, L) = J(x)). (3.11)

Naturally, the generalization error of a random forest F is PX(m(X) < 0). Note that a realization f(·, l) of F has a different margin function, and its generalization error is given by (3.5).

Definition 3.5. Given a training set l and tree t(·, l), the raw margin function is

r(l, x) = 1(t(x, l) = Y(x)) − 1(t(x, l) = J(x)). (3.12)

Evidently, m(x) = EL r(L, x).

Definition 3.6. The strength s ∈ [−1, 1] of a random forest F is

s(F) = EX m(X). (3.13)

Definition 3.7. The correlation between two classifiers t(·, l1) and t′(·, l2) is given in terms of their raw margin functions as

ρ(t, t′) = covX(r(X, l1), r(X, l2)) / (σ_{r(X,l1)} σ_{r(X,l2)}), (3.14)

where σ_{r(X,li)} refers to the standard deviation of r(X, li), i = 1, 2, holding the training set fixed.

The upper bound for the generalization error will depend on the average correlation $\bar{\rho}$ between trees. For i.i.d. training data $L, L'$ the average correlation between two random classifiers $t(\cdot, L)$, $t'(\cdot, L')$ is given by
$$
\bar{\rho} = \frac{E_{L,L'}\big[\rho(t, t')\,\sigma_{r(X,L)}\,\sigma_{r(X,L')}\big]}{E_{L,L'}\big[\sigma_{r(X,L)}\,\sigma_{r(X,L')}\big]}. \tag{3.15}
$$

Theorem 3.8. The generalization error $PE^*$ of a random forest $F$ is bounded above by
$$
PE^* \le \frac{\bar{\rho}\,\big(1 - s(F)^2\big)}{s(F)^2}. \tag{3.16}
$$

Proof. Since $m(X) < 0$ implies $|m(X) - s(F)| \ge s(F)$ (assuming $s(F) > 0$), Chebyshev's inequality gives
$$
P_X(m(X) < 0) \le P_X\big(|m(X) - s(F)| \ge s(F)\big) \le \operatorname{var}_X(m(X)) / s(F)^2. \tag{3.17}
$$
Thus, it suffices to derive an upper bound for $\operatorname{var}_X(m(X))$. Note that for any function $f$ and i.i.d. random variables $A, A'$ we can rewrite
$$
[E_A f(A)]^2 = E_A f(A)\, E_{A'} f(A') = E_{A,A'}[f(A) f(A')]. \tag{3.18}
$$


Since $m(x) = E_L r(L, x)$, identity (3.18) implies that for i.i.d. $L, L'$
$$
m(x)^2 = E_{L,L'}\big[r(L, x)\, r(L', x)\big]. \tag{3.19}
$$

We now derive a different expression for $\operatorname{var}_X(m(X))$. Note that all expectations below are integrals that satisfy the conditions of Fubini's theorem, which we will use repeatedly.
\begin{align*}
\operatorname{var}_X(m(X)) &= E_X[m(X)^2] - \big(E_X[m(X)]\big)^2 \\
&= E_X[m(X)^2] - E_X\big[E_L(r(X, L))\big]\, E_X\big[E_{L'}(r(X, L'))\big] \\
&= E_X[m(X)^2] - E_L\big[E_X(r(X, L))\big]\, E_{L'}\big[E_X(r(X, L'))\big] \\
&= E_X\big[E_{L,L'}(r(X, L)\, r(X, L'))\big] - E_L E_{L'}\big[E_X(r(X, L))\, E_X(r(X, L'))\big] \\
&= E_{L,L'}\big[E_X(r(X, L)\, r(X, L')) - E_X(r(X, L))\, E_X(r(X, L'))\big] \\
&= E_{L,L'}\big[\operatorname{cov}_X(r(X, L), r(X, L'))\big] \\
&= E_{L,L'}\big[\rho(L, L')\,\sigma_{r(X,L)}\,\sigma_{r(X,L')}\big] \\
&= \bar{\rho}\,\big(E_L[\sigma_{r(X,L)}]\big)^2 \\
&\le \bar{\rho}\, E_L\big[\sigma_{r(X,L)}^2\big],
\end{align*}
where the last equality uses (3.15) together with $E_{L,L'}[\sigma_{r(X,L)}\sigma_{r(X,L')}] = (E_L[\sigma_{r(X,L)}])^2$ for i.i.d. $L, L'$, and the final inequality is Jensen's inequality.

Using the usual definition of the variance of a random variable, write
\begin{align*}
E_L\big[\sigma_{r(X,L)}^2\big] &= E_L\Big[E_X\big(r(X, L)^2\big) - \big(E_X(r(X, L))\big)^2\Big] \\
&= E_L\big[E_X(r(X, L)^2)\big] - E_L\Big[\big(E_X(r(X, L))\big)^2\Big] \\
&\le 1 - \big(E_L[E_X(r(X, L))]\big)^2 \\
&= 1 - s(F)^2,
\end{align*}
where the inequality holds because $r(X, L)^2 \le 1$ and, by Jensen's inequality, $E_L[(E_X(r(X, L)))^2] \ge (E_L E_X(r(X, L)))^2$.

Lastly, we combine (3.17) with the resulting upper bound $\operatorname{var}_X(m(X)) \le \bar{\rho}\,(1 - s(F)^2)$ to conclude the proof.

The lower the correlation, the less interdependent the trees are. Therefore it is less likely for a collection of trees to overfit in the same way, and any individual tree that overfits will be counterbalanced by the other trees. Breiman suspects that the bound tends to be very loose, but it fulfills a valuable function by showing which quantities the generalization error depends on. This concludes the final chapter on random forests.
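To see how loose the bound typically is, one can plug empirical estimates of $s(F)$ and $\bar{\rho}$ into (3.16). The sketch below is an illustration under the assumption that a raw-margin matrix (as in the earlier sketch) is available; the weighting of (3.15) is approximated over all pairs of distinct trees, and $s(F) > 0$ is assumed so that the bound is meaningful.

```python
# Sketch: plug-in estimate of the bound (3.16) from a raw-margin matrix
# `raw` of shape (N_trees, n_samples), e.g. as returned by raw_margins above.
import numpy as np

def breiman_bound(raw):
    s = raw.mean()                                       # estimate of s(F) = E_X m(X); assumed > 0
    centered = raw - raw.mean(axis=1, keepdims=True)     # subtract E_X r(X, l_k) per tree
    cov = centered @ centered.T / raw.shape[1]           # pairwise cov_X(r(X, l_k), r(X, l_k'))
    sigma = np.sqrt(np.diag(cov))                        # per-tree standard deviations
    off_diag = ~np.eye(len(sigma), dtype=bool)           # exclude pairing a tree with itself
    rho_bar = cov[off_diag].sum() / np.outer(sigma, sigma)[off_diag].sum()  # plug-in (3.15)
    return rho_bar * (1 - s**2) / s**2                   # right-hand side of (3.16)
```

On real data the resulting number often exceeds the observed test error considerably, in line with the remark above that the bound is mainly of qualitative value.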


4 Conclusion

We have set up a framework for the analysis of statistical classification problems and examined tree classifiers and random forests. Working in a Bayesian setup, the methods studied here attempt to minimize the probability of misclassification.

Tree classifiers are a simple classification tool that exhibits arbitrarily low bias, but at the cost of high variance. We have covered pruning of tree classifiers, a major field of research that also aims to reduce the variance associated with single trees. Many more pruning methods than those given here have been developed, and this field has been studied thoroughly, both empirically and with mathematical rigour [6][7][12][13].

Random forests are an effective classification tool that avoids overfitting through the Law of Large Numbers. They are a supervised learning algorithm which generalizes from a set of training data with known class labels. This distinguishes them from unsupervised learning models, where there are no labelled data and the algorithm finds hidden structure in the space of objects $\Omega$. By considering a random subset of the feature space at each split, the dependence between individual trees is reduced (while trees are representatively sampled from the space of classifiers $\mathcal{T}$) and computational resources are saved. The dependence between trees is further decreased by bootstrapping the training data, which also reduces the variance of the random forest. A minimal sketch of this recipe is given below.
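The following sketch combines both sources of randomization; it assumes scikit-learn is available and is meant as an illustration rather than a reference implementation (the thesis itself contains no code). Class labels are assumed to be integers $0, \dots, C-1$.

```python
# Sketch of the random forest recipe: bootstrap the training set for every
# tree and restrict each split to a random feature subset (max_features).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)                    # bootstrap sample l_k
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (n_trees, n_samples)
    # plurality vote over the ensemble, as in (3.5)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```

The choice max_features="sqrt" mirrors the common heuristic of considering roughly the square root of the number of features at each split.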

We have seen that the generalization error of a random forest converges a.s. to a limit as the number of trees increases, due to the Law of Large Numbers. Furthermore, there is an upper bound for the generalization error which depends on the strength of the individual trees and the dependence between them. The challenge is to induce the right amount of randomization such that the dependence decreases while the strength does not decrease significantly.

When describing the convergence of random forests due to the Law of Large Numbers we have run into the difficulty that there is no norm given on the space T of classifiers which does not strip away most of the information about a given classifier. An interesting question for further research may be whether such a norm can be constructed and whether it would yield a proof for the reduction in bias and variance that random forests achieve. As Breiman remarks, the mechanism that reduces bias is poorly understood [5]. A better mathematical understanding could shed light on the discussion about the workings of random forests.


5 Popular summary

The title page of this research thesis shows an example of a classification problem. There are a number of animals and they have to be classified into mammals and non-mammals. However, we cannot observe which class each animal actually belongs to, so we have to estimate whether it is a mammal based on certain features it has. For example, all cold-blooded animals are non-mammals. But among the warm-blooded animals, all the ones that give birth are mammals and the ones that don't are not. When we write this structure of classification up, we create what is called a classification tree. It has a root node at the top where all the examples to classify start. Then they descend down the tree, taking a direction at each split until they end up at a ‘leaf node’. Leaf nodes are associated with a class.

The features don’t always give enough information to classify examples with a hundred percent accuracy. Instead, they give us evidence about the class of an example and in the end there is a probability that the classification is correct. In this research thesis we operate in a Bayesian context which means the aim is to minimize the probability of misclassification.

The construction of a tree classifier is a widely discussed topic in the field of machine learning. It generally starts with a number of labelled examples for which we already know the class. First, we create a root node. Then we select in which way to split the examples. The best split is the one that tells us the most about the class of the objects. There are several ways to measure how much we know about the class; one straightforward way is the probability of misclassification, i.e. if we stopped the tree construction after the present split, what share of the examples we would classify correctly if we assigned to each example at each node the class that is most common at that node. The tree construction stops when all examples at a node have the same class or some other stopping criterion is satisfied.

Tree classifiers are prone to errors because they tend to find noise in the training set and see a pattern in it. To fix this, one can create multiple different trees and have them vote on which class to assign to an example. Because they all make mistakes in different ways, individual mistakes get averaged out and the actual patterns that determine the class prevail. To create different trees we can randomize which training data are used to build the trees and which splits are available to select at each node. The random creation of multiple trees is called a random forest.

In the last part of the thesis we prove some results about random forests. In particular, we show that their probability of misclassification has an upper bound. Naturally this bound will be lower when the individual trees are better classifiers. However, it is also lower when the trees are less similar to one another. This is because they then don't make the same errors. There are some open research questions about why random forests perform as well as they do.


Bibliography

[1] B J K Kleijn. Bayesian Statistics, Lecture Notes 2015, Korteweg-de-Vries Institute for Mathematics, 2015.

[2] H R Simpson and K Jackson, Process Synchronisation in MASCOT, The Computer Journal, 22, 332, 1979.

[3] Y Amit and D Geman, Shape quantization and recognition with randomized trees, Neural computation, 9(7), 1545-1588, 1997.

[4] B D Ripley, Pattern recognition and neural networks, Cambridge university press (2007), in particular pp. 221-228.

[5] L Breiman, Random forests. Machine learning, 45(1), 5-32, 2001.

[6] P-N Tan, M Steinbach and V Kumar, Introduction to Data Mining, Boston: Pearson Addison Wesley, 145-195, 2006.

[7] J Quinlan, R Rivest, Inferring decision trees using the minimum description length principle. Information and computation, 80(3), 227-248, 1989.

[8] S Fortman-Roe, Understanding the Bias-Variance Tradeoff, http://scott.fortmann-roe.com/docs/BiasVariance.html, 2012.

[9] L Breiman, J Friedman, C J Stone, R A Olshen. Classification and regression trees, CRC press, 1984.

[10] O D Trier, A K Jain and T Taxt, Feature extraction methods for character recognition: a survey, Pattern Recognition, 29(4), 641-662, 1996.

[11] B Efron. The jackknife, the bootstrap and other resampling plans. Philadelphia: Society for industrial and applied mathematics, 38, 1982.

[12] L A Breslow, D W Aha. Simplifying decision trees: A survey, The Knowledge Engineering Review, 12(01), 1-40, 1997.

[13] F Esposito, D Malerba, G Semeraro and J Kay, A comparative analysis of methods for pruning decision trees, IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(5), 476-491, 1997.
