Date: Jan 29, 2014

[[Update:[Wed Jan 31 IST 2018]]

A more formal look at cross-validation

We have already seen that CV is a way to assess the performance of a supervised classification method. Given a supervised data set and a supervised classification method, it produces a number that measures how badly the method performs. Here it is important to distinguish between a supervised classification method and the classifier it produces. For example Fisher's LDA is a supervised classification method. For a particular data set it may produce a classifier like
"Put a case in class 1 if 2.353 X1-4.867 X2 < 3.509."
This is a classifier. We can think of LDA as a function of the data. This particular classifier is the value produced by this function for a particular data set. Thus for a supervised data set Train, we have a classifier LDA( Train ) obtained by LDA to it. Now given another data set Data2 we can assess the performance of LDA( Train ) on Test. Suppose we do this by the proportion of wrong classifications:
Loss ( LDA ( Train ) , Test ) .
If we reuse Train to serve as Test then we get
Loss ( LDA ( Train ) , Train ) .
This is called resubstitution error, and is generally an under-estimate of the true error rate. Judging a classification method based on the resubstitution error usually leads to overfitting. Cross-validation is one attempt to counter this. Here we randomly split the supervised data into k parts (typically k=5 or 10 leading to 5-fold or 10-fold CV):
Data1 , ... , Datak .
Each part should represent the entire data set. Thus simple random sampking of cases is usually not a good idea. Now the CV error rate is defined as
mean of Loss ( LDA ( Data-i ) , Datai ) , i=1 , ... , k.
It must be bourne in mind that CV is often not enough to remove overfitting.

Bayesian Classification

In elementary probability classes we solve problems like the following.
A class consists of 3 boys and 7 girls. Among the 3 boys only 1 wears glasses. Among the 7 girls there are exactly 2 with glasses. A student is selected at random. If the selected student wears glasses, what is the chance that the student is a boy?
Such a problem requires Bayes' Theorem, and the computation may be elegantly presented in the form of a diagram:

Bayes diagram
There are two paths (red and green) to the what we have observed. Their probabilities are
P(red path) = 3/10 × 1/3 = 1/10,
P(green path) = 7/10 × 2/7 = 2/10.
Given that we hae taken one of these paths, we want to know the conditional probability that we have taken the red path:
P(red) / [ P(red) + P(green) ].
Think of this as a classification problem. The students are the cases to be classified into two classes Boy and Girl. We have only one measurement "Does the student wear glasses?". We can compute the posterior probabilities of the student being a boy and being a girl. Then classifiy the student in the clss with the higher posterior. In fact, the posteriors also reflect how confident in our decision.

In the banknote example we can have a similar (infinite) diagram:

Bayes diagram
Here also we can proceed in a similar way.

This approach has one advantage: it can incorporate losses. To understand this consider a real life example of credit risk analysis. Before giving loan to an applicant a bank has to assess his credit-worthiness, ie, whether he is going to default or not. For this the bank makes lots of measurements about the applicant (monthly income, credit history, number of bank accounts, age, etc). Based on these measurements the appicant has to be classified into one of two classes: Good or Bad.

The bank will use its own past records to get a supervised data. Suppose that the bank has data about 10000 past cases where the loan was granted. Is it reasonable to assume that about half of these were Good the rest being Bad? No, then the bank would have been bankrupt by now! Only a very small number (say 50) of these cases will be Bad. If you apply any of the classifiers we have learned about, they will all suggest classifying every future case as Good, thinking that Bad is a rare event (with probability 0.005), and hence may be ignored. But that is not the right thing to be done. Those 50 cases are in fact the most important cases. They are the cases that has defeated the bank's already efficient fraud detection scheme. So if we want to improve the detection we must not ignore these! It is here that the concept of a loss function arises:
L ( verdict | truth ).
This measures the amount of loss incurred when you make a mistake. For example, we may take
L ( Bad | Good ) = 1,
L ( Good | Bad ) = 100,
which makes giving loan to a Bad guy a much more serious mistake than refusing loan to a Good guy. We of course define
L ( Bad | Bad ) = 0,
L ( Good | Good ) = 0.
Now instad of choosing the class with higher posterior probability we compute the posterior risk, which is another term for the expectation of the loss function under the posterior distribution. Let us take the example of students to illustrate. Suppose we have the loss function
L ( Boy | Girl ) = 1,
L ( Girl | Boy ) = 10,
L ( Girl | Girl ) = 0,
L ( Boy | Boy ) = 0.
Thus here mistaking a Boy for a Girl is a 10 times more serious mistake than the opposite mistake. As before we select a student at random and find that s/he wears glasses. We have two possible decisions: Boy and Girl. Let's compute the posterior risk for each. We had already computed the posterior probabilities:
P (Boy | Glasses ) = 1/3 and P ( Girl | Glasses ) = 2/3.
If we give the verdict "Boy", then the posterior risk is
L( Boy | Boy ) P( Boy | Glasses ) + L( Boy | Girl ) P( Girl | Glasses ) = 2/3.
Similarly, the posterior risk for the verdict "Girl" is
L( Girl | Boy ) P( Boy | Glasses ) + L( Girl | Girl ) P( Girl | Glasses ) = 10/3.
We see that the posterior risk is lower in the first case. So we give the verdict "Boy".

Caveat

Bayes classification, as well as other Bayesian methods, suffers from a serious drawback. One has to supply reasonable prior probabilities, and (optionally) a loss function. Though prior probabilities and loss functions are present in all problems in principle, they are usually very hard to quantify. The outcome of a Bayesian method is very sensitive wrt the choices of prior and loss functions, ie, even seemingly minor changes in these may drastically change the conclusion. This very drawback, however, makes Bayesian methods popular among dishonest practitioners of statistics. They use their own intuition and subjective judgements to draw a conclusion, and then back calculate priors and loss functions that would lead to those conclusions. Then they can pass off their dubious subjective judgements as objective findings supported by Bayesian computation. In this sense, Bayesian statistics often plays the role of computer horoscope, a scientific varnish on unscientific practices.

Classification and Regression Tree (C&RT)

Today we shall learn about a very versatile and popular supervised classification technique called C&RT. Unlike LDA and QDA it is a nonparametric method. We shall start with an example.

Example: Consider students seeking admission to ISI. The record for each student consists of 4 explanatory variables: Bachelor's degree grade, screening test score, main test score, interview score. Based on these we decide whether a student should be admitted or not (PASS or FAIL). As we all know the decision process used at ISI may be expressed as a tree.


ISI admission tree
This is called a (binary) classification tree. It puts each student in a class: PASS or FAIL. Each leaf node is labelled with a class name ("PASS" or "FAIL"). Each inner node is labelled with an inequality of the form
Xi < some number,
where Xi is one of the explanatory variables. If the inequality holds for a student we follow the left branch, else we follow the right branch. ◼

Each branch in the tree partitions the sample space. The following interactive animation will help to clarify the idea. Suppose we have c=3 classes and d=2 measurements. A typical scatterplot (with colour-coded classes) is shown in the pink box below. We shall keep on splitting the box with horizontal and vertical lines until each block is "pure" (ie, contains points of only one colour). This can be done in many ways. One way will be demonstrated if you repeatedly click the "Click" button.

Carefully notice how the binary tree to the left corresponds to the partitioning. Each leaf node corresponds to one block of the partition.

Misclassification rate

Suppose someone gives us a classification tree as above (with a,b,c,d replaced by actual numbers.) Based on a dataset how do we assess its performance? One way is to apply the tree to each case in the data and see what class it is assigned to by the tree. Since the dataset also tells us the true class for each case, we can check if the tree has made a mistake. Do this for all the cases and count the number (m, say) of mitakes. mistake. This is the number of misclassifications. Divide it by the total sample size to get the misclassification rate.

Using C&RTα

While we prefer trees with a low number misclassifications, we also want our tree to as small as possible. Large trees require us to find many cut-off points and usually tend to overfit the data. There is a trade-off between the size of a tree and its misclassification proportion. Larger the tree less is the misclassification, but too large a tree will produce overfit. C&RT is a family of modelling strategies to achieve a balance between size and misclassification proportion. Each member in the family is specified by a nonnegative index, alpha (also called a tuning parameter. It is just a fancy term, and has nothing to do with the parameters of a model.) We shall call the modelling strategy for a given value of α as C&RTα . Thus C&RTα produces a tree that minimises
number of misclassifications + α × #Leaf
For the ISI admission tree, for instance, #Leaf = 5.

The tuning parameter tries to penalise large trees. If α is large then C&RTα is more likely to produce smaller trees. It may not be obvious how to choose α. We shall later see how CV is used to choose a good value for α. But first let us learn how C&RTα works if some α>0 is given to us.

How C&RTα works

Suppose we are given a training sample of size n. Each record in the data is a d+1-tuple (X1 , ..., Xd , C). Here the X's are the measurements and C is the true class. The idea is that in future we shall have more cases for which only the X's will be known. Then we shall use our tree to predict their classes.

C&RTα effectively finds a good way to partition the space recursively, so that each block is as homogeneous as possible. This is done through two steps:

GROW

For d=2, the first split is either a horizontal or a vertical line through the entire sample space. So we have two things to determine in the first split: orientation (ie, horizontal/vertical) and where to draw the line. Notice that due to the discrete nature of the data points it is enough to consider only finitely many lines of either orientation. This is because, until a line crosses a sample point the misclassification proportion remains unchanged. In general there are n lines to consider along each of the d axes. For each of these nd lines, we compute the number of misclassifications. This is done as follows. Label each side of a line with a class using majority vote. Now count the number of points on either side whose classes does not match the assigned classes. Add the two numbers.

Choose a line with the minimum number of misclassifications. Resolve ties arbitrarily.

Now the sample space has been partitioned into two blocks. Each block has been labeled with a classname (eg, "PASS" or "FAIL" in the ISI example.) This partition corresponds to a tree with two leaf nodes and one inner node (the root). Each leaf node is labelled with a classname. The inner node is not labelled with an inequality. Note that the leaf nodes correspond to the blocks of the partition.

Carry out the above procedure separately for each of the blocks. Thus each of these blocks will produce two blocks and so on.

The next question now is: "When to stop?" The answer is: Do not split a block further if
either
the number of sample points in it is less than 5,
or
the block is homogeneous, ie all the points in it belongs to the same class.
The number 5 of course has no magic associated with it. You may choose any other threshold to decide when a block has too few points to warrant further partitioning. In fact, if you make this threshold equal to 1, all the blocks would be homogeneous (because any block with just one point is trivially homogeneous!)

At the end of GROW you are left with a tree that is usually pretty huge. Each leaf node (ie, each block of the partition) is either homogeneous or has too few points in it.

PRUNE

The tree obtained by GROW usually overfits the data. So we need to prune it. It is here that α makes its appearance. We want a tree that is not too large, and yet fits the data well. So we try to obtain a subtree T that minimimises
fα (T) = R(T) + α #Leaves(T) ,
where R(T) is the total number of misclassifications produced by T. It is instructive to find this minimiser (which depends on α) by listing all the subtrees in our toy example. Of course, this approach is prohibitively time consuming for a practical problem. The following slideshow does precisely this.
Slideshow
Click on next.
Thus we end up with the following linear functions in α:
fα ( T1 ) = 5 α
fα ( T2 ) = 11+ 4 α
fα ( T3 ) = 9 + 4 α
fα ( T4 ) = 20 + 3 α
fα ( T5 ) = 27 + 2 α
fα ( T6 ) = 35 + α
For any given α>0 we have to choose a Ti for which fα ( Ti ) is the minimum. The following plot will help here.

A close look at the plot shows that T1 is the optimal tree for α∈ ( 0 , α0 ] where α0 is shown using the thick dashed line. For a>α0 the best tree is T6, which is just the root node!

Of course, most real life examples are not as drastic as this. In general we get a finite sequence of values for α:
0 < α0 < α1 < ... < αk ,
for some k ∈ IN, and a corresponding sequence of nested subtrees of the initial huge tree:
T0 contains T1 contains T2 contains ... contains Tk contains Tk+1,
where T0 is the full tree, and Tk+1 is just the root node. For example, a typical case with k=2 is shown below.

For any value of α ∈ [0, α0 ) the best tree is T0. For α ∈ [ α0, α1 ) the best tree is T1, and so on. The final tree Tk+1 is optimal when α ∈ [ αk , ∞ ) .

As the total number of subtrees of the full tree is huge, finding this optimal sequence cannot be done by brute force search. Breiman et al devised a clever algorithm for this. This is the famous C&RT algorithm presented in their book with the same name. We shall not go into the details of the algorithm here, but simply treat is a blackbox that takes a supervised data set and an α>0 as its input, and produces a classification tree as its output. Think of this as a function
C&RTα (data) = α-optimal classification tree.
The next question is: how to choose α ?

The answer is: Cross-validation.

Cross-validation

We have already seen that CV is a way to assess the performance of a supervised classification method. Given a supervised data set and a supervised classification method, it produces a number that measures how badly the method performs. Here it is important to distinguish between a supervised classification method and the classifier it produces. Each particular tree produced by C&RT is a classifier, while for each given α , C&RTα is a classification method. CV is a way to assess the performance of a classification method, not that of a classifier.

Choosing a good value of α may be considered as choosing a good classification method among the set
{ C&RTα : α > 0 }.
For this purpose we can take a grid of values for α and assess each using CV, and choose the one that produces the least CV error rate.
Comment Box is loading comments...