Date: Jan 31, 2014

Bayesian classification

For a normal model (homoscedastic/heteroscedastic) the R functions lda and qda that we had used earlier can perform Bayesian classification as well. Let's apply this on our familiar banknote dataset.


x = read.table('banknote1.txt',head=T)
names(x)

This is the same data set , but the last column is given as strings ("genuine" and "forged" instead of numbers). This can be seen from the output of the following command.


x [ c(1:3,101:103) , ]

Output from R

    Length  Left Right Bottom  Top Diagonal       Y
1    214.8 131.0 131.1    9.0  9.7    141.0 genuine
2    214.6 129.7 129.7    8.1  9.5    141.7 genuine
3    214.8 129.7 129.7    8.7  9.6    142.2 genuine
101  214.4 130.1 130.3    9.7 11.7    139.8  forged
102  214.9 130.5 130.2   11.0 11.5    139.5  forged
103  214.9 130.3 130.1    8.7 11.7    140.2  forged

Now we want to invoke the lda function from the MASS library. So we load it.


library(MASS)

Next we have to specify the prior. Suppose that we want to assign prior probability 0.1 to "forged" and 0.9 to "genuine" (ie, we believe that 10% of all banknotes of this particular denomination are forged). So the prior is a vector with two components, 0.1 and 0.9. The question is should we write c(0.1,0.9) or c(0.9,0.1). For this need to know which come first inside R: "genuine" or "forged"? The answer is to make a frequency distribution of this variable:


with(x, table(Y))

The output shows that, inside R, "forged" comes before "genuine". So we should use the prior c(0.1,0.9).

Output from R

Y
 forged genuine 
    100     100

So the lda function may be called like this:


fit1 = lda(Y~.,data=x, prior=c(0.1,0.9))
fit1


predict(fit1,list(Length=215.0,
                  Left=130.2,
                  Right=130.1,
                  Bottom=9.5,
                  Top=10.8,
                  Diagonal=140.2))

Among other things, we get the posterior probabilities:

Output from R

$posterior
     forged    genuine
1 0.9577575 0.04224245

Very similar commands work for qda as well:


fit2 = qda(Y~.,data=x, prior=c(0.1,0.9))
fit2
predict(fit2,list(Length=215.0,
                  Left=130.2,
                  Right=130.1,
                  Bottom=9.5,
                  Top=10.8,
                  Diagonal=140.2))

C&RT

We shall first work with the toy problem from last class. For this we generate data to get the pattern:

We need to sprinkle random data into each rectangle. The following function will help:


dataInRect = function(x1,y1,dx,dy,howMany,whichClass) {
   cbind(
         runif(howMany, min=x1+5,max=x1+dx-5),
         runif(howMany, min=y1+5,max=y1+dy-5),
         rep(whichClass, howMany)
         )
}

Now we generate the data:


genData = function() {
  tmp = data.frame(rbind(
    dataInRect(300,0,80,200,25,1),
    dataInRect(450,0,50,120,12,1),
    dataInRect(380,0,70,120,11,2),
    dataInRect(380,150,120,30,9,2),
    dataInRect(380,120,120,30,15,3)))

    names(tmp)=c('X','Y','class')
    invisible(tmp)
}
dat = genData()

Let's plot the data.


with(dat, plot(X , Y ,
               col = class , pch=20 , cex=2))
points(400,80,pch='x',cex=2,col='magenta') #A new point to be classified

R chooses the colours for us. This is cool, but it is difficult to understand which colour is for which class. So it is usually better to assign our own colours:


mycol = c('red','green','blue')
with(dat, plot(X,Y,col=mycol[class],pch=20,cex=2))
points(400,80,pch='x',cex=2,col='magenta') #A new point to be classified

Now we shall apply C&RT. First we show the simplest usage:


library(rpart)
tr=rpart(as.factor(class) ~ X + Y, data=dat)

If you try to print the tree tr like

tr

the output looks pretty confusing:

Output from R

n= 72 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 72 35 1 (0.5138889 0.2777778 0.2083333)  
   2) X< 380.2611 25  0 1 (1.0000000 0.0000000 0.0000000) *
   3) X>=380.2611 47 27 2 (0.2553191 0.4255319 0.3191489)  
     6) Y< 118.2528 23 11 1 (0.5217391 0.4782609 0.0000000)  
      12) X>=448.9073 12  0 1 (1.0000000 0.0000000 0.0000000) *
      13) X< 448.9073 11  0 2 (0.0000000 1.0000000 0.0000000) *
     7) Y>=118.2528 24  9 3 (0.0000000 0.3750000 0.6250000)  
      14) Y>=150.863 9  0 2 (0.0000000 1.0000000 0.0000000) *
      15) Y< 150.863 15  0 3 (0.0000000 0.0000000 1.0000000) *

A more legible output may be obtained by plotting it as a tree:


plot(tr)
text(tr)

Depending on your system, some of the labels may be clipped off. Then you need to set your margins appropriately:


oldpar = par(mar = rep(0.1, 4))
plot(tr)
text(tr)

Here are some things to experiment with:


plot(tr , branch = 0)
text(tr)

plot(tr , branch = 0.3)
text(tr)

Don't forget to restore the margins to their original value before plotting something else.


par(oldpar)

Let's predict the class of the new case:


predict(tr, list(X=400,Y=80))

A closer look

We talked about the GROW and PRUNE stages of C&RT. What you saw just now is the outcome of the GROW step. R will never prune a tree until you tell it to do so.

For our toy problem we had listed all the subtrees in the last class and had computed

f_α (T) = R(T) + α |T|

for each. These we linear functions of α. We shall now plot them using R.


#png('cartplot.png')
plot(0,xlim=c(0,20),ylim=c(0,100),ty='n',
     xlab=expression(alpha),
     ylab=expression(f[alpha](T[i])))


abline(a=0,b=5,col='red')


abline(a=9,b=4,col='blue')
abline(a=11,b=4,col='green')
abline(a=20,b=3,col='magenta')
abline(a=27,b=2,col='yellow')
abline(a=35,b=1,col='black')

Let's draw vertical line through the "knee".


abline(v=35/4,lwd=3,lty=3)

Too many lines! A legend is what we need:


legend("topleft",
       col=c('red','green','blue','magenta','yellow','black'),
       lwd=1,
       legend=c(
         expression(T[1]),
         expression(T[2]),
         expression(T[3]),
         expression(T[4]),
         expression(T[5]),
         expression(T[6])))

If you are saving this in a file, then don't forget to uncomment the following line.


#dev.off()

So we see that α = 35/4 is the value of α at the knee. We obtained this by brute force. Now let us see how we can get the same info from R. Invoke the rpart function again, but now with the additional control parameter that asks R to explore all values of α>0. The name cp is what R uses to mean the minimum permissible value of α.


tr=rpart(as.factor(class) ~ X + Y, data=dat,
         control=rpart.control(cp=0))

Now let's find the "knee" (or "knee"'s in a more general problem):


printcp(tr)

Output from R

Root node error: 35/72 = 0.48611

n= 72 

    CP nsplit rel error  xerror     xstd
1 0.25      0         1 1.00000 0.121172
2 0.00      4         0 0.17143 0.067006

Look at the CP column. It has two entries: 0.25 and 0.00. It means that there are two "critical" values for α, the first is 0.00, which is the minimum value, and second is the "knee" we were looking for. But we know that the knee should be at α = 35/4. Why is it reported as 0.25 then? This is an idiosyncrasy of R: it divides all the columns (except nsplit) by the number of misclassifications at the root node, which is 35 in our example. Thus α values are 0 and 35× 0.25 = 35/4 as we expected.

The nsplit column tells us the number of splits in the best tree starting from that value of α. Thus, 4 corresponds to the full tree, and 0 corresponds to just the root node.

The rel error gives the resubstitution errors (of course scaled down by 35, hence called "relative" error). We shall discuss the other two columns soon. But before that let us see how to prune the tree:


tr1 = prune(tr,cp=0.3)
plot(tr1) #Oops!

Of course, here the pruned tree is just the root node, so R refuses to plot it!

Banknotes again

Let us apply C&RT to the banknote problem:


x = read.table('banknote1.txt',head=T)
tr=rpart(Y ~ . , data=x,
         xval=0 , control=rpart.control(cp=0))


oldpar = par(mar = rep(0.1,4))
plot(tr,branch=0)
text(tr)
par(oldpar)

Here we get a very simple tree! Just compare this with the LDA classifier. Indeed, one important reason behind th popularity of C&RT is the simplicity and easy interpretability of the classifers it produces.

In both our examples the trees turned out to be hopelessly simple, allowing us no chance to prune the tree in a nontrivial fashion. The next example (taken from the vignette of the rpart package) is more nontrivial.

Prostage cancer data

The data set comes with the rpart package. Load it using


data(stagec)

Or you may download it here, put it in your working directory and load it with


load('stagec.rda') #Creates a new variable stagec
names(stagec)

Read about the data set in the package vignette:


vignette(rpart)

Here the relevant excerpts:


progstat = factor(stagec$pgstat, levels = 0:1, labels = c("No", "Prog"))
cfit = rpart(progstat ~ age + eet + g2 + grade + gleason + ploidy,
             data = stagec,
             control=rpart.control(cp=0))

Let's draw the tree:


par(mar = rep(0.1, 4))
plot(cfit)
text(cfit)
par(oldpar)

Now for the "knee"s:


printcp(cfit)

Output from R

Root node error: 54/146 = 0.36986

n= 146 

         CP nsplit rel error  xerror    xstd
1 0.1049383      0   1.00000 1.00000 0.10802
2 0.0555556      3   0.68519 1.05556 0.10916
3 0.0277778      4   0.62963 1.01852 0.10843
4 0.0185185      6   0.57407 1.01852 0.10843
5 0.0092593      7   0.55556 0.98148 0.10760
6 0.0000000      9   0.53704 0.96296 0.10715

We know how to interpret the first three columns. Remember that here the scaling factor is 54, the number of misclassifications in the root node. Now we shall learn about the last two columns. In order to find the best value of α R automatically performs 10-fold CV for each of the 6 best subtrees corresponding to the α values in the intervals between the "knee"s. Here is how it proceeds:

For each such interval
   R picks a represenative value of α.
   Then R splits the data randomly into 10 parts.
   For each of 10 parts
     R applies C&RT to GROW a tree using the data minus that part.
     R applies the chosen value of α to PRUNE the tree.
     R uses this tree to predict the left out part.
     R records the number of misclassifications made.
   End for
   Now R has 10 misclassifications one for each part. 
   These are averaged and average is called xerror.
   The std error of this xerror is called xstd.
End for
For each interval we thus have one xerror and one xstd.

These are what you see in the last column. The idea is to choose the interval giving the least xerror. However, CV involves randomisation, and so minor differences in xerror values may be attributed to chance. That's where xstd comes into picture. A rule of thumb is to consider the minimum xerror, and to consider any xerror falling as "basically the same as" the minimum. Choose the largest α value among these (largest α means smallest tree). This procedure is simplied by the plotcp function:


plotcp(cfit)

Comment Box is loading comments...