Date: Mar 07, 2014

Model-based clustering

We shall apply model-based clustering on the Old Faithful Geyser data set. This data set is about a geyser situated in the Yellowstone National Park in the USA:

It erupts intermittently. Careful record has been kept of duration of each eruption as well as that of the waiting period preceding it. This data set is part of R.


dim(faithful)
names(faithful)

The two variables store the durations in minutes.


plot(faithful)

The plot reveals two clusters. Apparently longer eruptions follow longer waits. We shall see if model-based clustering also says the same thing or not.


library(mclust)

The main work-horse is the Mclust function:


faithfulMclust =  Mclust(faithful,modelNames="EEE",G=2)

We are supplying the following specifications:

The name of the data set: faithful,
The type of clusters: EEE. We shall explain this soon.
The number of clusters: G=2.

The Mclust function allows only (multi)normal model. So the only freedom for the user to choose the type of clusters lies in specifying the covariance matrices, which controls the elliptic spread of the data. The following table (taken from the vignette of the mclust package) gives the complete list.

Let's peep inside the object returned by Mclust:


summary(faithfulMclust)

Unfortunately, it is not very informative. Here is a more verbose version:


summary(faithfulMclust,param=T)

Of course, we are naturally interested in which case is in which cluster. This information must be there somewhere. Let's explore:


names(faithfulMclust)

The field called classification looks promising:


faithfulMclust$class

Unlike other clustering/classification techniques, model-based clustering allows clusters to overlap. So there is obviously uncertainty related to the clustering performed using this method. This is stored as a number between 0 and 1 in the uncertainty field:


faithfulMclust$uncer

Remember that we had to specify three things: the data set, the type of clusters required, and the number of clusters. Only the first of these is compulsory. Mclust can figure out the rest, if unspecified. Let's allow Mclust to choose the optimal number of classes.


faithfulMclust =  Mclust(faithful,modelNames="EEE")

The summary is similar to what we had last time, except that Mclust thinks that 3 clusters are better than 2:


summary(faithfulMclust,param=T)

Let's make a plot:


plot(faithfulMclust)

The plot actually consists of 4 plots. The first plot shows how Mclust chooses the optimal number of clusters. It simply tries out model-based clustering for cluster numbers ranging from 1 to 9, and chooses the one with the maximum Bayesian Information Criterion (BIC) value:

max log-likelihood - log(n) number of free parameters estimated.

The first term measures how good the fit is, while the second term measures how expensive it is (amount of data, and number of estimations needed). The first plot shows BIC against different cluster numbers.

The next plot is what we were yearning to see: the clusters shown in different colours.

The third plot shows the levels of uncertainties. The big, black dots correspond to the points with a high level of uncertainty.

The last plot shows the contour of the fitted mixture density.

The next invocation of Mclust supplies even less information: just the data set. Here Mclust is supposed to choose the best possible cluster shapes/sizes from the list given earlier.


faithfulMclust =  Mclust(faithful)

What has Mclust chosen for us? Let's see:


summary(faithfulMclust)

Well, it has chosen EEE with 3 clusters.

Playing with copula in R

There are a number of packages in R to work with copula. We shall use the copula package:


library(copula)

What does this package allow us to do? A short answer is:

Creating a copula (from a fixed list)
Finding cdf, density (if it exists) of the copula
Generating random numbers from a copula
Making new multivariate distributions by combining user-specified marginal distributions with a user-specified copula
Working with such distributions (fitting, finding cdf, pdf, generating random numbers)

Creating a copula

The following line creates a copula from bivariate normal distribution with correlation 0.9. Notice that the choice of mean vector and the variances do not influence the copula (why?).


cop = normalCopula(dim = 2, param=0.9)

cop

You can extract the covariance matrix (useful for checking):


getSigma(cop)

Mathematically, a copula is a cdf. But R likes to think of a copula as an object. Thus R talks about the cdf of a copula:


pCopula(c(0.5,0.5),cop)

It is a good idea to check that this is correct. It should be

H(Φ^-1 (x) , Φ^-1 (y)),

where H is the cdf of the bivariate normal distribution. If you are not sure where it came from let me remind you that if H(x,y) is a cdf with continuous marginals F(x) and G(y) then the copula underlying H is

H ( F^- (x) , G^- (y) ).

The qnorm function of R can already compute Φ^-1. We shall use the mvtnorm package to compute the bivariate normal density H:


library(mvtnorm)

covMat = getSigma(cop)

pmvnorm(qnorm(c(0.5,0.5)),sigma=covMat)

Indeed, we get the same answer.

Let's compute the pdf. The function dCopula extracts the pdf (if i exists) of a copula object:


dCopula(c(0.5,0.5),cop)

Checking this is similar to checking the cdf: Just differentiate

H(Φ^-1 (x) , Φ^-1 (y)),

partially w.r.t. x and then w.r.t. y to get

h(Φ^-1 (x) , Φ^-1 (y)) / [φ ( Φ^-1 (x) ) φ ( Φ^-1 (y) )],

where h(x,y) is the bivariate normal pdf, and φ is N(0,1) pdf. We check using the mvtnorm package:


dmvnorm(qnorm(c(0.5,0.5)),sigma=covMat)/
   dnorm(qnorm(0.5))²

Plotting the copula (or rather its cdf) is a good way to visualise a bivariate copula:


persp(cop,pCopula,xlim=c(-1,2),ylim=c(-1,2))

Of course, it was a stupid idea to plot a copula outside the [0,1]×[0,1] box. We did it just to show that R considers a copula as a function over this box only. So we can safely avoid specifying the plotting region:


persp(cop,pCopula)

It's a pity that we cannot move this 3D plot interactively with the mouse. But we can move it programmatically:



persp(cop,pCopula,
      theta=60,phi=45,
      xlab="x",ylab="y",
      zlab="Copula CDF")

Getting a good combination of theta and phi is often tricky.

A more interesting (and informative) plot is that of the pdf:


persp(cop,dCopula,
      theta=60,phi=45,
      xlab="x",ylab="y",
      zlab="Copula density")

Next we shall head towards interactive 3D plots. The rgl package will make such plots, but it needs us to supply the heights of the surface over a grid of points. Fortunately, the persp function returns these:


pts = persp(cop,dCopula)

When you are not sure about the structure of an R object the str function is a good thing to try:


str(pts)

It shows that pts is a list of x values (a vector), y values (another vector), and z values (a matrix). Armed with these we load the rgl package:


library(rgl)

First we need to open a window for drawing 3D. It is different from the traditional R graphics window:


open3d()

A small blank window will pop up. Let's add the surface using the persp3d function:


with(pts,persp3d(x,y,z,color='grey'))

Not bad! Try dragging the surface with your mouse. However, the surface is too shiny. This looks particularly ugly in a print out. The shininess if caused by the default light used by rgl. It is like a point source of light like an electric bulb. We prefer something like a fluorescent light, more diffuse. First let's turn off the existing light:


rgl.pop('lights')

The surface is now pitch dark. Next add a "soft" light:


rgl.light(specular='black')

Each light source has a specular aspect (like a bright point source) as well as a diffuse aspect (soft glow). The specular='black' turns of the specular part. Now the surface will seem less shiny. It is also possible to choose the surface material using the material3d function.

Plotting the contour of a surface is another way to visualise it.


contour(cop,dCopula)

A contour plot can reveal curved ridges that may not be readily visible in an interactive 3D plot.

Next let's generate some random numbers from a copula:


x = rCopula(100,cop)
dim(x)

Here is a scatterplot:


plot(x,xlim=c(0,1),ylim=c(0,1))

We could have done the same simulation even without explicitly using a copula, because here the copula is very simple:


getSigma(cop)
y = rmvnorm(100,sigma=getSigma(cop))

Now just "fold back" the marginals:


u = pnorm(y)

and plot:


plot(u)

Different types of copula

R allows different types of copulas: Normal, t and Archimedean. For both normal and t-copulas we have a rich choice of correlation matrices. They may be

"unstructured" (general),
"exchangeable" (all off-diagonals equal),
"AR(1)" (all super diagonal elements are equal, the next super diagonals are their squares, and so on),
"Toeplitz" (a toeplitz matrix).

The different forms required different number of parameters. For example, both "exchangeable" and "AR(1)" require 1, while "unstructured" requires d(d-1)/2, where d is the dimension.

The dispstr option is used to choose the structure:


cop2 = normalCopula(dim = 3, param=c(0.5,0.6,0.7), dispstr="un")

It's always a good idea to check the correlation matrix:


getSigma(cop2)

Here are a few more:


cop3 = normalCopula(dim = 3, param=c(0.5), dispstr="ar1")
getSigma(cop3)

The following multi-panel plot will compare the pdf's of these copulas using contours:


oldpar = par(mfrow=c(2,2))
contour(cop,dCopula)
contour(cop2,dCopula)
contour(cop3,dCopula)
par(oldpar)

A t-copula is the copula underlying a multivariate t-distribution.


cop4 = tCopula(param=0.8,dim=2,df=5)

Here again we can choose any of the above structures for the correlation matrix. The default (used above) is "exchangeable".

R provides many useful families of Archimedean copulas, Clayton copula being just one of them:


cop5 = claytonCopula(0.3, dim=2)

Here is the definition of the Clayton copula (and its generator) from Nelsen's book:

Clayton copula

Combining marginals with a copula

The mvdc function combines marginals with a copula. Below we create a bivariate distribution by plugging normal marginals into the Clayton copula just now created:


mult1 = mvdc(
  copula=cop5,
  margins=c('norm','norm'),
  paramMargins=list(
    list(mean=1,sd=1),
    list(mean=0,sd=2)
  )
)

Now mult1 is a bivariate distribution. We can generate data from it:


data = rMvdc(mult1, 100)

plot(data)

Just check the marginals:


hist(data[,1])

Let's plug normal, gamma and Cauchy marginals into a Clayton copula.

We need to know the parameters (in R's notation) for the marginals:


formals(dcauchy)


args(dcauchy)

Similarly,


formals(dgamma)
args(dgamma)

Now let's create the trivariate distribution:


mult2 = mvdc(
  copula=claytonCopula(0.3,dim=3),
  margins=c('norm','gamma','cauchy'),
  paramMargins=list(
    list(mean=1,sd=1),
    list(shape=2,rate=0.5),
    list(location=1,scale=1)
  )
)

We can generate data from it:


someData = rMvdc(50,mult2)

plot3d(someData,ty='s',col='grey')

The last command uses the rgl package to make a 3D scatterplot, where each point is shown as a sphere (ty='s').

That's all for this lab session. We shall learn about fitting a copula later.

Comment Box is loading comments...