[Update:[Mon Nov 03 IST 2014]]

Analysis of Discrete Data

M.Stat. II, 2014

According to ISI Brochure

syllbus

Reference books

Take your mouse over a date to see the reading assignments for that date. Clicking on a date will take you to additional material (if any) for that date.
July
Sun Mon Tue Wed Thu Fri Sat
1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31
August
Sun Mon Tue Wed Thu Fri Sat
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
September
Sun Mon Tue Wed Thu Fri Sat
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30
October
Sun Mon Tue Wed Thu Fri Sat
1 2 3 4
5 6 7 8 9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29 30 31
November
Sun Mon Tue Wed Thu Fri Sat
1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30
Legend
Introduction
Contingency tables
GLM
Logistic regression
Structural eqn model
Loglinear models
Midsemestral exams
Semestral exams

Additional material

An unusual experiment leading to categorical data (July 21)

Assessing performance of some statistical method by simulation (July 24)

First you write a function that takes the data set as input, carries out the statistical method and returns the performance criterion you are using to assess the method. In the following example we are using the simple CI formula for binomial proportion, and the performance criteria is the length of the interval.
howGood = function(nHead, n) {
  pHat = nHead/n
  seHat = sqrt(pHat*(1-pHat)/n)
  radius = 1.96*seHat
  upperLim = min(1,pHat + radius) 
  lowerLim = max(0,pHat - radius)
  return(upperLim - lowerLim)
}
Now take a true vlue for the parameter, geeraate lots of data sets and apply your function on each of the data sets. Find the mean value of the performance criterion.
  
trueP = 0.5
X = rbinom(1000,100,prob=trueP)
val = c()
for(i in 1:1000) val[i] = howGood(X[i],100)
mean(val)
Loops are always cumbersome to write and, in R, they also tend to be slow. A somewhat faster and cleaner alternative is to use the sapply function:
trueP = 0.5
X = rbinom(1000,100,prob=trueP)
val= sapply(X,howGood,n=100)
mean(val)
Whichever way you use the average performace criterion tells you how your statistical method performs for particular true value of the parameter you have chosen (0.5, in this example). To learn about the global behaviour throughout the entire paramer space, you need to take a grid of true values and repeat the entire exercise for each value. You can again use sapply, but here I have used an explicit loop.
performance= c()
truePList = seq(0.01,0.99,0.01)
for(i in 1:length(truePList)) {
  X = rbinom(1000,100,prob=truePList[i])
  val= sapply(X,howGood,n=100)
  performance[i] = mean(val)
}
plot(truePList,performance,ty='l')
This plot gives you a picture of how well your method performs. Such a plot by itself, however, is not of much value. When you have different methods, then comparing them by overlaying such plots. Suppose that another method produces a vector called performance2. Then the overlaying may be done as
lines(truePList,performance2)
Here is a bunch of binomial CI's (from Agresti's website). You may like to compare them using this technique.

Intro to working with graphical data in R (July 31)

Here is a transcipt of all the things we did in class today. Don't feel disheartened with yourself (or annoyed with me) if you could not follow everything. R is not difficult. But, like every other language, needs a little hands on practice. Carefully go through the commands given below. Actually execute them in R! R is available in the CSSC machines (4th floor, S N Bose Bhavan). The names in bold refer to functions provided by R. Experiment by modifying the lines to suit your need. Look up the R help by typing a question mark followed by the name of the function you want to help about.
Loading categorical data
frq = matrix(c(509,398,116,104),2,2) rownames(frq)=c("Female","Male") colnames(frq)=c("Yes","No/undecided") belief.data = read.table("belief.txt",head=T) belief.data = read.csv("belief.csv",head=T) ##Download the file afterlife.xls ##Open it and copy the data table (including the variabke names). #install.packages('psych') library(psych) belief.data = read.clipboard.tab()
Looking at categorical data
levels(belief.data[,1]) levels(belief.data[,2]) #oops! belief.data[,2] = as.factor(belief.data[,2]) tbl1 = table(belief.data) (tbl2 = addmargins(tbl1)) xtabs(~.,belief.data) DF = as.data.frame(UCBAdmissions) xtabs(Freq ~ Gender + Admit, DF) (tbl=xtabs(Freq ~ Gender + Dept + Admit, DF)) ftable(tbl)
Plotting (1D)
x = sample(c(1,2,3,4,5,6),100,replace=T) barplot(table(x)) barplot(table(x)/length(x)) x = rpois(100,lambda=2) old = par(mfrow=c(2,1)) barplot(table(x)/length(x)) barplot(dpois(0:max(x),lambda=2)) par(old)
Plotting (2D)
library(rgl) demo(rgl) barplot.3d = function(tbl) { m = nrow(tbl) n = ncol(tbl) scl = max(m,n)/max(tbl) cat("scl = ",scl,"\n") for(i in 1:m) { for(j in 1:n) { binplot.3d(c(i,i+0.9),c(j,j+0.9),scl*tbl[i,j]) } } } rgl.open() barplot.3d(tbl1) #install.packages('vcd') library(vcd) data("UCBAdmissions") x = aperm(UCBAdmissions, c(2, 1, 3)) dimnames(x)[[2]] = c("Yes", "No") names(dimnames(x)) = c("Sex", "Admit?", "Department") ftable(x) fourfold(margin.table(x, c(1, 2))) data("HairEyeColor") (tab = margin.table(HairEyeColor, c(2,1))) sieve(tab, sievetype = "expected", shade = TRUE) sieve(tab, shade = TRUE) (x = margin.table(HairEyeColor, c(1, 2))) assoc(x) assoc(x, main = "Relation between hair and eye color", shade = TRUE)
Testing
tbl2 x = c(509,398) n = c(625,502) prop.test(x, n) prop.test(tbl2[-3,1],tbl2[-3,3]) gamma.f = function(x) { m = nrow(x) n = ncol(x) res = matrix(0,m, n) for(i in 1:(m-1)) for(j in 1:(n-1)) res[i,j] = x[i,j]*sum(x[(i+1):m,(j+1):n]) C = sum(res) for(i in 1:(n-1)) for(j in 2:m) res[i,j] = x[i,j]*sum(x[(i+1):m,1:(j-1)]) D = sum(res) gamma = (C-D)/(C+D) list( gamma=gamma, C=C, D=D) } tbl1 gamma.f(tbl1)

Odds ratio and χ2test (Aug 04)

Working with odds ratio in R
#install.packages('vcd') #You need to do this only once for a machine.
library(vcd) #Once for each session

(lor  =  oddsratio(CoalMiners))

summary(lor)
     
confint(lor)
     
plot(lor,
     xlab = "Age Group",
     main = "Breathelessness and Wheeze in Coal Miners")
     
g  =  seq(25, 60, by = 5)
m  =  lm(lor ~ g + I(g^2))
lines(fitted(m), col = "red")
Association measures based on χ2
Our text book is strangely silent about various useful association measures based on χ2. Please read this document for a practical discussion of these. The following R example is based on the discussion there.
(x = matrix(c(65,35,25,25),2,2))
(y = matrix(c(75,25,25,25),2,2))
chisq.test(x,correct=F) #Don't do Yate's continuity correction
chisq.test(y,correct=F) 
chisq.test(4*x,correct=F)

assocstats(x) 
assocstats(y) 
assocstats(4*x) 
Simulating a 2×2 table under independence for multinomial scheme
val = numeric(1000)
for(i in 1:1000) {
   rw = sample(2,100,repl=T)
   cl = sample(2,100,repl=T)
   dat = data.frame(rw,cl)
   val[i] = oddsratio(table(dat))
}

hist(val)
qqnorm(val)

Testing association for ordinal contingency tables (Aug 07)

How categorisation reduces correlation
library(mvtnorm)
disp = matrix(c(1,0.9,0.9,1),2,2)
dat = rmvnorm(1000,sigma=disp)
f = function(x,howMany) {
  brX = quantile(x,seq(0,1,len=howMany))
  dX = cut(x,brX,include.lowest=T)
  scoreX = brX[-1]-diff(brX)/2
  scoreX[dX]
}
scr=apply(dat,2,f,howMany=30) #reducing howMany from 30 to 3 will reduce the correlation
cor(scr)
Using ridit scoring
ridit=function(x) {
  tx = table(x)/length(x)
  nCat = length(tx)
  sc=c(0,cumsum(tx)[-nCat])
  rid=sc+tx/2
  rid[x]
}
Understanding tetrachoric correlation
library(psych)
rho = 0.9
disp = matrix(c(1,rho,rho,1),2,2)
dat = rmvnorm(1000,sigma=disp)
tmp1=cut(dat[,1],br=c(-4,0,4))
tmp2=cut(dat[,2],br=c(-4,0,4))
xx = data.frame(tmp1,tmp2)
tetrachoric(table(xx))

Fisher's exact test (Aug 11)

Read about ABX tests from Wikipedia.

Here are some actual audio ABX tests that you can perform on yourself.

The command to perform Fisher's exact test in R on a contingency table x is

fisher.test(x)
If your table/freq is too large, then the exact test may prove too resource intensive. Then you can use
fisher.test(x,simul=T)
to use a Monte Carlo version, an exact test done approximately!!!

See this paper by Mehta and Patel for the network algorithm used for Fisher's exact test.

Multiway table and Simpson's paradox (Aug 14)

You need to download this data set first.
library(psych)
library(vcd)

dat = read.clipboard.tab()

dim(dat)
table(dat)

xtabs(~Death,dat)

xtabs(~Defendant,dat)

xtabs(~Victim,dat)

vic.def = xtabs(~Victim+Defendant,dat)
fourfold(vic.def)

oddsratio(vic.def)

oddsratio(aperm(table(dat),c(1,3,2)))

Simpson starts
oddsratio(xtabs(~Death+Defendant+Victim,dat))

oddsratio(xtabs(~Death+Victim,dat))

ft = xtabs(~.,dat)
mt = margin.table(ft,1:2)
(dp=ft[,,"Yes"]/mt)

plot(1,ylim=range(dp),xlim=c(0,3),ty='n')

text(2,dp["White","White"],'W')

points(2,dp["White","White"],cex=0.05*mt["White","White"])


text(1,dp["White","Black"],'W')
points(1,dp["White","Black"],cex=0.05*mt["White","Black"])

text(2,dp["Black","White"],'B')
points(2,dp["Black","White"],cex=0.05*mt["Black","White"])

text(1,dp["Black","Black"],'B')
points(1,dp["Black","Black"],cex=0.05*mt["Black","Black"])


lines(c(1,2), dp["Black",],lwd=3)

lines(c(1,2), dp["White",],lwd=3)

(vicTot=margin.table(ft,1))
(vicDth=margin.table(ft[,,"Yes"],1))

lines(c(1,2),vicDth/vicTot,lty=2,lwd=3)

Binary data: MCMC and probit etc(Aug 21)

Binary "contingency tables"
A relevant paper by Chen et al containing the Darwin Finch data set.
Probit analysis
Here is a typical victim of the analysis we are going to learn today. Click on the image for more details about cruelty on research animals.

Binary data: MCMC and probit etc(Aug 25)

Warm up with a hypothetical data set
The hypothetical poison data set used in class.
dat = read.csv('probdat.csv',head=T)
names(dat)
propDeath = with(dat,Death/Size)
with(dat,plot(Dose,propDeath))
with(dat,plot(Dose,qnorm(propDeath)))
with(dat,lm(qnorm(propDeath)~Dose))
Ordinal explanatory variable: Snoring data
This data set is from our text book, and you may download it snore.csv.
(dat = read.csv("snore.csv",head=T))
rowTot = apply(dat[,-1],1,sum)
prob = rowTot/sum(rowTot)

ridit=function(prob) {
  nCat = length(prob)
  sc=c(0,cumsum(prob)[-nCat])
  cat(sc)
  sc+prob/2
}

(x =  ridit(prob))

(propYes = dat[,2]/rowTot)
(probit = qnorm(propYes))
plot(x,probit)
(fit1 = lm(probit~x))

abline(fit1$coef)

(oddsYes = dat[,2]/dat[,3])
(logit = log(propYes))
plot(x,logit)
(fit2 = lm(logit~x))

abline(fit2$coef)
Horseshoe crab data set
The following photo taken from this web page shows a female horseshoe crab, its male mate and 4 satellite males.
Here is the data set.
dat = read.table("crab.txt",head=T)
dim(dat)
head(dat)
tail(dat)
with(dat,plot(W,Sa))
with(dat,lines(lowess(W,Sa),col='red',lwd=3))
(fit = glm(Sa ~ W,family="poisson",data=dat))
summary(fit)

with(dat,plot(W,Sa,col=S))
with(dat,lines(lowess(W,Sa),col='red',lwd=3))
(fit = glm(Sa ~ W+as.factor(S),family="poisson",data=dat))
summary(fit)

CMH tests(Sep 22)

Our textbook discusses only the 2×2×k case. You may read about the generalised case from the SAS documentation on generalised CMH test.

Multicategiry and ordinal logistic analysis (Oct 16)

allig = read.table('allig.txt',head=T)
#--------------------------------------
names(allig)
dim(allig)
#--------------------------------------
head(allig)
#--------------------------------------
library(nnet)
(fit = multinom(food ~ length, data=allig))
#--------------------------------------
pi.hat = fitted(fit)
pi.hat
#--------------------------------------
pi.hat %*% rep(1,3)
#--------------------------------------
plot(allig$length,pi.hat[,1],ylim=range(pi.hat),ty='l') #oops!
#--------------------------------------
ord = order(allig$length)
plot(allig$length[ord],pi.hat[ord,1],ylim=range(pi.hat),ty='l')
#--------------------------------------
lines(allig$length[ord],pi.hat[ord,2],col='red')
lines(allig$length[ord],pi.hat[ord,3],col='blue')
#--------------------------------------
allig$food
#--------------------------------------
relevel(allig$food,ref="O"))
#--------------------------------------
allig$food2 = relevel(allig$food,ref="O")
(fit2 = multinom(food2 ~ length, data=allig))
#---------Loading a data in dta format (used by STATA)
library(foreign)
#--------------------------
dat = read.dta("hsbdemo.dta")
#read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
#--------------------------
dim(dat)
names(dat)
head(dat)
#--------------------------
(fit = multinom(prog ~ ses + write, data = dat))
#--------------------------
summary(fit)
#--------------------------
names(summary(fit))
#--------------------------
(testStat = summary(fit)$coeff/summary(fit)$standard.errors)
#--------------------------
(1 - pnorm(abs(testStat))) * 2
#------------------------
frq = c(44,47,118,23,32,
18,28,86,39,48,
36,34,53,18,23,
12,18,62,45,51)

gender = c("F","M")

party = c("Dem","Rep")

ideology = 1:5
#--------------------------

dat = data.frame(expand.grid(ideology,party,gender),frq)

names(dat) = c("ideo","party","gender","frq")
#--------------------------

ftable(xtabs(frq~ideo+party+gender,dat))
#--------------------------
library(MASS)
fit = polr(ideo~party+gender,weight=frq,data=dat) #oops
#--------------------------
fit = polr(as.factor(ideo)~party+gender,weight=frq,data=dat)
(fit=polr(as.factor(ideo)~party,weight=frq,data=dat))
summary(fit)
#---------------------
confint(fit)



Structural Equation Modelling (Nov 03)

Today we shall fit a structural equation model (SEM) to a real life data collected by a wellknown company in Kolkata. The data were collected by using a questionnaire where most of the questions were to be answered in a 7-point Likert scale.

We first load the relevant packages.

library(sem)
library(psych)
Now load the data set.
dat = read.table("data.txt",head=TRUE)
In this example we shall focus on only the following items. We start by extracting the relvant columns of the data.

start = (1:ncol(dat))[colnames(dat)=="q116_a"]
end = (1:ncol(dat))[colnames(dat)=="q125_g"]

used = start:end;
Our aim is to fit the following SEM where "distinctiveness", "visibility", "understandability", "legitimacy" and "relevance" are latent variables.
For each manifest variable we also have the double headed loop arrows for variances that we have not shown in the diagram o avoid clutter. The entire model in RAM notation is in the file distmod.spc. The extension .spc is just my arbitrary choice. Let us take a look at a few lines from the file:
dist -> vis, NA, 1       
dist -> und, lam1, NA
dist -> leg, lam2, NA
dist -> rel, lam3, NA
The general syntax for expressing a single-headed straight arrow is like
from -> to , parameter name, initial value
If the parameter name is NA, then the arrow has a prespecified value associated with it (the initial value). If the initial value is NA, then R chooses its own suitable initial value. It is possible to associate the same parameter with multiple arrows. Just use the same parameter name, as done below.
vis -> q116_a, gam11, NA
vis -> q116_b, gam11, NA
The double-headed loop arrow syntax is similar:
dist <-> dist, sig0, NA

vis <-> vis, sig1, NA
und <-> und, sig1, NA
Next we have to load the model into R.
mod1 = specifyModel("distmod.spc")
SEM does not use the raw data. It works with the correlation matrix. It is important that the correlation matrix is nonnegative definite. In presence of missing values this is a problem, because then there are two ways to compute a correlation matrix:
  1. when computing the (i,j)-th entry, use all cases where both of these variables are available. Then the correlation matrix is not gurateeed to be nonnegative definite.
  2. first throw away all cases with at least a single missing value for some variable. Then compute all the correlations based on the remaining data. This will always produce a nonnegative definite matrix.
We can proceed in a naive way by treating the ordinal category numbers as the scores, as doen below.
corMat1 = cor(dat[,used],use="complete.obs")
But a better altrrnative is to respect the ordinal nature of the data, and use polychoric correlation. Unfortunately, the polychoric function of R does handle missing values nicely. So we need to fiter them out manually as follows:
bad = apply(dat[,used],1,function(x) any(is.na(x)))
corMat2 = polychoric(dat[!bad,used])
Finally we can fit the SEM.
res = sem(mod1,corMat1,66) #oops!
res = sem(mod1,corMat1,66,maxiter=200)
If we want to use the polychoric correlation matrix then:
res = sem(mod1,corMat2$rho,66)