[Update:[Mon Nov 03 IST 2014]]

Analysis of Discrete Data

M.Stat. II, 2014

According to ISI Brochure

Reference books

An Introduction to Categorical Data Analysis (2nd edn, 2007) by Alan Agresti [Download]
Analysis of Ordinal Categorical Data (2nd edn, 2010) by Alan Agresti[Download]
Analysis of Categorical Data (2nd edn, 2002) by Alan Agresti
Structural Equation Modelling : A Bayesian Approach (2007) by Sik-Yum Lee[Download]
Latent variable models : An Introduction to Factor, Path and Structural Equation Analysis (2003) by Loehlin[Download]

Take your mouse over a date to see the reading assignments for that date. Clicking on a date will take you to additional material (if any) for that date.

Additional material

An unusual experiment leading to categorical data (July 21)

Milgram's Obedience Video

Assessing performance of some statistical method by simulation (July 24)

First you write a function that takes the data set as input, carries out the statistical method and returns the performance criterion you are using to assess the method. In the following example we are using the simple CI formula for binomial proportion, and the performance criteria is the length of the interval.

howGood = function(nHead, n) {
  pHat = nHead/n
  seHat = sqrt(pHat*(1-pHat)/n)
  radius = 1.96*seHat
  upperLim = min(1,pHat + radius) 
  lowerLim = max(0,pHat - radius)
  return(upperLim - lowerLim)
}

Now take a true vlue for the parameter, geeraate lots of data sets and apply your function on each of the data sets. Find the mean value of the performance criterion.

  
trueP = 0.5
X = rbinom(1000,100,prob=trueP)
val = c()
for(i in 1:1000) val[i] = howGood(X[i],100)
mean(val)

Loops are always cumbersome to write and, in R, they also tend to be slow. A somewhat faster and cleaner alternative is to use the sapply function:

trueP = 0.5
X = rbinom(1000,100,prob=trueP)
val= sapply(X,howGood,n=100)
mean(val)

Whichever way you use the average performace criterion tells you how your statistical method performs for particular true value of the parameter you have chosen (0.5, in this example). To learn about the global behaviour throughout the entire paramer space, you need to take a grid of true values and repeat the entire exercise for each value. You can again use sapply, but here I have used an explicit loop.

performance= c()
truePList = seq(0.01,0.99,0.01)
for(i in 1:length(truePList)) {
  X = rbinom(1000,100,prob=truePList[i])
  val= sapply(X,howGood,n=100)
  performance[i] = mean(val)
}
plot(truePList,performance,ty='l')

This plot gives you a picture of how well your method performs. Such a plot by itself, however, is not of much value. When you have different methods, then comparing them by overlaying such plots. Suppose that another method produces a vector called performance2. Then the overlaying may be done as

lines(truePList,performance2)

Here is a bunch of binomial CI's (from Agresti's website). You may like to compare them using this technique.

Intro to working with graphical data in R (July 31)

Here is a transcipt of all the things we did in class today. Don't feel disheartened with yourself (or annoyed with me) if you could not follow everything. R is not difficult. But, like every other language, needs a little hands on practice. Carefully go through the commands given below. Actually execute them in R! R is available in the CSSC machines (4th floor, S N Bose Bhavan). The names in bold refer to functions provided by R. Experiment by modifying the lines to suit your need. Look up the R help by typing a question mark followed by the name of the function you want to help about.

Loading categorical data

frq = matrix(c(509,398,116,104),2,2)
rownames(frq)=c("Female","Male")
colnames(frq)=c("Yes","No/undecided")

belief.data = read.table("belief.txt",head=T)

belief.data = read.csv("belief.csv",head=T)

##Download the file afterlife.xls
##Open it and copy the data table (including the variabke names).
#install.packages('psych')
library(psych)
belief.data = read.clipboard.tab()

Looking at categorical data

levels(belief.data[,1])

levels(belief.data[,2]) #oops!

belief.data[,2] = as.factor(belief.data[,2])

tbl1 = table(belief.data)

(tbl2 = addmargins(tbl1))

xtabs(~.,belief.data)

DF = as.data.frame(UCBAdmissions)

xtabs(Freq ~ Gender + Admit, DF)

(tbl=xtabs(Freq ~ Gender + Dept + Admit, DF))

ftable(tbl)

Plotting (1D)

x = sample(c(1,2,3,4,5,6),100,replace=T)

barplot(table(x))

barplot(table(x)/length(x))

x = rpois(100,lambda=2)
old = par(mfrow=c(2,1))
barplot(table(x)/length(x))
barplot(dpois(0:max(x),lambda=2))
par(old)

Plotting (2D)

library(rgl)

demo(rgl)

barplot.3d = function(tbl) {
  m = nrow(tbl)
  n = ncol(tbl)

  scl = max(m,n)/max(tbl)
  cat("scl = ",scl,"\n")

  for(i in 1:m) {
    for(j in 1:n) {
      binplot.3d(c(i,i+0.9),c(j,j+0.9),scl*tbl[i,j])
    }
  }
}

rgl.open()
barplot.3d(tbl1)

#install.packages('vcd')
library(vcd)
data("UCBAdmissions")

x = aperm(UCBAdmissions, c(2, 1, 3))

dimnames(x)[[2]] = c("Yes", "No")
names(dimnames(x)) = c("Sex", "Admit?", "Department")

ftable(x)

fourfold(margin.table(x, c(1, 2)))

data("HairEyeColor")

(tab = margin.table(HairEyeColor, c(2,1)))

sieve(tab, sievetype = "expected", shade = TRUE)

sieve(tab, shade = TRUE)


(x = margin.table(HairEyeColor, c(1, 2)))
assoc(x)

assoc(x, main = "Relation between hair and eye color", shade = TRUE)

Testing

tbl2
x = c(509,398)
n = c(625,502)
prop.test(x, n)

prop.test(tbl2[-3,1],tbl2[-3,3])

gamma.f = function(x) {
  m = nrow(x)
  n = ncol(x)

  res = matrix(0,m, n)
  for(i in 1:(m-1)) 
    for(j in 1:(n-1))
      res[i,j] = x[i,j]*sum(x[(i+1):m,(j+1):n])

  C = sum(res)

  for(i in 1:(n-1))
    for(j in 2:m)
      res[i,j] = x[i,j]*sum(x[(i+1):m,1:(j-1)])

  D = sum(res)

  gamma = (C-D)/(C+D)

  list( gamma=gamma, C=C, D=D)
}

tbl1
gamma.f(tbl1)

Odds ratio and χ²test (Aug 04)

Working with odds ratio in R

#install.packages('vcd') #You need to do this only once for a machine.
library(vcd) #Once for each session

(lor  =  oddsratio(CoalMiners))

summary(lor)
     
confint(lor)
     
plot(lor,
     xlab = "Age Group",
     main = "Breathelessness and Wheeze in Coal Miners")
     
g  =  seq(25, 60, by = 5)
m  =  lm(lor ~ g + I(g^2))
lines(fitted(m), col = "red")

Association measures based on χ²

Our text book is strangely silent about various useful association measures based on χ². Please read this document for a practical discussion of these. The following R example is based on the discussion there.

(x = matrix(c(65,35,25,25),2,2))
(y = matrix(c(75,25,25,25),2,2))
chisq.test(x,correct=F) #Don't do Yate's continuity correction
chisq.test(y,correct=F) 
chisq.test(4*x,correct=F)

assocstats(x) 
assocstats(y) 
assocstats(4*x)

Simulating a 2×2 table under independence for multinomial scheme

val = numeric(1000)
for(i in 1:1000) {
   rw = sample(2,100,repl=T)
   cl = sample(2,100,repl=T)
   dat = data.frame(rw,cl)
   val[i] = oddsratio(table(dat))
}

hist(val)
qqnorm(val)

Testing association for ordinal contingency tables (Aug 07)

How categorisation reduces correlation

library(mvtnorm)
disp = matrix(c(1,0.9,0.9,1),2,2)
dat = rmvnorm(1000,sigma=disp)
f = function(x,howMany) {
  brX = quantile(x,seq(0,1,len=howMany))
  dX = cut(x,brX,include.lowest=T)
  scoreX = brX[-1]-diff(brX)/2
  scoreX[dX]
}
scr=apply(dat,2,f,howMany=30) #reducing howMany from 30 to 3 will reduce the correlation
cor(scr)

Using ridit scoring

ridit=function(x) {
  tx = table(x)/length(x)
  nCat = length(tx)
  sc=c(0,cumsum(tx)[-nCat])
  rid=sc+tx/2
  rid[x]
}

Understanding tetrachoric correlation

library(psych)
rho = 0.9
disp = matrix(c(1,rho,rho,1),2,2)
dat = rmvnorm(1000,sigma=disp)
tmp1=cut(dat[,1],br=c(-4,0,4))
tmp2=cut(dat[,2],br=c(-4,0,4))
xx = data.frame(tmp1,tmp2)
tetrachoric(table(xx))

Fisher's exact test (Aug 11)

Read about ABX tests from Wikipedia.

Here are some actual audio ABX tests that you can perform on yourself.

The command to perform Fisher's exact test in R on a contingency table x is

fisher.test(x)

If your table/freq is too large, then the exact test may prove too resource intensive. Then you can use

fisher.test(x,simul=T)

to use a Monte Carlo version, an exact test done approximately!!!

See this paper by Mehta and Patel for the network algorithm used for Fisher's exact test.

Multiway table and Simpson's paradox (Aug 14)

You need to download this data set first.

library(psych)
library(vcd)

dat = read.clipboard.tab()

dim(dat)
table(dat)

xtabs(~Death,dat)

xtabs(~Defendant,dat)

xtabs(~Victim,dat)

vic.def = xtabs(~Victim+Defendant,dat)
fourfold(vic.def)

oddsratio(vic.def)

oddsratio(aperm(table(dat),c(1,3,2)))

Simpson starts
oddsratio(xtabs(~Death+Defendant+Victim,dat))

oddsratio(xtabs(~Death+Victim,dat))

ft = xtabs(~.,dat)
mt = margin.table(ft,1:2)
(dp=ft[,,"Yes"]/mt)

plot(1,ylim=range(dp),xlim=c(0,3),ty='n')

text(2,dp["White","White"],'W')

points(2,dp["White","White"],cex=0.05*mt["White","White"])


text(1,dp["White","Black"],'W')
points(1,dp["White","Black"],cex=0.05*mt["White","Black"])

text(2,dp["Black","White"],'B')
points(2,dp["Black","White"],cex=0.05*mt["Black","White"])

text(1,dp["Black","Black"],'B')
points(1,dp["Black","Black"],cex=0.05*mt["Black","Black"])


lines(c(1,2), dp["Black",],lwd=3)

lines(c(1,2), dp["White",],lwd=3)

(vicTot=margin.table(ft,1))
(vicDth=margin.table(ft[,,"Yes"],1))

lines(c(1,2),vicDth/vicTot,lty=2,lwd=3)

Binary data: MCMC and probit etc(Aug 21)

Binary "contingency tables"

A relevant paper by Chen et al containing the Darwin Finch data set.

Probit analysis

Here is a typical victim of the analysis we are going to learn today. Click on the image for more details about cruelty on research animals.

Binary data: MCMC and probit etc(Aug 25)

Warm up with a hypothetical data set

The hypothetical poison data set used in class.

dat = read.csv('probdat.csv',head=T)
names(dat)
propDeath = with(dat,Death/Size)
with(dat,plot(Dose,propDeath))
with(dat,plot(Dose,qnorm(propDeath)))
with(dat,lm(qnorm(propDeath)~Dose))

Ordinal explanatory variable: Snoring data

This data set is from our text book, and you may download it snore.csv.

(dat = read.csv("snore.csv",head=T))
rowTot = apply(dat[,-1],1,sum)
prob = rowTot/sum(rowTot)

ridit=function(prob) {
  nCat = length(prob)
  sc=c(0,cumsum(prob)[-nCat])
  cat(sc)
  sc+prob/2
}

(x =  ridit(prob))

(propYes = dat[,2]/rowTot)
(probit = qnorm(propYes))
plot(x,probit)
(fit1 = lm(probit~x))

abline(fit1$coef)

(oddsYes = dat[,2]/dat[,3])
(logit = log(propYes))
plot(x,logit)
(fit2 = lm(logit~x))

abline(fit2$coef)

Horseshoe crab data set

The following photo taken from this web page shows a female horseshoe crab, its male mate and 4 satellite males.

Here is the data set.

dat = read.table("crab.txt",head=T)
dim(dat)
head(dat)
tail(dat)
with(dat,plot(W,Sa))
with(dat,lines(lowess(W,Sa),col='red',lwd=3))
(fit = glm(Sa ~ W,family="poisson",data=dat))
summary(fit)

with(dat,plot(W,Sa,col=S))
with(dat,lines(lowess(W,Sa),col='red',lwd=3))
(fit = glm(Sa ~ W+as.factor(S),family="poisson",data=dat))
summary(fit)

CMH tests(Sep 22)

Our textbook discusses only the 2×2×k case. You may read about the generalised case from the SAS documentation on generalised CMH test.

Multicategiry and ordinal logistic analysis (Oct 16)

allig = read.table('allig.txt',head=T)
#--------------------------------------
names(allig)
dim(allig)
#--------------------------------------
head(allig)
#--------------------------------------
library(nnet)
(fit = multinom(food ~ length, data=allig))
#--------------------------------------
pi.hat = fitted(fit)
pi.hat
#--------------------------------------
pi.hat %*% rep(1,3)
#--------------------------------------
plot(allig$length,pi.hat[,1],ylim=range(pi.hat),ty='l') #oops!
#--------------------------------------
ord = order(allig$length)
plot(allig$length[ord],pi.hat[ord,1],ylim=range(pi.hat),ty='l')
#--------------------------------------
lines(allig$length[ord],pi.hat[ord,2],col='red')
lines(allig$length[ord],pi.hat[ord,3],col='blue')
#--------------------------------------
allig$food
#--------------------------------------
relevel(allig$food,ref="O"))
#--------------------------------------
allig$food2 = relevel(allig$food,ref="O")
(fit2 = multinom(food2 ~ length, data=allig))
#---------Loading a data in dta format (used by STATA)
library(foreign)
#--------------------------
dat = read.dta("hsbdemo.dta")
#read.dta("http://www.ats.ucla.edu/stat/data/hsbdemo.dta")
#--------------------------
dim(dat)
names(dat)
head(dat)
#--------------------------
(fit = multinom(prog ~ ses + write, data = dat))
#--------------------------
summary(fit)
#--------------------------
names(summary(fit))
#--------------------------
(testStat = summary(fit)$coeff/summary(fit)$standard.errors)
#--------------------------
(1 - pnorm(abs(testStat))) * 2
#------------------------
frq = c(44,47,118,23,32,
18,28,86,39,48,
36,34,53,18,23,
12,18,62,45,51)

gender = c("F","M")

party = c("Dem","Rep")

ideology = 1:5
#--------------------------

dat = data.frame(expand.grid(ideology,party,gender),frq)

names(dat) = c("ideo","party","gender","frq")
#--------------------------

ftable(xtabs(frq~ideo+party+gender,dat))
#--------------------------
library(MASS)
fit = polr(ideo~party+gender,weight=frq,data=dat) #oops
#--------------------------
fit = polr(as.factor(ideo)~party+gender,weight=frq,data=dat)
(fit=polr(as.factor(ideo)~party,weight=frq,data=dat))
summary(fit)
#---------------------
confint(fit)

Structural Equation Modelling (Nov 03)

Today we shall fit a structural equation model (SEM) to a real life data collected by a wellknown company in Kolkata. The data were collected by using a questionnaire where most of the questions were to be answered in a 7-point Likert scale.

We first load the relevant packages.

library(sem)
library(psych)

Now load the data set.

dat = read.table("data.txt",head=TRUE)

In this example we shall focus on only the following items.

116(a): To what extent do policies and systems exist with respect to Work Organization ?
116(b): To what extent do policies and systems exist with respect to Performance Appraisal ?
116(c): To what extent do policies and systems exist with respect to Salary and Benefits ?
116(d): To what extent do policies and systems exist with respect to Competence Development ?
116(e): To what extent do policies and systems exist with respect to Career Management ?
116(f): To what extent do policies and systems exist with respect to Information Sharing ?
117: To what extent are HR Policies and Systems formalized in writing ?
118: To what extent is the employee population covered by HR Policies and Systems ?
119: To what extent do you understand how HRM Policies and Systems work ?
120: To what extent does the HRM Function enjoy credibility ?
121: To what extent is the HRM Function given importance and status ?
122(a): (Are you satisfied with the) Recruitment and Selection
122(b): (Are you satisfied with the) Induction
122(c): (Are you satisfied with the) Work Design
122(d): (Are you satisfied with the) Remuneration
122(e): (Are you satisfied with the) Recognition
122(f): (Are you satisfied with the) Training and Development
122(g): (Are you satisfied with the) Performance Management
122(h): (Are you satisfied with the) Employee Communication
123: To what extent are investments made in and resources provided to the HRM Function ?
124: To what extent do the HRM Policies & Systems enable you to align your goals, needs & aspirations with the organization's goals, needs and aspirations ?
125(a): Exhibit leadership for HRM
125(b): Define and communicate HRM vision for the future
125(c): Educate and influence senior management on HRM issues
125(d): Have adequate knowledge of HRM practices
125(e): Are knowledgeable about competitors' HRM practices
125(f): Focus on the quality of HRM deliverables
125(g): Have the ability to link business imperatives to HRM policies and practices

We start by extracting the relvant columns of the data.


start = (1:ncol(dat))[colnames(dat)=="q116_a"]
end = (1:ncol(dat))[colnames(dat)=="q125_g"]

used = start:end;

Our aim is to fit the following SEM where "distinctiveness", "visibility", "understandability", "legitimacy" and "relevance" are latent variables.

For each manifest variable we also have the double headed loop arrows for variances that we have not shown in the diagram o avoid clutter. The entire model in RAM notation is in the file distmod.spc. The extension .spc is just my arbitrary choice. Let us take a look at a few lines from the file:

dist -> vis, NA, 1       
dist -> und, lam1, NA
dist -> leg, lam2, NA
dist -> rel, lam3, NA

The general syntax for expressing a single-headed straight arrow is like

from -> to , parameter name, initial value

If the parameter name is NA, then the arrow has a prespecified value associated with it (the initial value). If the initial value is NA, then R chooses its own suitable initial value. It is possible to associate the same parameter with multiple arrows. Just use the same parameter name, as done below.

vis -> q116_a, gam11, NA
vis -> q116_b, gam11, NA

The double-headed loop arrow syntax is similar:

dist <-> dist, sig0, NA

vis <-> vis, sig1, NA
und <-> und, sig1, NA

Next we have to load the model into R.

mod1 = specifyModel("distmod.spc")

SEM does not use the raw data. It works with the correlation matrix. It is important that the correlation matrix is nonnegative definite. In presence of missing values this is a problem, because then there are two ways to compute a correlation matrix:

when computing the (i,j)-th entry, use all cases where both of these variables are available. Then the correlation matrix is not gurateeed to be nonnegative definite.
first throw away all cases with at least a single missing value for some variable. Then compute all the correlations based on the remaining data. This will always produce a nonnegative definite matrix.

We can proceed in a naive way by treating the ordinal category numbers as the scores, as doen below.

corMat1 = cor(dat[,used],use="complete.obs")

But a better altrrnative is to respect the ordinal nature of the data, and use polychoric correlation. Unfortunately, the polychoric function of R does handle missing values nicely. So we need to fiter them out manually as follows:

bad = apply(dat[,used],1,function(x) any(is.na(x)))
corMat2 = polychoric(dat[!bad,used])

Finally we can fit the SEM.

res = sem(mod1,corMat1,66) #oops!
res = sem(mod1,corMat1,66,maxiter=200)

If we want to use the polychoric correlation matrix then:

res = sem(mod1,corMat2$rho,66)