Example:
Suppose that we want to find out if academic accomplishment is independent
of gender in Kolkata. Let X denote the academic accomplishment level of a
randomly selected person, and let Y denote that person's gender. Here X takes
one of the seven levels listed in the table below.
To test the null hypothesis H0 that X and Y are independent, we collect a
random sample of size 1200 from the population. For each selected person
we observe X and Y. The resulting dataset is
an example of a cross-classified dataset. Such a dataset may be
presented as a two-way frequency table as follows. [The following dataset
is hypothetical.]
        Pre-school  Primary  Secondary  Higher secondary  Bachelor's  Master's  Above
Male            20       57        279               166         104        31     10
Female          45       50        168               157          90        16      7
Let
n_ij = the frequency in the (i,j)-th cell,
p_i = P(X = i),
q_j = P(Y = j).
We assume that p_i, q_j > 0 for all i,j.
Exercise 5.1:
Find the MLEs of p_i and q_j for all i,j. Use these to
estimate P(X=i, Y=j) under H0. Hence compute
m_ij = estimated E(n_ij) under H0.
Then the chi-squared test rejects H0 for large values of the
following test statistic:
χ² = ∑_i ∑_j (n_ij - m_ij)² / m_ij.
The asymptotic distribution of this statistic under H0 is a χ²
distribution. We shall not prove this here. It follows from the fact that,
under regularity conditions, the null distribution of
-2*log(likelihood ratio)
converges to a χ² distribution. For classified data the degrees of
freedom are
number of classes - 1 - number of free parameters estimated.
What is a free parameter? If θ is a parameter, then so is
2*θ. However, if you have already estimated θ, then you do
not need to estimate 2*θ separately. Thus we have only one free
parameter here. Similarly, if you are estimating a probability vector
(p_1,...,p_k), then you have only k-1 free parameters,
since the components add up to one.
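As a rough illustration, the test statistic defined above can be computed from the frequency table in a few lines of Python. The sketch below assumes the usual estimate m_ij = (i-th row total) * (j-th column total) / n for the expected count under H0. The function names are illustrative and not part of these notes.

```python
def expected_counts(table):
    """Estimated expected counts under independence.

    Assumes m_ij = (row total i) * (column total j) / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared_statistic(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# The hypothetical Kolkata table: rows are Male, Female.
observed = [
    [20, 57, 279, 166, 104, 31, 10],
    [45, 50, 168, 157, 90, 16, 7],
]
print(chi_squared_statistic(observed))
```

The expected counts have the same row and column totals as the observed table, which gives a quick check on the arithmetic.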
Exercise 5.2:
What should the degrees of freedom be for the chi-squared test above if there
are I rows and J columns in the table?
If n ≥ 30 and n_ij, m_ij ≥ 5 for all i,j,
then it is
customary to consider the asymptotic distribution as a good approximation
to the exact distribution. We reject H0 for large values of the
test statistic.
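To turn an observed statistic into a P-value under the asymptotic approximation, one needs the upper tail of a χ² distribution. The following is a minimal sketch using only the Python standard library. The helper names are hypothetical, and the tail probability is computed from the standard power series for the incomplete gamma function. In practice a statistics library would normally be used instead. The degrees of freedom are left as an argument.

```python
import math


def chi2_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom > x).

    Uses the series gamma(a, z) = z^a e^(-z) * sum_k z^k / (a (a+1) ... (a+k))
    with a = df/2 and z = x/2; adequate for moderate x, not tuned for extremes.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a  # k = 0 term of the series
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    lower = math.exp(a * math.log(z) - z - math.lgamma(a)) * total
    return max(0.0, 1.0 - lower)


def rule_of_thumb_ok(observed, expected):
    """Check n >= 30 and every observed and expected count >= 5."""
    n = sum(sum(row) for row in observed)
    cells = [c for row in observed for c in row]
    cells += [c for row in expected for c in row]
    return n >= 30 and min(cells) >= 5


# For df = 2 the upper tail is exactly exp(-x/2), which gives a quick sanity check.
print(chi2_upper_tail(5.991, 2), math.exp(-5.991 / 2))
```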
Exercise 5.3:
Carry out the test for the given data. Report the P-value.
Exercise 5.4:
The test statistic here is only asymptotically distribution-free
under H0. Give a counterexample to show that the finite-sample
distribution is not distribution-free. [Hint: In fact, even to define the test statistic you need infinite
sample size, because for any finite sample size there is some chance that
m_ij is zero for some i,j.]
Exercise 5.5:
Typically ISI students either plan to do research or take up a job after
M.Stat. Last semester I collected some data on this. [The following data are
real. They were collected from the Stat Computing students (MII, 2004-2005).]
             Research  Job  Undecided
B.Stat.             2    1          3
non B.Stat.         7    7          1
I want to use this to test whether the trend differs between B.Stat. and
non B.Stat. students in the Statistical Computing class.
However, my sample size is small. How should I proceed?
You do not need to compute anything numerically. Just suggest a method.
One method is to perform Fisher's exact test, which finds the exact
conditional distribution under H0 given the marginals. This
conditional distribution is a multivariate analogue of the hypergeometric
distribution. It is obtained by conditioning the
multinomial distribution in the same way as the hypergeometric distribution is
obtained by conditioning the binomial.
Another solution is to list all possible tables with the given
marginals. These are not equally likely under H0: each table has the
conditional probability given by the distribution just described. Compute
the test statistic for all these tables and build a histogram, weighting
each table by its probability. This is the exact conditional
distribution of the test statistic under H0. Now locate the
test statistic value for the given data in this histogram to find the P-value.
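The enumeration approach can be sketched as follows for a table with two rows. Under H0, conditionally on the margins, the first row follows a multivariate hypergeometric distribution, so the sketch weights each table by that probability. The function names are hypothetical, and ordering the tables by the χ² statistic is an assumption made for illustration.

```python
from math import comb


def chi_squared_statistic(table):
    """Pearson statistic with m_ij = (row total) * (column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (table[i][j] - r * c / n) ** 2 / (r * c / n)
        for i, r in enumerate(row_totals)
        for j, c in enumerate(col_totals)
    )


def first_rows(row_total, col_totals):
    """Yield every first row with the given total, cell j at most col_totals[j]."""
    if len(col_totals) == 1:
        if 0 <= row_total <= col_totals[0]:
            yield (row_total,)
        return
    for a in range(min(row_total, col_totals[0]) + 1):
        for rest in first_rows(row_total - a, col_totals[1:]):
            yield (a,) + rest


def exact_conditional_distribution(observed):
    """List (statistic, probability) over all two-row tables with the observed margins."""
    top_total = sum(observed[0])
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(col_totals)
    result = []
    for top in first_rows(top_total, col_totals):
        bottom = [c - a for c, a in zip(col_totals, top)]
        # Multivariate hypergeometric probability of this first row.
        prob = 1.0
        for c, a in zip(col_totals, top):
            prob *= comb(c, a)
        prob /= comb(n, top_total)
        result.append((chi_squared_statistic([list(top), bottom]), prob))
    return result


def exact_p_value(observed):
    """Total probability of tables whose statistic is at least the observed one."""
    obs_stat = chi_squared_statistic(observed)
    dist = exact_conditional_distribution(observed)
    return sum(p for s, p in dist if s >= obs_stat - 1e-9)


# Data from the Stat Computing class: rows are B.Stat., non B.Stat.
class_table = [[2, 1, 3], [7, 7, 1]]
print(exact_p_value(class_table))
```

The sketch assumes every row and column total is positive, as is the case for the class data.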
Kolmogorov-Smirnov Test
We have data
X_1,...,X_n iid from a continuous distribution F,
which is unknown. G is some completely specified continuous
distribution. We want to test
H0: F = G vs. H1: F ≠ G.
The KS test rejects H0 for large values of the following test
statistic
D = sup_x |F_n(x) - G(x)|.
Here F_n(x) is the empirical distribution function. It is
defined as
F_n(x) = proportion of X_i's ≤ x.
This is a reasonable thing to do because, for large n, the empirical
distribution function is close to the unknown F.
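Because the empirical distribution function is a step function and G is continuous, the supremum in D is approached at, or just to the left of, one of the data points, so D can be computed from the sorted sample. A minimal sketch, with a hypothetical function name and the Unif(0,1) distribution function used purely as an example of G:

```python
def ks_statistic(sample, cdf):
    """D = sup_x |F_n(x) - G(x)| for a continuous distribution function cdf.

    Assumes the sample has no ties, as happens with probability one
    for data from a continuous distribution.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        g = cdf(x)
        # F_n jumps from (i-1)/n to i/n at the i-th smallest observation.
        d = max(d, i / n - g, g - (i - 1) / n)
    return d


def uniform_cdf(x):
    """Distribution function of Unif(0,1), used here as an example G."""
    return min(1.0, max(0.0, x))


print(ks_statistic([0.1, 0.4, 0.7], uniform_cdf))
```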
Exercise 5.6:
State some theorem that makes this concept precise. [Hint: There is more than one such theorem that you already know.
Pick one.
You should be able to
state it rigorously.]
Exercise 5.7:
Show that D ≥ 1/(2n).
Exercise 5.8:
Suppose that
X_1,...,X_n
are iid from the continuous distribution G, and
U_1,...,U_n
are iid Unif(0,1). Sort the X's as
X_(1) ≤ ... ≤ X_(n)
and the U's as
U_(1) ≤ ... ≤ U_(n).
Show that the joint distribution of the G(X_(i))'s
is the same as that of the U_(i)'s.
Exercise 5.9:
Write down the joint density of the U_(i)'s defined
above for general n.
For n=2 use the density to explicitly compute the distribution of
D2.
The joint density of the U_(i)'s is
f(u_1,...,u_n) = n! if
0 ≤ u_1 ≤ ... ≤ u_n ≤ 1, and 0 otherwise.
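The claim of Exercise 5.8 can be checked informally by simulation before proving it. The sketch below, with hypothetical names and an Exp(1) distribution chosen arbitrarily as G, sorts each sample in increasing order and compares the average of the i-th smallest transformed value with i/(n+1), the mean of the i-th smallest of n Unif(0,1) variables. Agreement of the means is only a sanity check, not a proof.

```python
import math
import random


def exp_cdf(x):
    """Distribution function of Exp(1), standing in for a continuous G."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def mean_transformed_order_stats(n, reps, rng):
    """Average of the sorted values G(X) over reps simulated samples of size n from G."""
    sums = [0.0] * n
    for _ in range(reps):
        xs = sorted(rng.expovariate(1.0) for _ in range(n))
        for i, x in enumerate(xs):
            sums[i] += exp_cdf(x)
    return [s / reps for s in sums]


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5
means = mean_transformed_order_stats(n, 20000, rng)
for i, m in enumerate(means, start=1):
    # simulated mean next to the theoretical value i/(n+1)
    print(i, round(m, 3), round(i / (n + 1), 3))
```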
One-sided KS
The KS test above is a two-sided one. We can also perform one-sided KS
tests. For this we need to introduce a partial order over the set of all
distributions.
Exercise 5.10:
Prove/disprove/suitably modify:
F ≼ G iff F(x) ≥ G(x) for all x.
Exercise 5.11:
Show that ≼ is a partial order on the set of all distribution
functions.
Exercise 5.12:
Is it a complete order as well? Why or why not?
In the one-sided version of the KS test we test
H0: F = G vs. H1: F ≼ G
or
H0: F = G vs. H1: F ≽ G.
For this we use the following test statistics:
D+ = sup_x (F_n(x) - G(x))
and
D- = sup_x (G(x) - F_n(x)).
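Both one-sided statistics can be computed from the sorted sample: the first supremum is attained at a data point, and the second is approached just to the left of a data point. A minimal sketch with hypothetical names, using Unif(0,1) only as an example of G:

```python
def ks_one_sided(sample, cdf):
    """Return (D+, D-) for a continuous distribution function cdf."""
    xs = sorted(sample)
    n = len(xs)
    # D+ = sup (F_n - G): attained at a data point, where F_n = i/n.
    d_plus = max(i / n - cdf(x) for i, x in enumerate(xs, start=1))
    # D- = sup (G - F_n): approached just left of a data point, where F_n = (i-1)/n.
    d_minus = max(cdf(x) - (i - 1) / n for i, x in enumerate(xs, start=1))
    return d_plus, d_minus


def uniform_cdf(x):
    """Distribution function of Unif(0,1), used here as an example G."""
    return min(1.0, max(0.0, x))


d_plus, d_minus = ks_one_sided([0.9, 0.95], uniform_cdf)
print(d_plus, d_minus)
```

In this example the sample sits well to the right of what Unif(0,1) would suggest, and the two statistics respond very differently, which is the point of having two one-sided versions.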
Exercise 5.13:
We reject H0 for large values of the test statistic. Can you
say which test statistic is used for which test?
Exercise 5.14:
Show that D+, D- ≥ 0. What is the relation
between D, D+ and D-?