cwave.eu5.org
Also see: http://www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

Last updated on Fri May 21 11:52:16 IST 2010.

More one-sample procedures

Chi-squared test for independence

Example: Suppose that we want to find out if academic accomplishment is independent of gender in Kolkata. Let X denote the academic accomplishment level of a randomly selected person, and let Y denote his/her gender. Here X takes the following possible values:
Pre-school, Primary, Secondary, HS, Bachelor's degree, Masters, Above.
We are interested in testing
H0: X is independent of Y Vs H1: X is not independent of Y.
To test this we collect a random sample of size 1200 from the population. For each selected person we observe X and Y. The resulting dataset is an example of a cross-classified dataset. Such a dataset may be presented as a two-way frequency table as follows [The following dataset is hypothetical.]
       | Pre-school | Primary | Secondary | Higher secondary | Bachelor's | Master's | Above
Male   |         20 |      57 |       279 |              166 |        104 |       31 |    10
Female |         45 |      50 |       168 |              157 |         90 |       16 |     7
Let
nij = the frequency in the (i,j)-th cell.
pi = P(X = i)
qj = P(Y = j)
We assume that pi, qj > 0 for all i,j.

Exercise 5.1: Find the MLEs of pi and qj for all i,j. Use these to estimate P(X=i, Y=j) under H0. Hence compute
mij = estimated E(nij) under H0.

Then the chi-squared test rejects H0 for large values of the following test statistic
χ² = ∑i ∑j (nij - mij)² / mij
The asymptotic distribution of this under H0 is a χ² distribution. We shall not prove this here. It follows from the fact that, under suitable regularity conditions, the null distribution of
-2*log(likelihood ratio)
converges to a χ² distribution. For classified data the number of degrees of freedom is
number of classes - 1 - number of free parameters estimated.
What is a free parameter? If θ is a parameter, then so is 2*θ. However, if you have already estimated θ, then you do not need to estimate 2*θ separately. Thus we have only one free parameter here. Similarly, if you are estimating a probability vector (p1,...,pk), then you have only k-1 free parameters, since the components add up to one.

Exercise 5.2: What should the degrees of freedom be for the chi-squared test above if there are I rows and J columns in the table?

If n ≥ 30 and nij,mij ≥ 5 for all i,j, then it is customary to consider the asymptotic distribution as a good approximation to the exact distribution. We reject H0 for large values of the test statistic.
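
As an illustration, the following Python sketch computes the estimated expected counts and the test statistic for the hypothetical table above, and checks the rule of thumb just stated. It assumes the usual estimate mij = (row total × column total)/n under H0. For comparison it also evaluates -2*log(likelihood ratio), which for multinomial counts can be written as 2 ∑ nij log(nij/mij). All function and variable names are made up for this sketch.

```python
import math

# Hypothetical counts from the table (rows: Male, Female;
# columns: Pre-school, ..., Above).
observed = [
    [20, 57, 279, 166, 104, 31, 10],
    [45, 50, 168, 157, 90, 16, 7],
]

def expected_counts(table):
    """Estimated E(nij) under H0: (row total * column total) / n."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

def pearson_chisq(table, expected):
    """Sum over all cells of (nij - mij)^2 / mij."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def minus_two_log_lr(table, expected):
    """2 * sum of nij * log(nij / mij); cells with nij = 0 contribute 0."""
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, expected)
                   for o, e in zip(obs_row, exp_row) if o > 0)

n = sum(sum(row) for row in observed)
m = expected_counts(observed)
chisq = pearson_chisq(observed, m)
g2 = minus_two_log_lr(observed, m)

# Rule of thumb: n >= 30 and every nij and mij at least 5.
rule_ok = (n >= 30
           and min(min(row) for row in observed) >= 5
           and min(min(row) for row in m) >= 5)

print("n =", n)
print("chi-squared statistic:", round(chisq, 3))
print("-2 log(likelihood ratio):", round(g2, 3))
print("rule of thumb satisfied:", rule_ok)
```

The two statistics are typically close when the rule of thumb holds. Turning either one into a P-value still requires the degrees of freedom.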

Exercise 5.3: Carry out the test for the given data. Report the P-value.

Exercise 5.4: The test statistic here is only asymptotically distribution-free under H0. Give a counterexample to show that the finite sample distribution is not distribution-free.
[Hint: In fact, even to define the test statistic you need infinite sample size. Because, for any finite sample size there is some chance that mij is zero for some i,j.]

Exercise 5.5: Typically ISI students either plan to do research or take up a job after M Stat. Last semester I collected some data on this. [The following data is real. It was collected from the Stat Computing students (MII, 2004-2005).]
            | Research | Job | Undecided
B.Stat.     |        2 |   1 |         3
non B.Stat. |        7 |   7 |         1
I want to use this to test if the trend differs between B.Stat. and non B.Stat. students in the Statistical Computing class. However, my sample size is small. How should I proceed? You do not need to compute anything numerically. Just suggest a method.
One method is to perform Fisher's exact test, which finds the exact conditional distribution under H0 given the marginals. This conditional distribution is like the hypergeometric distribution. It is obtained by conditioning the multinomial distribution in the same way as the hypergeometric distribution is obtained by conditioning the binomial.

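Another solution is to list all possible tables with the given marginals. Under H0 these tables are not equally likely: given the marginals, each table has the hypergeometric-type conditional probability described above. Compute the test statistic for each table, and build a histogram in which each table is weighted by its conditional probability. This is the exact conditional distribution of the test statistic under H0. Now locate the test statistic value for the given data in this histogram to find the P-value.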
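
The enumeration idea can be sketched in Python for the 2 × 3 table of Exercise 5.5. This is only a sketch under stated assumptions: the helper names are hypothetical, the chi-squared statistic is used as one possible choice of test statistic, and each table is weighted by its conditional probability given the marginals (a product of binomial coefficients divided by a binomial coefficient, which is the multivariate hypergeometric form).

```python
from math import comb

observed = [[2, 1, 3],   # B.Stat.:     Research, Job, Undecided
            [7, 7, 1]]   # non B.Stat.: Research, Job, Undecided

row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)

def chisq(table):
    """Chi-squared statistic using the fixed observed marginals."""
    total = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            total += (o - e) ** 2 / e
    return total

def tables_with_probs():
    """Yield (table, conditional probability under H0) for every
    2 x 3 table of non-negative counts with the observed marginals."""
    r1 = row_totals[0]
    c1, c2, c3 = col_totals
    denom = comb(n, r1)
    for a in range(min(r1, c1) + 1):
        for b in range(min(r1 - a, c2) + 1):
            c = r1 - a - b          # first row is fixed by its total
            if c > c3:
                continue
            prob = comb(c1, a) * comb(c2, b) * comb(c3, c) / denom
            yield [[a, b, c], [c1 - a, c2 - b, c3 - c]], prob

all_tables = list(tables_with_probs())
t_obs = chisq(observed)

# P-value: total probability of tables at least as extreme as the data.
# The small tolerance guards against floating-point ties.
p_value = sum(p for t, p in all_tables if chisq(t) >= t_obs - 1e-12)

print("number of tables:", len(all_tables))
print("observed statistic:", round(t_obs, 3))
print("exact conditional P-value:", round(p_value, 4))
```

Ordering the tables by a different statistic (for example by their probabilities, as in the usual form of Fisher's exact test) changes which tables count as "at least as extreme" and may give a slightly different P-value.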

Kolmogorov-Smirnov Test

We have data
X1,...,Xn iid continuous F,
which is unknown. G is some completely specified continuous distribution. We want to test
H0: F = G Vs. H1: F ≠ G.
The KS test rejects H0 for large values of the following test statistic
D = supx | Fn(x) - G(x)|.
Here Fn(x) is the empirical distribution function. It is defined as
Fn(x) = proportion of Xi's ≤ x.
This is a reasonable thing to do because, for large n, the empirical distribution function is close to the unknown F.
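
A minimal Python sketch of how D can be computed for a finite sample, assuming G is available as a function. Since Fn is a step function and G is continuous and non-decreasing, the supremum is reached at a data point or just to the left of one, so it suffices to compare G at each sorted observation with the value of Fn at the jump and just before it. The function name ks_statistic, the sample, and the choice G = Unif(0,1) are illustrative assumptions.

```python
def ks_statistic(sample, G):
    """Return sup over x of |Fn(x) - G(x)| for a continuous cdf G."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        g = G(x)
        # Fn equals i/n at x and (i-1)/n just to the left of x.
        d = max(d, abs(i / n - g), abs((i - 1) / n - g))
    return d

def unif_cdf(x):
    """Distribution function of Unif(0,1)."""
    return min(max(x, 0.0), 1.0)

sample = [0.10, 0.35, 0.40, 0.80, 0.95]   # made-up data
D = ks_statistic(sample, unif_cdf)
print("D =", D)
```

Any other completely specified continuous G can be passed in place of unif_cdf.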

Exercise 5.6: State some theorem that makes this concept precise.
[Hint: There is more than one such theorem that you already know. Pick one. You should be able to state it rigorously.]

Exercise 5.7: Show that D ≥ 1/(2n).

Exercise 5.8: Suppose that
X1,...,Xn
are iid continuous G, and
U1,...,Un
are iid Unif(0,1). Sort the X's as
X(1) ≤ ... ≤ X(n)
and the U's as
U(1) ≤ ... ≤ U(n)
Show that the joint distribution of the G(X(i))'s is the same as that of the U(i)'s.

Exercise 5.9: Write down the joint density of the U(i)'s defined above for general n. For n=2 use the density to explicitly compute the distribution of D2.
The joint density of the U(i)'s is
f(u1,...,un) = n! if 0 ≤ u1 ≤ ... ≤ un ≤ 1, and 0 otherwise.

One-sided KS

The KS test above is a two-sided one. We can also perform one-sided KS tests. For this we need to introduce a partial order over the set of all distributions.

Definition: Let X, Y be two random variables with distribution functions F, G, respectively. If for all c,
P(X > c) ≥ P(Y > c),
then we say that X is stochastically larger than Y, and write
F ≽ G.
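
As a concrete instance of the definition, the sketch below compares two exponential distributions through their tail probabilities P(X > c) on a grid of c values. The rates 1 and 2 and the grid are arbitrary illustrative choices, and a grid check only illustrates the inequality; it does not prove it for all c.

```python
import math

def exp_survival(rate):
    """Return the function c -> P(X > c) for X ~ Exponential(rate)."""
    def survival(c):
        return 1.0 if c < 0 else math.exp(-rate * c)
    return survival

sx = exp_survival(1.0)   # X ~ Exponential with rate 1 (mean 1)
sy = exp_survival(2.0)   # Y ~ Exponential with rate 2 (mean 1/2)

grid = [k / 10 for k in range(-10, 101)]   # c from -1.0 to 10.0

# Does P(X > c) >= P(Y > c) hold at every grid point, and the reverse?
x_larger = all(sx(c) >= sy(c) for c in grid)
y_larger = all(sy(c) >= sx(c) for c in grid)

print("X stochastically larger than Y on the grid:", x_larger)
print("Y stochastically larger than X on the grid:", y_larger)
```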

Exercise 5.10: Prove/disprove/suitably modify :
F ≽ G iff F(x) ≥ G(x) for all x.

Exercise 5.11: Show that the relation ≽ is a partial order on the set of all distribution functions.

Exercise 5.12: Is it a complete order as well? Why or why not?

In the one-sided version of the KS test we test
H0: F = G Vs H1: F ≽ G (F stochastically larger than G)
or
H0: F = G Vs H1: F ≼ G (F stochastically smaller than G).
For this we use the following test statistics:
D+ = supx (Fn(x)-G(x))
and
D- = supx (G(x)-Fn(x))
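
The two one-sided statistics can be computed from the sorted sample in the same spirit as D. The sketch below assumes a continuous G supplied as a function. It uses the fact that Fn(x) - G(x) is largest at a data point (right after a jump of Fn), while G(x) - Fn(x) is largest just to the left of a data point. The function name one_sided_ks, the sample, and G = Unif(0,1) are illustrative assumptions.

```python
def one_sided_ks(sample, G):
    """Return (D+, D-) for a continuous distribution function G."""
    xs = sorted(sample)
    n = len(xs)
    # Fn equals i/n at the i-th smallest observation ...
    d_plus = max(i / n - G(x) for i, x in enumerate(xs, start=1))
    # ... and (i-1)/n just to the left of it.
    d_minus = max(G(x) - (i - 1) / n for i, x in enumerate(xs, start=1))
    return d_plus, d_minus

def unif_cdf(x):
    """Distribution function of Unif(0,1)."""
    return min(max(x, 0.0), 1.0)

sample = [0.5, 0.6, 0.7, 0.9]   # made-up data, shifted towards 1
d_plus, d_minus = one_sided_ks(sample, unif_cdf)
print("D+ =", d_plus)
print("D- =", d_minus)
```

For this made-up sample the observations sit mostly to the right of what Unif(0,1) would suggest, which shows up as one of the two statistics being much larger than the other.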

Exercise 5.13: We reject H0 for large values of the test statistic. Can you say which test statistic is used for which test?

Exercise 5.14: Show that D+, D- ≥ 0. What is the relation between D, D+ and D-?


© Arnab Chakraborty (2010)