Arnab Chakraborty's classical nonparametric statistics notes

cwave.eu5.org
Also see: http://www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com


Last updated on Fri May 21 11:52:16 IST 2010.


More one-sample procedures

Chi-squared test for independence

Example: Suppose that we want to find out whether academic accomplishment is independent of gender in Kolkata. Let X denote the academic accomplishment level of a randomly selected person, and let Y denote his/her gender. Here X takes the following possible values:
Pre-school, Primary, Secondary, Higher secondary, Bachelor's degree, Master's degree, Above.
We are interested in testing
H₀: X is independent of Y vs. H₁: X is not independent of Y.
To test this we collect a random sample of size 1200 from the population. For each selected person we observe X and Y. The resulting dataset is an example of a cross-classified dataset. Such a dataset may be presented as a two-way frequency table as follows. [The following dataset is hypothetical.]

         Pre-school  Primary  Secondary  Higher secondary  Bachelor's  Master's  Above
Male             20       57        279               166         104        31     10
Female           45       50        168               157          90        16      7

Let

n_ij = the frequency in the (i,j)-th cell.

p_i = P(X = i)

q_j = P(Y = j)

We assume that p_i, q_j > 0 for all i,j.

Exercise 5.1: Find the MLEs of p_i and q_j for all i,j. Use these to estimate P(X=i, Y=j) under H₀. Hence compute
m_ij = estimated E(n_ij) under H₀.

Then the chi-squared test rejects H₀ for large values of the following test statistic

χ² = ∑ ∑ (n_ij - m_ij)²/m_ij
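
As a rough illustration of the statistic above, here is a minimal Python sketch applied to the hypothetical table. It assumes the usual plug-in estimate m_ij = (row total)(column total)/n for the expected counts under independence; the function names expected_counts and chi_squared_statistic are illustrative and not part of the notes.

```python
# Hypothetical education-by-gender counts from the table above
# (rows: Male, Female; columns: Pre-school ... Above).
observed = [
    [20, 57, 279, 166, 104, 31, 10],
    [45, 50, 168, 157, 90, 16, 7],
]

def expected_counts(table):
    """Estimated E(n_ij) under independence (assumed plug-in form):
    row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared_statistic(table):
    """Sum over all cells of (n_ij - m_ij)^2 / m_ij."""
    expected = expected_counts(table)
    return sum(
        (n_ij - m_ij) ** 2 / m_ij
        for obs_row, exp_row in zip(table, expected)
        for n_ij, m_ij in zip(obs_row, exp_row)
    )

m = expected_counts(observed)
stat = chi_squared_statistic(observed)
print(f"chi-squared statistic = {stat:.3f}")
```

The expected counts keep the same row and column totals as the observed table, which is a quick sanity check on any implementation.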

The asymptotic distribution of this under H₀ is a χ² distribution. We shall not prove this here. It follows from the fact that under the regularity conditions the null distribution of

-2*log(likelihood ratio)

converges to a χ² distribution. For classified data the number of degrees of freedom is

number of classes - 1 - number of free parameters estimated.

What is a free parameter? If θ is a parameter, then so is 2θ. However, if you have already estimated θ, then you do not need to estimate 2θ separately. Thus we have only one free parameter here. Similarly, if you are estimating a probability vector (p₁,...,p_k), then you have only k-1 free parameters, since the components add up to one.

Exercise 5.2: What should the degrees of freedom be for the chi-squared test above if there are I rows and J columns in the table?

If n ≥ 30 and n_ij, m_ij ≥ 5 for all i,j, then it is customary to treat the asymptotic distribution as a good approximation to the exact distribution. As before, we reject H₀ for large values of the test statistic.

Exercise 5.3: Carry out the test for the given data. Report the P-value.

Exercise 5.4: The test statistic here is only asymptotically distribution-free under H₀. Give a counterexample to show that its finite-sample distribution is not distribution-free.
[Hint: In fact, the test statistic is not even well defined for every finite sample, because for any finite sample size there is some chance that m_ij is zero for some i,j.]

Exercise 5.5: Typically ISI students either plan to do research or take up a job after M.Stat. Last semester I collected some data on this. [The following data is real. It was collected from the Stat Computing students (MII, 2004-2005).]

             Research  Job  Undecided
B.Stat.             2    1          3
non B.Stat.         7    7          1

I want to use this to test whether the trend differs between B.Stat. and non B.Stat. students in the Statistical Computing class. However, my sample size is small. How should I proceed? You do not need to compute anything numerically. Just suggest a method.
One method is to perform Fisher's exact test, which uses the exact conditional distribution of the table under H₀ given the marginals. This conditional distribution is a multivariate analogue of the hypergeometric distribution. It is obtained by conditioning the multinomial distribution in the same way as the hypergeometric distribution is obtained by conditioning the binomial.
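Another solution is to list all possible tables with the given marginals. Under H₀, conditionally on the marginals, a table with cell counts n_ij has probability (∏ n_i.! ∏ n_.j!)/(n! ∏∏ n_ij!), where n_i. and n_.j are the row and column totals, so the tables are not equally likely in general. Compute the test statistic for each table and build a histogram in which each table is weighted by this probability. This is the exact conditional distribution of the test statistic under H₀. Now locate the test statistic value for the given data in this histogram to find the P-value.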
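
The enumeration idea can be sketched in a few lines of Python for the 2 × 3 table of Exercise 5.5. This is only a sketch under stated assumptions: each table is weighted by its hypergeometric conditional probability given the marginals (the conditional distribution used by Fisher's exact test), the chi-squared statistic is used to order the tables, and the helper names chi_squared and tables_with_same_marginals are made up for illustration.

```python
from math import comb

# Real data from Exercise 5.5 (rows: B.Stat., non B.Stat.;
# columns: Research, Job, Undecided).
observed = [[2, 1, 3], [7, 7, 1]]

def chi_squared(table):
    """Chi-squared statistic with m_ij = row total * column total / n."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, r in enumerate(row_totals):
        for j, c in enumerate(col_totals):
            m_ij = r * c / n
            total += (table[i][j] - m_ij) ** 2 / m_ij
    return total

def tables_with_same_marginals(table):
    """Yield (table, conditional probability) for every 2 x 3 table
    that shares the row and column totals of the given table."""
    r1 = sum(table[0])
    c1, c2, c3 = (sum(c) for c in zip(*table))
    n = c1 + c2 + c3
    for a in range(min(r1, c1) + 1):
        for b in range(min(r1 - a, c2) + 1):
            c = r1 - a - b
            if c > c3:
                continue
            # Hypergeometric weight of this table given the marginals.
            prob = comb(c1, a) * comb(c2, b) * comb(c3, c) / comb(n, r1)
            yield [[a, b, c], [c1 - a, c2 - b, c3 - c]], prob

observed_stat = chi_squared(observed)
null_dist = [(chi_squared(t), p) for t, p in tables_with_same_marginals(observed)]
# P-value: total probability of tables at least as extreme as the data.
p_value = sum(p for s, p in null_dist if s >= observed_stat - 1e-9)
print(f"statistic = {observed_stat:.3f}, exact conditional P-value = {p_value:.4f}")
```

The small tolerance in the comparison guards against floating-point ties between the observed table and tables with the same statistic value.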

Kolmogorov-Smirnov Test

We have data

X₁,...,X_n iid continuous F,

which is unknown. G is some completely specified continuous distribution. We want to test

H₀: F = G vs. H₁: F ≠ G.

The KS test rejects H₀ for large values of the following test statistic

D = sup_x | F_n(x) - G(x)|.

Here F_n(x) is the empirical distribution function. It is defined as

F_n(x) = proportion of X_i's ≤ x.

This is a reasonable thing to do because, for large n, the empirical distribution function is close to the unknown F.
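
To make the definition of D concrete, here is a small Python sketch. It relies on the assumption that G is continuous, so the supremum is reached at, or just to the left of, one of the sorted data points; the names ks_statistic and uniform_cdf, and the three-point sample, are illustrative only.

```python
def ks_statistic(sample, G):
    """Two-sided KS statistic sup_x |F_n(x) - G(x)| for a continuous CDF G."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        g = G(x)
        # F_n jumps from (i-1)/n to i/n at the i-th smallest observation,
        # so compare G with the value of F_n on both sides of the jump.
        d = max(d, i / n - g, g - (i - 1) / n)
    return d

def uniform_cdf(x):
    """CDF of Unif(0,1), used here as the completely specified G."""
    return min(1.0, max(0.0, x))

sample = [0.1, 0.4, 0.7]  # made-up data for illustration
d = ks_statistic(sample, uniform_cdf)
print(f"D = {d:.3f}")
```

Only the n sorted observations need to be inspected, because between two consecutive observations F_n is flat while G is non-decreasing.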

Exercise 5.6: State some theorem that makes this concept precise.
[Hint: There is more than one such theorem that you already know. Pick one. You should be able to state it rigorously.]

Exercise 5.7: Show that D ≥ 1/(2n).

Exercise 5.8: Suppose that
X₁,...,X_n
are iid continuous G, and
U₁,...,U_n
are iid Unif(0,1). Sort the X's as
X_(1) ≤ ... ≤ X_(n)
and the U's as
U_(1) ≤ ... ≤ U_(n)
Show that the joint distribution of the G(X_(i))'s is the same as that of the U_(i)'s.

Exercise 5.9: Write down the joint density of the U_(i)'s defined above for general n. For n=2 use the density to explicitly compute the distribution of D₂.
The joint density of the U_(i)'s is
f(u₁,...,u_n) = n! if 0 ≤ u₁ ≤ ... ≤ u_n ≤ 1, and 0 otherwise.

One-sided KS

The KS test above is a two-sided one. We can also perform one-sided KS tests. For this we need to introduce a partial order on the set of all distributions.

Definition: Let X, Y be two random variables with distribution functions F, G, respectively. If for all c,

P(X > c) ≥ P(Y > c),

then we say that X is stochastically larger than Y, and write

F ≽ G.

Exercise 5.10: Prove/disprove/suitably modify:
F ≽ G iff F(x) ≥ G(x) for all x.

Exercise 5.11: Show that ≽ is a partial order on the set of all distribution functions.

Exercise 5.12: Is it a complete order as well? Why or why not?

In the one-sided version of the KS test we test either

H₀: F = G vs. H₁: F ≽ G

or

H₀: F = G vs. H₁: G ≽ F.

For this we use the following test statistics:

D₊ = sup_x (F_n(x)-G(x))

and

D₋ = sup_x (G(x)-F_n(x))
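
The two one-sided statistics can be computed from the sorted sample in the same way as D. The following Python sketch again assumes a continuous G, so that each supremum is attained at, or just before, an observation; the function name one_sided_ks and the sample values are hypothetical.

```python
def one_sided_ks(sample, G):
    """Return (d_plus, d_minus): sup_x (F_n(x) - G(x)) and
    sup_x (G(x) - F_n(x)), for a continuous CDF G."""
    xs = sorted(sample)
    n = len(xs)
    # F_n equals i/n at the i-th smallest observation ...
    d_plus = max(i / n - G(x) for i, x in enumerate(xs, start=1))
    # ... and equals (i-1)/n just to the left of it.
    d_minus = max(G(x) - (i - 1) / n for i, x in enumerate(xs, start=1))
    return d_plus, d_minus

def uniform_cdf(x):
    """CDF of Unif(0,1), used here as the completely specified G."""
    return min(1.0, max(0.0, x))

d_plus, d_minus = one_sided_ks([0.1, 0.4, 0.7], uniform_cdf)
print(f"d_plus = {d_plus:.3f}, d_minus = {d_minus:.3f}")
```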

Exercise 5.13: We reject H₀ for large values of the test statistic. Can you say which test statistic is used for which test?

Exercise 5.14: Show that D₊, D₋ ≥ 0. What is the relation between D, D₊ and D₋?