Arnab Chakraborty's classical nonparametric statistics notes

cwave.eu5.org
Also see: http://www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

Last updated on Fri May 21 11:52:15 IST 2010.

One sample Problem

Introduction

So far we have seen the paired sample problem. However, it was effectively a one sample problem, since we always worked with the differences Z₁,...,Z_n. We now restate our findings in the one sample problem set up.

In the one sample problem set up we have a single sample

Z₁,...,Z_n

where the Z_i's are independent, but not necessarily identically distributed. Each Z_i is a continuous random variable. They have a common median θ, which is unknown. We want to test

H₀: θ = 0 vs H₁: θ > 0

We have already discussed two test procedures for this.

Sign test: uses only the signs of Z's.
Wilcoxon's signed rank test: uses both the signs of Z's as well as the ranks of the |Z|'s. (We need an extra symmetry assumption for this.)
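
The two statistics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sign statistic is taken to be the number of positive Z's, the signed rank statistic is taken to be the sum of the ranks of |Z_i| over the positive Z_i, and the data contain no ties or zeros (as is the case almost surely for continuous variables). The function names and the sample data are hypothetical.

```python
from math import comb

def sign_statistic(z):
    """Number of strictly positive observations."""
    return sum(1 for v in z if v > 0)

def signed_rank_statistic(z):
    """Sum of the ranks of |z_i| taken over the positive z_i.

    Assumes no ties among the |z_i| and no zeros.
    """
    order = sorted(range(len(z)), key=lambda i: abs(z[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if z[i] > 0)

def sign_test_p_value(z):
    """One-sided p-value P(S >= s) under H0, where S ~ Binomial(n, 1/2)."""
    n, s = len(z), sign_statistic(z)
    return sum(comb(n, j) for j in range(s, n + 1)) / 2 ** n

# Made-up data, for illustration only.
z = [1.2, -0.4, 0.9, 2.1, -0.3, 0.7]
print(sign_statistic(z), signed_rank_statistic(z), sign_test_p_value(z))
```

Large values of either statistic point towards H₁: θ > 0.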

Estimation

Suppose we are interested in estimating θ. The first estimator that comes to mind is the sample median. It is defined as follows. Order the Z's as

Z_(1) ≤ ... ≤ Z_(n).

Then define the sample median as

θ̂ = Z_((n+1)/2)	if n is odd
θ̂ = (Z_(n/2)+Z_(n/2+1))/2	if n is even.
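
The definition translates directly into code. The sketch below (the function name is hypothetical) sorts the sample and picks the middle order statistic, or the average of the two middle ones. The only subtlety is that the order statistics are numbered from 1, while Python lists are indexed from 0.

```python
def sample_median(z):
    """Sample median computed from the order statistics."""
    s = sorted(z)                      # s[0] <= ... <= s[n-1]
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]     # Z_((n+1)/2) in 1-based numbering
    return (s[n // 2 - 1] + s[n // 2]) / 2   # average of Z_(n/2) and Z_(n/2+1)

print(sample_median([3, 1, 2]))
print(sample_median([4, 1, 3, 2]))
```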

Theorem Suppose that Z₁,...,Z_n are iid with some common continuous distribution. Then the sample median θ̂

is median unbiased for θ, i.e.,

P(θ̂ < θ) = P(θ̂ > θ).

In other words, θ is the median of the sample median θ̂.

Proof: We shall only do the proof when n is odd. The case of even n is similar but notationally more involved. Define

U_i = 1 if Z_i > θ
U_i = 0 otherwise.

Then the U_i's are iid Bern(1/2). The U-vector is distributed uniformly over the set A, which consists of all 2ⁿ possible vectors of 0's and 1's.

Exercise 4.1: Suppose that n=2k+1. Let B ⊆ A consist of all the vectors with at least k+1 0's. Similarly define C ⊆ A as the set of all vectors with at least k+1 1's. Show that
A = B ∪ C and B ∩ C = ∅.
Next, show that
P(U-vector in B) = P(U-vector in C).
Hence conclude the theorem for odd n.

For even n, you first have to work with n=2. Then for n=2k, split A into 3 parts: the vectors with at least k+1 1's, those with at least k+1 0's, and those with exactly k 0's and k 1's. The first two parts may be dealt with as in the odd case. The last part needs to be split further into two parts using the n=2 argument. We shall not go into the details in this course.

Theorem (Bahadur's representation) Let Z₁,...,Z_n be iid with some common continuous density, f. Let θ be its median. Assume that f(θ) > 0. Let θ̂

denote the sample median based on the sample. Then

θ̂ = θ + (0.5-F_n(θ))/f(θ) + R_n,

for some R_n where

sqrt(n)*R_n goes to zero (in probability) as n goes to infinity.

Here F_n denotes the empirical distribution function of the Z_i's.

Proof: Not to be done in this course.

Bahadur's representation has more than one form depending on the assumptions made on the distribution of Z's and the order of R_n.
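
To get a feel for the remainder term, one may compute it for simulated data. The sketch below is only an illustration under stated assumptions: the Z_i's are drawn from N(0,1), so that θ = 0 and f(θ) = 1/sqrt(2π), and F_n(θ) is taken to be the proportion of observations that are ≤ θ. The function name, the seed and the sample sizes are arbitrary choices, not part of the theorem.

```python
import math
import random

def bahadur_remainder(n, seed=0):
    """R_n = sample median - [theta + (0.5 - F_n(theta)) / f(theta)] for N(0,1) data."""
    rng = random.Random(seed)
    z = sorted(rng.gauss(0.0, 1.0) for _ in range(n))
    theta = 0.0                                # true median of N(0,1)
    f_theta = 1.0 / math.sqrt(2.0 * math.pi)   # N(0,1) density at 0
    if n % 2 == 1:
        median = z[n // 2]
    else:
        median = (z[n // 2 - 1] + z[n // 2]) / 2
    F_n = sum(1 for v in z if v <= theta) / n  # empirical distribution function at theta
    linear_part = theta + (0.5 - F_n) / f_theta
    return median - linear_part

for n in (101, 1001, 10001):
    print(n, math.sqrt(n) * bahadur_remainder(n))
```

A single run proves nothing, of course, but repeating it with different seeds shows how small sqrt(n)*R_n typically is compared with sqrt(n)*(sample median - θ).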

Exercise 4.2: Use the above theorem to show that the sample median has an asymptotic normal distribution. Find the mean and variance of this normal distribution.

Hodges-Lehmann approach

Here is another method of estimation. It is more general in the sense that we do not require the Z_i's to be identically distributed. This approach, called the Hodges-Lehmann (HL) approach, is as follows.

Once again consider Wilcoxon's signed rank test. Let us denote its test statistic by

T(Z₁,...,Z_n)

Define the function

f(x) = T(Z₁-x,...,Z_n-x)

Note that f(θ) has mean n(n+1)/4. We can interpret this n(n+1)/4 as the ideal value for f(θ). The HL approach suggests estimating θ using an estimator θ̂_HL such that f(θ̂_HL) is as close to n(n+1)/4 as possible.

Exercise 4.3: Show that f(x) = #{(Z_i+Z_j)/2 > x : i ≤ j }.
Let h_i = Z_i-x. Define
a_ik = I{|h_i| ≥ |h_k|} I{h_i > 0}
Show that f(x) = ∑_{i,k} a_ik. Check that, for i ≠ k,
{a_ik+a_ki = 1} iff {(h_i+h_k) > 0}
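
Before attempting the proof, the identity in Exercise 4.3 can be checked numerically. The sketch below assumes that T is the sum of the ranks of the absolute values over the positive observations, and that there are no ties. The function names, the seed and the simulated data are hypothetical choices made only for this check.

```python
import random

def signed_rank_statistic(z):
    """Sum of the ranks of |z_i| over the positive z_i (assumes no ties)."""
    order = sorted(range(len(z)), key=lambda i: abs(z[i]))
    return sum(rank for rank, i in enumerate(order, start=1) if z[i] > 0)

def walsh_count(z, x):
    """Number of pairs i <= j with (z_i + z_j)/2 > x."""
    n = len(z)
    return sum(1 for i in range(n) for j in range(i, n) if (z[i] + z[j]) / 2 > x)

rng = random.Random(1)
z = [rng.gauss(0.5, 1.0) for _ in range(15)]
for x in (-1.0, 0.0, 0.3, 2.0):
    shifted = [v - x for v in z]       # the data Z_1 - x, ..., Z_n - x
    print(x, signed_rank_statistic(shifted), walsh_count(z, x))
```

The two columns printed for each x should agree.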

Exercise 4.4: Show that θ̂_HL is the median of the set {(Z_i+Z_j)/2 : i ≤ j} above.

Exercise 4.5: For symmetric distributions θ̂_HL is claimed to outperform the sample median. Do a simple simulation to check this as follows. Generate 1000 samples each of size 100 from N(0,1). Compute the sample median as well as θ̂_HL for each of the 1000 samples. Estimate the bias, variance and MSE of both estimators, and compare. Also, plot the two histograms.
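
One possible skeleton for the simulation in Exercise 4.5 is sketched below. It uses the result of Exercise 4.4, computing the HL estimate as the median of the averages (Z_i+Z_j)/2 for i ≤ j. The function names and the seed are hypothetical, the demonstration run uses smaller sizes than the exercise asks for (pass n_samples=1000 and n=100 for the actual exercise), and the histograms are left out.

```python
import random
import statistics

def hl_estimate(z):
    """Median of the averages (z_i + z_j)/2 over all pairs i <= j."""
    n = len(z)
    walsh = [(z[i] + z[j]) / 2 for i in range(n) for j in range(i, n)]
    return statistics.median(walsh)

def summarize(estimates, theta):
    """Estimated bias, variance and MSE of an estimator of theta."""
    bias = statistics.fmean(estimates) - theta
    var = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
    return {"bias": bias, "var": var, "mse": mse}

def simulate(n_samples, n, seed=0):
    """Compare the sample median and the HL estimate on N(0,1) samples."""
    rng = random.Random(seed)
    medians, hls = [], []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        medians.append(statistics.median(z))
        hls.append(hl_estimate(z))
    return {"median": summarize(medians, 0.0), "hl": summarize(hls, 0.0)}

# Small demonstration run; the exercise asks for n_samples=1000, n=100.
result = simulate(n_samples=200, n=30)
for name, stats in result.items():
    print(name, stats)
```

Since the variance is computed with divisor n_samples, the reported MSE equals variance plus squared bias, which gives a quick sanity check on the output.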

The same approach may also be used to obtain confidence intervals. For this, arrange the m = n(n+1)/2 numbers

(Z_i+Z_j)/2 for i ≤ j

in increasing order as

A₁ ≤ ... ≤ A_m.

Then

f(x) = #{j : A_j > x}

Exercise 4.6: Show that f(θ) is distribution-free, i.e., the distribution of f(θ) does not depend on the distribution of the Z's (not even on the value of θ).

So we can find an integer k such that

P(k ≤ f(θ) ≤ m-k) = 0.95.

[Since f(θ) is a discrete random variable, the equality may not be exactly achievable.]

Exercise 4.7: Show that for this k we have
P(A_k ≤ θ ≤ A_(m-k+1)) = 0.95.

{f(θ) ≥ k} iff #{A_i > θ} ≥ k. In this case, A_(m-k+1) is guaranteed to be above θ.
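
The construction of the interval can be sketched as follows. Two assumptions are made here that are not spelt out in the text: the null distribution of the signed rank statistic is computed exactly, by counting the subsets of {1,...,n} having a given sum, and, since 0.95 may not be exactly achievable, k is taken to be the largest integer for which the probability is at least 0.95. All function names and the sample data are hypothetical.

```python
def signed_rank_null_pmf(n):
    """Exact pmf of the signed rank statistic when the signs are fair coin tosses."""
    m = n * (n + 1) // 2
    counts = [0] * (m + 1)      # counts[t] = number of subsets of {1..n} with sum t
    counts[0] = 1
    for r in range(1, n + 1):
        for t in range(m, r - 1, -1):
            counts[t] += counts[t - r]
    return [c / 2 ** n for c in counts]

def hl_confidence_interval(z, level=0.95):
    """Return (A_k, A_(m-k+1), k, coverage) for the largest k with coverage >= level."""
    n = len(z)
    m = n * (n + 1) // 2
    pmf = signed_rank_null_pmf(n)
    walsh = sorted((z[i] + z[j]) / 2 for i in range(n) for j in range(i, n))
    best = None
    for k in range(1, m // 2 + 1):
        coverage = sum(pmf[k:m - k + 1])      # P(k <= f(theta) <= m-k)
        if coverage < level:
            break                             # coverage only decreases as k grows
        best = (k, coverage)
    if best is None:
        raise ValueError("sample too small for the requested level")
    k, coverage = best
    return walsh[k - 1], walsh[m - k], k, coverage   # A_k and A_(m-k+1), 1-based

# Made-up data, for illustration only.
z = [1.2, -0.4, 0.9, 2.1, -0.3, 0.7]
print(hl_confidence_interval(z))
```

For very small n even k=1 does not reach the 0.95 level, in which case the function raises an error instead of returning an interval.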

Exercise 4.8: Apply the HL approach to the sign test to get another estimator of θ.
[Hint: Define f(x) as the sign test statistic computed based on data shifted by x. Then choose θ̂ so that f(θ̂) is as close to Ef(θ) as possible.]