Arnab Chakraborty's classical nonparametric statistics notes

cwave.eu5.org
Also see: http://www.angelfire.com/dragon/letstry
cwave04 at yahoo dot com

Last updated on Fri May 21 11:52:13 IST 2010.

Classical nonparametric statistics

Introduction

Any statistical inference problem has the following basic structure.

We have some random data having joint distribution F which is not entirely known. We want to make inference about the unknown aspects of F based on observed data. The inference is typically either an estimation problem or a testing problem.

The difference between nonparametric and parametric problems has to do with how much we already assume known about F.

Example: (Parametric problem) Suppose X₁,...,X_n are iid N(μ,σ²). We want to estimate μ and σ². This is a typical problem from parametric statistics. Here we assume that the distribution of the X's is completely known except for two unknown numbers, μ and σ².
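To make the parametric example concrete, here is a minimal sketch in Python. The true values (μ = 10, σ = 2), the sample size and the seed are illustrative assumptions, not part of the notes; the estimators used are the usual sample mean and the sample variance with divisor n-1.

```python
import random
import statistics

# Assumed, purely illustrative values for the two unknown parameters.
MU_TRUE, SIGMA_TRUE = 10.0, 2.0

def estimate_normal_parameters(x):
    """Return (mu_hat, sigma2_hat) for data assumed iid N(mu, sigma^2)."""
    mu_hat = statistics.fmean(x)         # sample mean
    sigma2_hat = statistics.variance(x)  # sample variance, divisor n-1
    return mu_hat, sigma2_hat

random.seed(0)  # fixed seed so the sketch is reproducible
x = [random.gauss(MU_TRUE, SIGMA_TRUE) for _ in range(500)]
mu_hat, sigma2_hat = estimate_normal_parameters(x)
print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
```

Everything about the distribution is pinned down once these two numbers are estimated, which is exactly what makes the problem parametric.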

Definition If the distribution of the data is completely known except for finitely many unknown numbers, then the problem is called a parametric problem. Otherwise, we have a nonparametric problem. In a parametric situation each of the finitely many unknown numbers is called a parameter.

Example: (Nonparametric problem) Suppose that we are testing the efficacy of a sleeping pill. Let X₁,...,X_n be the amounts of sleep of n patients before taking the pill, and let Y₁,...,Y_n be the corresponding amounts after taking it. We want to test whether the pill really increases one's amount of sleep. Assuming that the patients behave independently, we may reasonably assume that
(X₁,Y₁),...,(X_n,Y_n)
are independent, but not necessarily identically distributed. We model the effect of the drug as follows. There is an unknown number θ denoting the median increase of sleep, i.e., each
Z_i = Y_i-X_i
has θ as its median. Note that we are not assuming that the Z's all have the same distribution. We are merely assuming that they have a common median. We want to test
H₀: θ = 0 vs H₁: θ > 0

In this example we have not assumed any knowledge about the underlying distribution except for the existence of a common median θ for the Z's. Thus our ignorance cannot be summed up as finitely many unknown numbers. Hence this is a nonparametric statistical inference problem.

Example: Suppose we have X₁,...,X_n iid with a continuous density f, which is unknown. We want to estimate f.

Exercise 0.1: Why did we need the continuity assumption on f?

Exercise 0.2: Think of a nonparametric inference situation in regression.
In the model Y = α + βX + ε, we may have ε's iid with some unknown distribution F. Or, we may have the model Y = f(X) + ε, where f itself is some unknown continuous function.

Semiparametric problems

Some people like to call the sleeping pill example a semiparametric problem, because here we are interested in only one unknown number, θ, though θ is not the only unknown quantity.

Distribution-free techniques

At the heart of any nonparametric statistical inference problem sits a distribution-free technique.

Example: (Sign test) We are continuing with the sleeping pill example. Assume that the Z_i's are continuous random variables. Let the observed Z_i's be
-2.3, 3.9, 2.5, -2.1, -3.4, 1.4, 2.4, 1.9
Consider the signs
-1,+1,+1,-1,-1,+1,+1,+1
Count the +1's: T = 5. Reject H₀ for "large" T. How large is "large"? To answer this we need to know the distribution of the test statistic T under H₀. It is Bin(8,0.5), because under H₀ each Z_i is positive with probability 1/2, independently of the others. If n is large, T is approximately N(n/2, n/4) under H₀.
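The calculation for the observed data can be sketched in Python as follows. The function names are hypothetical, and the continuity correction used in the normal approximation is an added assumption (the notes only state the N(n/2, n/4) limit). Ties at zero are ignored, which is harmless since the Z_i's are assumed continuous.

```python
from math import comb, erf, sqrt

def sign_test_pvalue(z):
    """Return (T, exact one-sided p-value P(T' >= T)) with T' ~ Bin(n, 1/2)."""
    n = len(z)
    t_obs = sum(1 for v in z if v > 0)  # number of +1 signs
    p_exact = sum(comb(n, k) for k in range(t_obs, n + 1)) / 2 ** n
    return t_obs, p_exact

def normal_approx_pvalue(t_obs, n):
    """Upper-tail p-value from N(n/2, n/4), with a continuity correction (assumption)."""
    zscore = (t_obs - 0.5 - n / 2) / sqrt(n / 4)
    return 0.5 * (1.0 - erf(zscore / sqrt(2.0)))  # 1 - Phi(zscore)

z = [-2.3, 3.9, 2.5, -2.1, -3.4, 1.4, 2.4, 1.9]
t_obs, p_exact = sign_test_pvalue(z)
p_normal = normal_approx_pvalue(t_obs, len(z))
print(t_obs, p_exact, p_normal)
```

For these data T = 5 and the exact p-value is 93/256, about 0.363, so T = 5 is not "large" at any of the usual levels. With n = 8 the normal approximation is only rough; it is meant for large n.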

In this example T is called a distribution-free test statistic under H₀. Here H₀ is a composite hypothesis. In fact it is infinite dimensional in the sense that a null distribution cannot be specified completely by specifying just a finite collection of numbers. But still T has a distribution that is free of F. Classical nonparametric inference proceeds by cleverly constructing such distribution-free statistics. Thus classical nonparametric statistical inference is more or less a list of such statistics. However, not all problems have such a handy distribution-free statistic. For these problems one uses computation-intensive modern nonparametric inference, which we shall learn about.
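The distribution-free property can be checked empirically. The sketch below simulates T for n = 8 under three different continuous distributions with median 0 and compares the simulated frequencies with the Bin(8,0.5) probabilities. The choice of distributions, the number of replications and the seed are illustrative assumptions, and the function names are hypothetical.

```python
import math
import random

N = 8  # sample size, as in the sign test example

def simulate_T(sampler, n=N, reps=20000, seed=1):
    """Simulated distribution of T = number of positive draws among n."""
    rng = random.Random(seed)
    counts = [0] * (n + 1)
    for _ in range(reps):
        t = sum(1 for _ in range(n) if sampler(rng) > 0)
        counts[t] += 1
    return [c / reps for c in counts]

# Three continuous distributions, each with median 0 (illustrative choices).
samplers = {
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "cauchy": lambda rng: math.tan(math.pi * (rng.random() - 0.5)),
    "shifted exponential": lambda rng: rng.expovariate(1.0) - math.log(2.0),
}

binom_pmf = [math.comb(N, k) / 2 ** N for k in range(N + 1)]
results = {name: simulate_T(s) for name, s in samplers.items()}

for name, freq in results.items():
    worst = max(abs(f - p) for f, p in zip(freq, binom_pmf))
    print(f"{name}: largest deviation from Bin(8,0.5) = {worst:.4f}")
```

The three distributions differ greatly in shape and tail behaviour, yet the simulated distribution of T should agree with Bin(8,0.5) up to simulation error in each case.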

Exercise 0.3: Compute the power of the sign test for the alternative θ = 5, assuming that the Z_i's are iid with some unknown, common, continuous distribution. Sample size, n=1000.
[Hint: Can you do it if the Z_i's are iid N(5,1)?]