Any statistical inference problem has the following basic structure.
We have some random data with joint distribution F, which is not
entirely known. We want to make inferences about the unknown aspects of F
based on the observed data. The inference problem is typically either an
estimation problem or a testing problem.
The difference between nonparametric and parametric problems has to do
with how much we already assume known about F.
Example:
(Parametric problem)
Suppose X1,...,Xn are iid
N(μ,σ²). We want to estimate μ
and σ². This is a typical problem from parametric statistics.
Here we
assume that the distribution of the X's is completely known except for
two unknown numbers, μ and σ².
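To make the parametric setting concrete, here is a minimal sketch in Python. The data are simulated, and the "true" values mu_true and sigma_true are hypothetical numbers chosen only for illustration; they do not come from any real data set. The sketch uses the sample mean and the sample variance (divisor n-1) as estimates, which is one common choice among several.

```python
import random
import statistics

# Hypothetical "true" parameters, chosen only for this illustration.
mu_true, sigma_true = 2.0, 3.0

# Simulate n iid N(mu, sigma^2) observations with a fixed seed.
rng = random.Random(0)
n = 10000
x = [rng.gauss(mu_true, sigma_true) for _ in range(n)]

# Estimate the two unknown numbers from the data.
mu_hat = statistics.fmean(x)         # sample mean, estimates mu
sigma2_hat = statistics.variance(x)  # sample variance (divisor n-1), estimates sigma^2

print("estimate of mu      :", mu_hat)
print("estimate of sigma^2 :", sigma2_hat)
```

Once these two numbers are estimated, the whole distribution is estimated. This is exactly what fails in the nonparametric problems discussed next.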
Example:
(Nonparametric problem)
Suppose that we are testing the efficacy of a sleeping pill.
Let X1,...,Xn
be the amount of sleep of n patients before taking the pill, and let
Y1,...,Yn be the corresponding amounts after taking
it. We want to test whether
the pill really increases one's amount of sleep. Assuming that the
patients behave independently, we may reasonably assume that
(X1,Y1),...,(Xn,Yn)
are independent, but not
necessarily identically
distributed. We model the effect of the drug as follows. There is an
unknown number θ denoting the median increase in sleep, i.e., each
Zi = Yi - Xi
has θ as its median. Note that we are not assuming that the
Z's all have
the same distribution. We are merely assuming that they have a common
median. We want to test
H0: θ = 0 vs. H1: θ > 0
In this example we have not assumed any knowledge about the underlying
distribution except for the existence of a common median
θ for the
Z's. Thus our ignorance cannot be summed up as finitely many unknown
numbers. Hence this is a nonparametric statistical inference problem.
Example:
Suppose X1,...,Xn are
iid with continuous density f, which is unknown. We
want to estimate f.
Exercise 0.1:
Why did we need the continuity assumption on f?
Exercise 0.2:
Think of a nonparametric inference situation in regression.
In the model Y = α + βX + ε, we may have the ε's
iid with some unknown distribution F. Or, we may have the model
Y=f(X)+ε, where f itself is some unknown continuous function.
Semiparametric problems
Some people like to call the sleeping pill example a
semiparametric
problem, because there we are interested in only one unknown number,
θ, though θ is not the only unknown quantity.
Distribution-free techniques
At the heart of any nonparametric statistical inference
problem sits a distribution-free technique.
Example:
(Sign test) We continue with the sleeping pill
example. Assume that the Zi's are continuous random variables, so that
no Zi is exactly zero (with probability one).
Let the observed Zi's be
-2.3, 3.9, 2.5, -2.1, -3.4, 1.4, 2.4, 1.9
Consider the signs
-1,+1,+1,-1,-1,+1,+1,+1
Count the +1's: T = 5. Reject H0 for "large" T. How large is
"large"? To answer this
we need to know the distribution of the test statistic T under
H0. Under H0 each Zi is positive with probability 1/2, independently of
the others, so T is
Bin(8,0.5). If n is large, T is approximately N(n/2, n/4) under H0.
In this example T is called a distribution-free test statistic
under H0. Here
H0
is a composite hypothesis. In fact it is infinite dimensional in the sense
that a null distribution cannot be specified completely by specifying just
a finite collection of numbers. But still T has a distribution that is
free of F. Classical nonparametric inference proceeds by cleverly
constructing such distribution-free statistics. Thus classical
nonparametric statistical inference is more or less a list of such
statistics. However, not all problems have such a handy distribution-free
statistic. For these problems one uses computation-intensive modern
nonparametric inference, which we shall learn about.
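The sign test computation can be sketched in Python as follows. The function names (sign_test_pvalue, normal_approx_pvalue, simulate_tail) are our own and purely illustrative. The first part computes the exact one-sided p-value P(T >= 5) from Bin(8,0.5), together with the normal approximation (without continuity correction, which is rough for n = 8). The second part is a small simulation, under assumed null distributions of our choosing (normal, Cauchy and uniform, all with median 0 and not identical across patients), to illustrate that the distribution of T does not depend on F.

```python
import math
import random

def sign_test_pvalue(z):
    """Exact one-sided p-value P(T >= t_obs), with T ~ Bin(n, 1/2) under H0."""
    n = len(z)
    t_obs = sum(1 for v in z if v > 0)  # number of +1 signs
    tail = sum(math.comb(n, k) for k in range(t_obs, n + 1))
    return t_obs, tail / 2 ** n

def normal_approx_pvalue(t_obs, n):
    """Large-n approximation T ~ N(n/2, n/4), no continuity correction."""
    zscore = (t_obs - n / 2) / math.sqrt(n / 4)
    return 0.5 * math.erfc(zscore / math.sqrt(2))  # upper tail of N(0,1)

def simulate_tail(samplers, t_obs, reps=20000, seed=0):
    """Monte Carlo estimate of P(T >= t_obs); each Zi has its own sampler."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        t = sum(1 for draw in samplers if draw(rng) > 0)
        if t >= t_obs:
            hits += 1
    return hits / reps

# Observed differences Zi from the example.
z = [-2.3, 3.9, 2.5, -2.1, -3.4, 1.4, 2.4, 1.9]
t_obs, p_exact = sign_test_pvalue(z)
p_normal = normal_approx_pvalue(t_obs, len(z))
print("T =", t_obs)                      # T = 5
print("exact p-value =", p_exact)        # 93/256 = 0.36328125
print("normal approximation =", p_normal)

# Assumed null distributions, all continuous with median 0 (illustration only).
normal = lambda rng: rng.gauss(0.0, 1.0)
cauchy = lambda rng: math.tan(math.pi * (rng.random() - 0.5))
uniform = lambda rng: rng.uniform(-1.0, 1.0)
mixed = [normal, cauchy, uniform, normal, cauchy, uniform, normal, cauchy]
p_sim = simulate_tail(mixed, t_obs)
print("simulated P(T >= 5) with non-identical Zi's =", p_sim)
```

The simulated tail probability should agree with the exact binomial value up to Monte Carlo error, even though the eight Zi's do not share a common distribution. With a p-value of about 0.36, the observed T = 5 gives no real evidence against H0.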
Exercise 0.3:
Compute the power of the sign test for the alternative
θ = 5, assuming
that the Zi's are iid with some unknown, common, continuous
distribution. The sample size is n = 1000. [Hint: Can you do it if the Zi's are iid N(5,1)?]