$\newcommand{\k}[1]{\chi^2_{(#1)}}
\newcommand{\v}{\vec}
\newcommand{\h}{\hat}
\newcommand{\hv}[1]{\hat{\vec#1}}$
So far we have not explicitly made any assumptions about the behaviour
of the error. Our approach has been informal, and based on common
sense. But even this informal approach has secretly relied on
some assumptions, as the following example shows.
EXAMPLE: We consider the simplest example of measuring the same
length repeatedly. Suppose that the first 10 measurements are
taken by a precise instrument, and the remaining 10 by a
less precise instrument. Now taking the simple average does not seem
the best thing to do. We feel that we should give more weight to
the precise measurements.
Here are the assumptions that are commonly made: the errors have
mean 0, have the same (finite) variance, and are uncorrelated among themselves. This is called
the Gauss-Markov set up.
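In matrix form, writing the linear model as $\v y = X\v\beta + \v\epsilon$ with $\v y\in{\mathbb R}^n$ (here $I$ denotes the $n\times n$ identity matrix and $\sigma^2$ the common error variance), these three assumptions can be summarised as follows.
$$
E(\v\epsilon) = \v 0,\qquad V(\v\epsilon) = \sigma^2 I,\qquad\text{so that}\qquad E(\v y) = X\v\beta,\qquad V(\v y) = \sigma^2 I.
$$
The zero off-diagonal entries of $V(\v\epsilon)$ encode uncorrelatedness, and the equal diagonal entries encode the common variance.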
We shall investigate the properties of the common sense method
under this model, i.e., find expectation and variance-covariance
matrix for the least squares estimators. The first hurdle that we
encounter is that least squares estimators may not be
unique. There are two ways to tackle this problem:
The first way (taken by most foreign authors) is to assume
that we have dropped redundant columns from the design matrix,
so that $X$ is now full column rank (leading to some
particular choice of least squares estimators). These authors
freely use the expression $(X'X)^{-1},$ assuming $X$
is full column rank.
The second way (the one we shall take) is to notice that
certain properties of the least squares estimators are invariant
in spite of the nonuniqueness of the least squares estimators.
The first approach is mathematically easier and is like an
engineering approach: get the thing done, and that's it! We start
by dropping redundant columns in some arbitrary way, and do not
care how that arbitrariness affects our final result. The
second approach takes care of that.
Here the plane represents $\col(X).$ The vector $\hv y$
is the (orthogonal) projection of $\v y$
onto $\col(X).$ Let's understand what this means.
$\v y\in{\mathbb R}^n$ and $\col(X)$ is a subspace
in ${\mathbb R}^n.$ Now, $\col(X)$ has an orthogonal complement
inside ${\mathbb R}^n.$ We denote it by $\col(X)^\perp.$
(Basically $\col(X)^\perp$ is just the set of all vectors
orthogonal to $\col(X).$ This set also happens to be a
subspace.)
Then any vector $\v x\in{\mathbb R}^n$ can be uniquely split up into
two parts:
$$
\v x = \v x_1 + \v x_2,
$$
where $\v x_1\in\col(X)$ and $\v x_2\in\col(X)^\perp.$
The map $\v x\mapsto \v x_1$ is called the orthogonal
projection onto $\col(X)$.
We like to consider it as a map from ${\mathbb R}^n$ to ${\mathbb R}^n$
(the second ${\mathbb R}^n$ is just the codomain, the range being
$\col(X)$).
Think of $\v x_1$ as the shadow of $\v x$ on the
screen $\col(X)$ under light shining orthogonally on the screen.
Clearly, the projection map is
linear. (Not clear?)
If the object $\v x$ is scaled, the shadow also scales by
the same factor. The shadow of a parallelogram is again a parallelogram.
So we may represent it by a matrix $P_X,$ say, which
is $n\times n,$ since the map is from ${\mathbb R}^n$ to ${\mathbb R}^n.$
With this notation we may write $\hv y = P_X\v y.$
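As a concrete instance (a sketch under the extra assumption, not made above, that $X$ has full column rank so that $X'X$ is invertible), the projection matrix has the well-known closed form
$$
P_X = X(X'X)^{-1}X'.
$$
For the simplest case where $X = \v 1$ is the single column of $n$ ones, this gives
$$
P_X = \frac1n \v 1\v 1',\qquad \hv y = P_X\v y = \b y_.\,\v 1,
$$
so every fitted value is the sample mean $\b y_.$. One can check directly that this $P_X$ is symmetric and satisfies $P_X^2=P_X,$ as any orthogonal projection matrix must.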
You can create infinitely many more by taking different solutions
of the normal equations. Here is a different kind of least square
estimator, made by "mixing" based on $y:$
$$\begin{eqnarray*}
\h \mu & = & \left\{\begin{array}{ll}0&\text{if }y_{11}>0\\\b y_{..}&\text{otherwise.}\end{array}\right.\\
\h \alpha_1 & = & \left\{\begin{array}{ll}\b y_{1.}&\text{if }y_{11}>0\\\b y_{1.}-\b y_{..}&\text{otherwise.}\end{array}\right.\\
\h \alpha_2 & = & \left\{\begin{array}{ll}\b y_{2.}&\text{if }y_{11}>0\\\b y_{2.}-\b y_{..}&\text{otherwise.}\end{array}\right.
\end{eqnarray*}$$
Let this example convince you that a general least square
estimator $\hv \beta $ is of the form $\hv \beta =
\hv\beta_* + v(\v y)$ where $\hv \beta_*$ is any particular
least square estimator and $v:{\mathbb R}^n\rightarrow\nul(X'X)$ is any function.
This rather complicated form prevents us from
finding $E(\hv\beta).$ If you do not see why, just try to
solve the following exercise.
EXERCISE: Again consider the 2-way ANOVA model from the last
example. Find $E(\h \mu)$ for the "mixed" version of $\h
\mu $ given there.
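The reason the perturbation in a general least square estimator must take values in $\nul(X'X)$ can be spelled out from the normal equations $X'X\hv\beta = X'\v y$ (the standard characterisation of least square estimators, assumed here from the earlier discussion). If $\hv\beta_*$ is one solution and $\v v$ is any vector, then
$$
X'X(\hv\beta_* + \v v) = X'\v y + X'X\v v,
$$
so $\hv\beta_*+\v v$ is again a solution exactly when $X'X\v v = \v 0,$ i.e., when $\v v\in\nul(X'X).$ Nothing stops $\v v$ from depending on $\v y,$ and that dependence is what makes $E(\hv\beta)$ hard to compute in general.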
However, there are situations where this arbitrariness of the
choice of least square estimator can do us no harm. The main idea
behind all such results is that $\hv y = X\hv \beta $ is
always the same. It is the projection of $\v y$
onto $\col(X).$ So if we have something involving an
arbitrary least square estimator $\hv \beta,$ we try to see
if we can express that in terms of $\hv y.$ If we can, we are
saved: the arbitrariness cannot harm us anymore.
Keep this in mind as you solve the exercise below.
EXERCISE: Same example continued. For all three choices of $\hv
\beta$ find $\h \mu + \h \alpha_1.$ Are you getting
different answers? Try some other least square estimators to see
if you get different values.
Again the "tweak without letting off the alarm" game helps to
understand this. When we move from one least square estimator to
another, we are never allowing the alarm to go off, i.e., the
"watched" quantities remain the same.
The following theorem makes this intuition precise.
Proof:
Let $\hv \beta_*$ be some particular least square estimator. Then
the set of all least square estimators is $\hv \beta_* + \nul(X'X).$
So the required condition is satisfied if and only if $\v \ell\in (\nul(X'X))^\perp = \row(X'X) = \row(X).$
[QED]
We could have also used the "$\hv y$ is invariant" idea for
the if part:
If part: Let $\v\ell\in\row(X).$ Then $\v\ell' = \v
b' X$ for some $\v b.$
So $\v\ell'\hv \beta = \v b' X\hv \beta = \v b' \hv y.$
Since $\hv y$ does not depend on the choice of $\hv \beta,$
neither does $\v\ell'\hv \beta.$
Unfortunately, the only-if part can't be tackled like this.
Notice that this theorem makes no use of the Gauss-Markov set
up. It is a pure linear algebraic fact. Interestingly enough, the
condition $\v\ell\in\row(X)$ also crops up in the context of
the Gauss-Markov set up. To see this we start with a definition.
The next theorem is where the condition $\v\ell\in\row(X)$
makes its second appearance.
Proof: Only if part: Let $\v \ell' \v \beta$ be estimable. Then
there is some $\v b\in{\mathbb R}^n$ such that $E(\v b' \v y) = \v \ell'
\v\beta $ for all values of $\v\beta.$
Since $E(\v y) = X\v\beta,$ this says $\v b'X \v \beta = \v \ell' \v \beta $ for all $\v \beta.$
This means $\v \ell' = \v b'X.$ Hence $\v \ell\in\row(X),$ as
required.
If part: Let $\v\ell\in\row(X).$ Then $\v\ell' =
\v b'X$ for some $\v b\in{\mathbb R}^n.$
Consider the estimator $\v b' \v y.$ Its expectation is
$$
E(\v b' \v y) = \v b'X \v\beta = \v\ell' \v\beta,
$$
as required.
[QED]
The two theorems together show that
estimability of $\v\ell' \v\beta $ is equivalent to uniqueness
of $\v\ell' \hv\beta$ (both the conditions being equivalent
to the common condition $\v\ell\in\row(X)$).
Clearly, finding $\v\ell$'s in $\row(X)$ is of great
importance. This motivates the following definition.
Warning: This is not the "standard definition". The
"standard definition" is "$\v\ell'\v \beta$, where the
components of $\v\ell$ add up to 0."
Examples are $\alpha_1-\alpha_2$
and $\alpha_1-2\alpha_2+\alpha_3.$ However, nobody would
consider a "contrast" like $\mu-\alpha_1$ that compares
"different types" of parameters. Looking at the
various usages, it seems to me that the definition I gave is what
everybody uses behind the scenes, with cases
like $\alpha_1-\alpha_2$
or $\alpha_1-2\alpha_2+\alpha_3$ being the most frequently
used contrasts.
While row-echelon forms and other heavyweight tools
from linear algebra might help in general, often you can pick
such $\v\ell$'s by our familiar "tweak without letting off the
alarm" game.
Try to tweak the components of $\v\beta$ without
changing $X \v\beta.$ The things that do not change indicate
the $\v\ell$'s. The following example illustrates this.
EXAMPLE: Same example continued. We shall show by the tweaking game
that $\mu $ is not estimable.
SOLUTION:
Add 1 to $\mu $, and adjust by subtracting 1 from
the $\alpha_i$'s. Clearly, the distribution of the data does
not change. So there is no way you can meaningfully estimate $\mu$ from
the data.
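The same tweak can be written out in symbols. Assume the model $y_{ij}=\mu+\alpha_i+\epsilon_{ij}$ with two groups ($i=1,2$), so that $\v\beta=(\mu,\alpha_1,\alpha_2)'$ and each row of $X$ is either $(1,1,0)$ or $(1,0,1).$ The tweak moves $\v\beta$ along the direction $\v t=(1,-1,-1)'$ (the symbol $\v t$ is introduced here only for illustration):
$$
X\v t = \v 0,\qquad\text{while}\qquad \v\ell'(\v\beta+\v t) - \v\ell'\v\beta = \ell_1-\ell_2-\ell_3.
$$
So a parametric function $\v\ell'\v\beta$ survives this tweak exactly when $\ell_1=\ell_2+\ell_3.$ This holds for $\mu+\alpha_1,$ $\mu+\alpha_2$ and $\alpha_1-\alpha_2,$ but fails for $\mu$ (where $\v\ell=(1,0,0)'$) and for $\alpha_1$ alone.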
Proof:
Here $\v\ell' = \v b' X$
for some $\v b.$
Hence $E(\v\ell' \hv \beta) = E(\v b'X\hv \beta) = E(\v b'\hv
y) = \v b'E(\hv y) = \v b' X \v \beta =
\v\ell' \v\beta,$ as required.
[QED]
Proof: Step 1: Shall show unbiasedness.
Since $\v\ell'\v\beta$ is estimable, $\v\ell \in\row( X).$
Now $\row(X) = \row(X'X).$ So $\v\ell' = \v b'X'X$ for some $\v b\in{\mathbb R}^p.$
Hence $E(\v\ell'\hv \beta) = E(\v b'X'X\hv \beta) = \v b'E(X'\v y)
= \v b'X'X \v\beta = \v\ell' \v\beta,$ as required.
Step 2: Shall show that for any unbiased $\v c'\v y$ we
have $V(\v c'\v y) \ge V(\v\ell' \hv \beta).$
$$\begin{eqnarray*}
V(\v c'\v y)
& = & V(\v c'\v y - \v\ell'\hv \beta+\v\ell'\hv \beta)\\
& = & V(\v c'\v y - \v\ell'\hv \beta)+V(\v\ell'\hv \beta) + 2\,\mathrm{cov}(\v c'\v y - \v\ell'\hv \beta,\v\ell'\hv \beta).
\end{eqnarray*}$$
Enough to show that the covariance vanishes.
Now $\v\ell'\hv \beta = \v b' X'X\hv \beta = \v b'X'\v y.$ So, using $V(\v y)=\sigma^2 I,$ the covariance is
$$
\mathrm{cov}(\v c'\v y - \v b'X'\v y,\v b'X'\v y) = \sigma^2 (\v c'-\v b'X')X\v b
= \sigma^2 (\v c'X - \v b'X'X) \v b.
$$
Now since $\v c'\v y$ and $\v\ell'\hv \beta = \v b'X'\v y$ are both
unbiased for $\v\ell'\v\beta,$ we have $E(\v c'\v y) = E(\v b'X'\v y)$, i.e., $\v c'X \v\beta =
\v b'X'X \v\beta.$
Since this holds for all values of $\v\beta,$ we get $\v c'X = \v b'X'X.$
Hence the covariance is zero, as required.
Step 3: Shall show uniqueness.
From above we have $V(\v c'\v y) = V(\v\ell'\hv \beta)+
V(\v c'\v y-\v\ell'\hv \beta).$ Hence
if $V(\v c'\v y) = V(\v\ell'\hv \beta),$ then we see
that $V(\v c'\v y-\v\ell'\hv \beta)=0.$ Hence $\v c'\v y-\v\ell'\hv \beta =
0$ with probability 1 (its mean is 0, since both estimators are unbiased).
[QED]
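To see the result at work in the simplest case, take the measurement model $y_i=\mu+\epsilon_i,$ $i=1,...,n,$ under the Gauss-Markov set up (this choice of model is only an illustration). A linear estimator $\v c'\v y=\sum c_iy_i$ is unbiased for $\mu$ exactly when $\sum c_i=1,$ and for any such $\v c$
$$
V(\v c'\v y) = \sigma^2\sum c_i^2 = \sigma^2\left[\frac1n + \sum\left(c_i-\frac1n\right)^2\right] \ge \frac{\sigma^2}{n} = V(\b y_.).
$$
Equality holds only when every $c_i=\frac1n,$ so the least square estimator $\b y_.$ is the unique minimum variance linear unbiased estimator of $\mu,$ mirroring Steps 2 and 3 of the proof.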
EXAMPLE: We consider the 1-way ANOVA model once again:
$$
y_{ij} = \mu + \alpha_i + \epsilon_{ij},
$$
for $i=1,2$ and $j=1,2,3.$ We have seen that the BLUE
for $\mu + \alpha_i$ is $\b y_{i.}.$ Its variance
is $\frac{\sigma^2}{3}.$ Also the covariance between $\b y_{1.}$ and $\b y_{2.}$ is 0.
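These facts are enough to handle any linear combination. For instance (a worked illustration, using that the BLUE of an estimable $\v\ell'\v\beta$ is $\v\ell'\hv\beta$), the BLUE of $\alpha_1-\alpha_2=(\mu+\alpha_1)-(\mu+\alpha_2)$ is $\b y_{1.}-\b y_{2.},$ and since the two group means are uncorrelated its variance is
$$
V(\b y_{1.}-\b y_{2.}) = V(\b y_{1.}) + V(\b y_{2.}) - 2\,\mathrm{cov}(\b y_{1.},\b y_{2.}) = \frac{\sigma^2}{3}+\frac{\sigma^2}{3}-0 = \frac{2\sigma^2}{3}.
$$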
Proof:
$L'\hv \beta = B'(X'X)\hv \beta = B'X'\v y.$
Its variance-covariance matrix is
$$
B'X' (\sigma^2 I) X B = \sigma^2 B'X'X B,
$$
as required.
[QED]
In the special case where $X$ is full column rank, we
have $V(\hv \beta) = \sigma^2 (X'X) ^{-1}.$
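This follows in one line from the explicit formula $\hv\beta=(X'X)^{-1}X'\v y,$ valid when $X$ has full column rank, together with $V(\v y)=\sigma^2 I$ from the Gauss-Markov set up and the rule $V(A\v y)=A\,V(\v y)\,A'$ for a constant matrix $A$:
$$
V(\hv\beta) = (X'X)^{-1}X'\,(\sigma^2 I)\,X(X'X)^{-1} = \sigma^2 (X'X)^{-1}.
$$
Equivalently, it is the expression $\sigma^2 B'X'XB$ from the proof above, read with $L=I$ and $B=(X'X)^{-1}.$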
One can guess that the residual $\v y-\hv y$ should help us to
estimate $\sigma^2.$ The following example is our first
attempt to turn this guess into an estimator.
EXAMPLE: Consider the measurement model
$$
y_i = \mu + \epsilon_i,
$$
where $\epsilon_i$'s are uncorrelated with zero mean and
variance $\sigma^2 < \infty.$
Here we know that an unbiased estimator for $\sigma^2 $ is
$$
\h \sigma^2 = \frac{1}{n-1} \sum (y_i-\b y_.)^2.
$$
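The divisor $n-1$ can be verified directly using only the stated assumptions. Writing $\b\epsilon_.$ for the average of the errors (a symbol introduced here for this check), we have $y_i-\b y_.=\epsilon_i-\b\epsilon_.,$ and so
$$\begin{eqnarray*}
E\sum (y_i-\b y_.)^2 & = & E\sum \epsilon_i^2 - n E(\b\epsilon_.^2)\\
& = & n\sigma^2 - n\cdot\frac{\sigma^2}{n} = (n-1)\sigma^2.
\end{eqnarray*}$$
The second line uses $V(\b\epsilon_.)=\sigma^2/n,$ which needs the errors to be uncorrelated but not independent or Gaussian.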
Here the denominator $n-1$ may be naively thought of
as $n-$number of parameters. The following example sharpens
this naive understanding.
EXAMPLE: Consider the measurement model
$$
y_i = \mu_1+\mu_2 + \epsilon_i,
$$
where $\epsilon_i$'s are uncorrelated with zero mean and
variance $\sigma^2 < \infty.$
Actually this is the same model as before. So still an unbiased estimator for $\sigma^2 $ is
$$
\h \sigma^2 = \frac{1}{n-1} \sum (y_i-\b y_.)^2.
$$
So the denominator is more correctly thought of
as $n-$number of estimable parameters. Still this is not
perfect because if $\mu_1+\mu_2$ is estimable, so
is $2(\mu_1+\mu_2).$ Hence we sharpen this further
to $n-$number of linearly independent estimable parameters.
A smarter formulation is as follows.
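One standard way to make the count precise (stated here as a sketch of a well-known result; $r$ is notation introduced for this note) is to take $r=\mathrm{rank}(X),$ the number of linearly independent estimable parametric functions. With $P_X$ the orthogonal projection matrix onto $\col(X),$ the residual is $\v y-\hv y=(I-P_X)\v y,$ and under the Gauss-Markov set up
$$
E\|\v y-\hv y\|^2 = \sigma^2\,\mathrm{tr}(I-P_X) = \sigma^2 (n-r),
\qquad\text{so}\qquad
\h\sigma^2 = \frac{\|\v y-\hv y\|^2}{n-r}
$$
is unbiased for $\sigma^2$ whenever $r<n.$ In both measurement models above $r=1,$ which recovers the divisor $n-1.$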
Consider the 2-way layout without interaction: $y_{ij} =
\mu+\alpha_i+\beta_j+\epsilon_{ij},$ where $i=1,...,I$
and $j=1,...,J.$ Assume the Gauss-Markov set up. Find an
unbiased estimator of $\sigma^2.$
Suppose that you are asked to repeat the above exercise using the model with interaction: $y_{ij} =
\mu+\alpha_i+\beta_j+\gamma_{ij}+\epsilon_{ij},$ where $i=1,...,I$
and $j=1,...,J.$ Do you suspect a problem? Actually try to
solve the problem to confirm your suspicion (if any).
The Gauss-Markov theorem makes assumptions on only the first two
moments of the error, and concludes that the least square estimator
of any estimable parametric function is its BLUE (i.e., minimum
variance among all linear unbiased estimators). Show that
if we also assume a Gaussian distribution for the errors, the least square
estimator of an estimable parametric function is its UMVUE
(i.e., minimum variance among all unbiased estimators).
Let $\v\ell\in{\mathbb R}^n.$ Call $\v\ell'\v y$ a Linear
Zero Function (LZF) if $\forall \v \beta ~~E(\v\ell'\v y) =
0.$ Show that this happens
iff $\v\ell\in\col(X)^\perp.$
Continuation of the above exercise. Show that $\v\ell'\v
y$ is an LZF iff there is a vector $\v v$ such
that $X'\v v = \v 0$ and $\v\ell'\v y = \v v'\v y$
with probability 1.
Why are LZF's useful? We know that "ideally" they should be
0. So their deviation from 0 gives us an idea
about $\sigma^2.$ It is easy to see that the set of all
LZF's is a vector space. The larger it is (i.e., the bigger its
dimension), the more reliable we should expect $\h \sigma^2$ to
be. Come up with a mathematical result that captures this idea.
This exercise revisits the very first problem encountered in
this page: measurements by instruments of different precision
levels. A fixed unknown length $\ell$ is measured 10 times
independently by a precise instrument, and then again 10 more
times independently by a less precise instrument. The model is
$$
y_i = \ell + \epsilon_i,
$$
where $\epsilon_i$'s are uncorrelated, and have mean 0. Also
$$
V(\epsilon_i) = \left\{\begin{array}{ll}\sigma^2 &\text{if }i=1,...,10\\2 \sigma^2 &\text{if }i=11,...,20\\ \end{array}\right.
$$
for some unknown $\sigma^2>0.$
Reduce this to a Gauss-Markov set up, and
estimate $\ell$ and $\sigma^2.$
Let $\Sigma $ be any known PD matrix (i.e., $\Sigma =
S S'$ for some nonsingular matrix $S$). Consider the
linear model $\v y = X \v\beta + \v \epsilon, $
where $E(\v \epsilon)=\v 0$ and $V(\v \epsilon)=\sigma^2
\Sigma$ for some unknown $\sigma^2>0.$ Reduce this to a Gauss-Markov set up, and
estimate $\v \beta$ and $\sigma^2.$