$\newcommand{\k}[1]{\chi^2_{(#1)}} \newcommand{\v}{\vec} \newcommand{\h}{\hat} \newcommand{\hv}[1]{\hat{\vec#1}}$ So far we have not explicitly put any assumption on the behaviour of the error. Our approach has been informal, and based on common sense. But even this informal approach has secretly relied on some assumptions. The following example shows this.

EXAMPLE: We consider the simplest example of measuring the same length repeatedly. Suppose that the first 10 measurements are taken by a precise instrument, and the remaining 10 by a less precise instrument. Now taking the simple average does not seem the best thing to do. We feel that we should give more weight to the precise measurements.

Gauss-Markov set up

Here are the assumptions that are commonly made: the errors have mean 0, have the same (finite) variance, and are uncorrelated among themselves. This is called the Gauss-Markov set up.
Gauss-Markov set up $\v y = X \v \beta + \v \epsilon, $ where $E(\v \epsilon)=\v 0$ and $V(\v \epsilon)=\sigma^2 I.$
We shall investigate the properties of the common sense method under this model, i.e., find expectation and variance-covariance matrix for the least squares estimators. The first hurdle that we encounter is that least squares estimators may not be unique. There are two ways to tackle this problem: The first approach is mathematically easier and is like an engineering approach: get the thing done, and that's it! We start by dropping redundant columns in some arbitrary way, and do not care how that arbitrariness affects our final result. The second approach takes care of that.

Using projection

Remember the following picture:
Here the plane represents $\col(X).$ The vector $\hv y$ is the (orthogonal) projection of $\v y$ onto $\col(X).$ Let's understand what this means.

$\v y\in{\mathbb R}^n$ and $\col(X)$ is a subspace in ${\mathbb R}^n.$ Now, $\col(X)$ has an orthogonal complement inside ${\mathbb R}^n.$ We denote it by $\col(X)^\perp.$ (Basically $\col(X)^\perp$ is just the set of all vectors orthogonal to $\col(X).$ This set also happens to be a subspace.)

Then any vector $\v x\in{\mathbb R}^n$ can be uniquely split up into two parts: $$ \v x = \v x_1 + \v x_2, $$ where $\v x_1\in\col(X)$ and $\v x_2\in\col(X)^\perp.$ The map $\v x\mapsto \v x_1$ is called the orthogonal projection onto $\col(X)$.

We like to consider it as a map from ${\mathbb R}^n$ to ${\mathbb R}^n$ (the second ${\mathbb R}^n$ is just the codomain, the range being $\col(X)$).

Think of $\v x_1$ as the shadow of $\v x$ on the screen $\col(X)$ under light shining orthogonally on the screen.

Clearly, the projection map is linear. (Not clear?) So we may represent it by a matrix $P_X,$ say, which is $n\times n,$ since the map is from ${\mathbb R}^n$ to ${\mathbb R}^n.$

With this notation we may write $\hv y = P_X\v y.$

Some properties of $P_X$

We shall rarely need to write the explicit form of $P_X.$ All that we shall need are the following properties:
  1. $P_X$ is symmetric and idempotent.
  2. $\col(P_X) = \col(X).$
  3. If $\v x\in\col(X),$ then $P_X \v x = \v x.$ In particular, $P_X X = X.$
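These properties can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using the ANOVA design matrix that appears later on this page. It builds the projector as $X X^+$ (with $X^+$ the Moore-Penrose pseudoinverse), which is one convenient construction that also works when $X$ is rank deficient. The data vector `y` is made up for illustration.

```python
import numpy as np

# Design matrix of the ANOVA example on this page (columns: mu, alpha_1, alpha_2).
X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)

# Orthogonal projector onto col(X), built as X X^+ (an assumption of this sketch;
# the page itself never needs an explicit formula for P_X).
P = X @ np.linalg.pinv(X)

is_symmetric = np.allclose(P, P.T)            # property 1
is_idempotent = np.allclose(P @ P, P)         # property 1
same_column_space = (                         # property 2: col(P_X) = col(X)
    np.linalg.matrix_rank(P)
    == np.linalg.matrix_rank(X)
    == np.linalg.matrix_rank(np.hstack([P, X]))
)
fixes_X = np.allclose(P @ X, X)               # property 3: P_X X = X

y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0])  # made-up data
y_hat = P @ y                                 # fitted values: each group mean repeated

print(is_symmetric, is_idempotent, same_column_space, fixes_X)
print(y_hat)
```

Here `y_hat` consists of the two group means repeated three times each, which is what projecting onto $\col(X)$ amounts to for this design.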
From these we can easily derive expressions for $E(\hv y)$ and $V(\hv y).$
Theorem Under the Gauss-Markov set up $E(\hv y) = X \v \beta.$

Proof: $E(\hv y) = E(P_X\v y) = P_X E(\v y) = P_X X \v \beta = X\v \beta.$ [QED]

EXERCISE: Derive an expression for $V(\hv y)$ under the Gauss-Markov set up.

Non-uniqueness of least square estimators

EXAMPLE: Consider the 1-way ANOVA model: $$ \left[\begin{array}{ccccccccccc}y_{11}\\y_{12}\\y_{13}\\y_{21}\\y_{22}\\y_{23} \end{array}\right] = \left[\begin{array}{ccccccccccc} 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 1 & 0\\ 1 & 0 & 1\\ 1 & 0 & 1\\ 1 & 0 & 1 \end{array}\right]\left[\begin{array}{ccccccccccc}\mu\\\alpha_1\\\alpha_2 \end{array}\right] + \left[\begin{array}{ccccccccccc}\epsilon_{11}\\\epsilon_{12}\\\epsilon_{13}\\\epsilon_{21}\\\epsilon_{22}\\\epsilon_{23} \end{array}\right]. $$ Here is one least square estimator of $\v\beta = (\mu,\alpha_1,\alpha_2)'.$

$\h \mu = 0,$ $\h \alpha_1 = \b y_{1.},$ $\h \alpha_2 = \b y_{2.}.$
Here is another:
$\h \mu = \b y_{..},$ $\h \alpha_1 = \b y_{1.}-\b y_{..},$ $\h \alpha_2 = \b y_{2.}-\b y_{..}.$
You can create infinitely many more by taking different solutions of the normal equations. Here is a different kind of least square estimator, made by "mixing" based on $y:$ $$\begin{eqnarray*} \h \mu & = & \left\{\begin{array}{ll}0&\text{if }y_{11}>0\\\b y_{..}&\text{otherwise.}\end{array}\right.\\ \h \alpha_1 & = & \left\{\begin{array}{ll}\b y_{1.}&\text{if }y_{11}>0\\\b y_{1.}-\b y_{..}&\text{otherwise.}\end{array}\right.\\ \h \alpha_2 & = & \left\{\begin{array}{ll}\b y_{2.}&\text{if }y_{11}>0\\\b y_{2.}-\b y_{..}&\text{otherwise.}\end{array}\right. \end{eqnarray*}$$

Let this example convince you that a general least square estimator $\hv \beta $ is of the form $\hv \beta = \hv \beta_* + v(\v y)$ where $\hv \beta_*$ is any particular least square estimator and $v:{\mathbb R}^n\rightarrow\nul(X'X).$
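A small numerical check of this claim, as a sketch assuming NumPy and a made-up data vector: the first two estimators from the example both solve the normal equations, differ by a vector in $\nul(X'X),$ and give the same fitted values.

```python
import numpy as np

X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0])   # made-up data

m1, m2, m = y[:3].mean(), y[3:].mean(), y.mean()
beta_a = np.array([0.0, m1, m2])               # first estimator in the example
beta_b = np.array([m, m1 - m, m2 - m])         # second estimator in the example

def solves_normal_equations(beta):
    """True if X'X beta = X'y, i.e. beta is a least square estimate."""
    return np.allclose(X.T @ X @ beta, X.T @ y)

diff = beta_b - beta_a                          # should lie in nul(X'X)
diff_in_null_space = np.allclose(X.T @ X @ diff, 0.0)
same_fit = np.allclose(X @ beta_a, X @ beta_b)  # fitted values agree

print(solves_normal_equations(beta_a), solves_normal_equations(beta_b))
print(diff, diff_in_null_space, same_fit)
```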

This rather complicated form prevents us from finding $E(\hv\beta).$ If you do not see why, just try to solve the following exercise.

EXERCISE: Again consider the 1-way ANOVA model from the last example. Find $E(\h \mu)$ for the "mixed" version of $\h \mu $ given there.

However, there are situations where this arbitrariness of the choice of least square estimator can do us no harm. The main idea behind all such results is that $\hv y = X\hv \beta $ is always the same. It is the projection of $\v y$ onto $\col(X).$ So if we have something involving an arbitrary least square estimator $\hv \beta,$ we try to see if we can express it in terms of $\hv y.$ If we can, we are saved: the arbitrariness cannot harm us anymore.

Keep this in mind as you solve the exercise below.

EXERCISE: Same example continued. For all the three choices of $\hv \beta$ find $\h \mu + \h \alpha_1.$ Are you getting different answers? Try some other least square estimators to see if you get different values.

Again the "tweak without letting off the alarm" game helps to understand this. When we move from one least square estimator to another, we are never allowing the alarm to go off, i.e., the "watched" quantities remain the same.

The following theorem makes this intuition precise.
Theorem $\v \ell' \hv\beta $ will be the same for all choices of the least square estimator if and only if $\v \ell\in\row(X).$

Proof: Let $\hv \beta_*$ be some particular least square estimator. Then the set of all least square estimators is $\hv \beta_* + \nul(X'X).$

So the required condition is satisfied if and only if $\v \ell\in (\nul(X'X))^\perp = \row(X'X) = \row(X).$ [QED]

We could have also used the "$\hv y$ is invariant" idea for the if part:
If part: Let $\v\ell\in\row(X).$ Then $\v\ell' = \v b' X$ for some $\v b.$

So $\v\ell'\hv \beta = \v b' X\hv \beta = \v b' \hv y.$ Since $\hv y$ is invariant under choice of $\hv \beta,$ hence done.
Unfortunately, the only-if part can't be tackled like this.

Notice that this theorem makes no use of the Gauss-Markov set up. It is a pure linear algebraic fact. Interestingly enough, the condition $\v\ell\in\row(X)$ also crops up in the context of the Gauss-Markov set up. To see this we start with a definition.
Definition: Estimable Let $\v \ell\in{\mathbb R}^p.$ We say that $\v \ell' \v \beta$ is (linearly unbiasedly) estimable if there is some fixed $\v b\in{\mathbb R}^n$ (possibly depending on $\v \ell$) such that $E(\v b' \v y) = \v \ell' \v \beta $ for all possible values of $\v \beta.$
The next theorem is where the condition $\v\ell\in\row(X)$ makes its second appearance.
Theorem Gauss-Markov set up. $\v \ell' \v \beta$ is estimable iff $\v \ell\in\row(X).$

Proof: Only if part: Let $\v \ell' \v \beta$ be estimable. Then there is some $\v b\in{\mathbb R}^n$ such that $E(\v b' \v y) = \v \ell' \v\beta $ for all values of $\v\beta.$

So $\v b'X \v \beta = \v \ell' \v \beta $ for all $\v \beta.$

This means $\v \ell' = \v b'X.$ Hence $\v \ell\in\row(X),$ as required.

If part: Let $\v\ell\in\row(X).$ Then $\v\ell' = \v b'X$ for some $\v b\in{\mathbb R}^n.$

Consider the estimator $\v b' \v y.$ Its expectation is $$ E(\v b' \v y) = \v b'X \v\beta = \v\ell' \v\beta, $$ as required. [QED]

The two theorems together show that estimability of $\v\ell' \v\beta $ is equivalent to uniqueness of $\v\ell' \hv\beta$ (both the conditions being equivalent to the common condition $\v\ell\in\row(X)$).
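Both conditions can be explored numerically. The sketch below assumes NumPy; the helper `in_row_space` is a hypothetical name introduced here, and the data vector is made up. It tests $\v\ell\in\row(X)$ by checking that appending $\v\ell'$ as an extra row does not increase the rank, and then compares $\v\ell'\hv\beta$ across two least square estimators of the ANOVA example.

```python
import numpy as np

X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0])   # made-up data

def in_row_space(ell, X):
    """Rank test: ell is in row(X) iff stacking ell' under X keeps the rank."""
    return bool(np.linalg.matrix_rank(np.vstack([X, ell])) == np.linalg.matrix_rank(X))

# Two least square estimators from the ANOVA example.
m1, m2, m = y[:3].mean(), y[3:].mean(), y.mean()
beta_a = np.array([0.0, m1, m2])
beta_b = np.array([m, m1 - m, m2 - m])

candidates = {
    "mu + alpha_1": np.array([1.0, 1.0, 0.0]),
    "alpha_1 - alpha_2": np.array([0.0, 1.0, -1.0]),
    "mu": np.array([1.0, 0.0, 0.0]),
}
results = {
    name: (in_row_space(ell, X), float(ell @ beta_a), float(ell @ beta_b))
    for name, ell in candidates.items()
}
for name, (estimable, val_a, val_b) in results.items():
    print(name, estimable, val_a, val_b)
```

For the first two choices of $\v\ell$ the two estimators agree, while for $\mu$ alone they give different values, in line with the two theorems.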

Clearly, finding $\v\ell$'s in $\row(X)$ is of great importance. This motivates the following definition.
Definition: Contrast Linear model $\v y = X \v \beta + \v \epsilon.$ By a contrast we understand $\v\ell'\v \beta $ for some $\v\ell\in\row(X).$
Warning: This is not the "standard definition". The "standard definition" is "$\v\ell'\v \beta$, where the components of $\v\ell$ add up to 0." Examples are $\alpha_1-\alpha_2$ and $\alpha_1-2\alpha_2+\alpha_3.$ However, nobody would consider a "contrast" like $\mu-\alpha_1$ that compares "different types" of parameters. Looking at the various usages, it seems to me that the definition I gave is what everybody uses behind the scenes, with cases like $\alpha_1-\alpha_2$ or $\alpha_1-2\alpha_2+\alpha_3$ being the most frequently used contrasts.

While row-echelon forms and other heavy weight tools from linear algebra might help in general, often you can pick such $\v\ell$'s by our familiar "tweak without letting off the alarm" game. Try to tweak the components of $\v\beta$ without changing $X \v\beta.$ The things that do not change indicate the $\v\ell$'s. The following example illustrates this.

EXAMPLE: Same example continued. We shall show by the tweaking game that $\mu $ is not estimable.

SOLUTION: Add 1 to $\mu $, and adjust by subtracting 1 from the $\alpha_i$'s. Clearly, $X\v\beta,$ and hence the distribution of the data, does not change. So there is no way you can meaningfully estimate $\mu$ from the data.
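The tweak in this solution can be replayed in a few lines. This is a sketch assuming NumPy, with arbitrary made-up parameter values: the mean vector $X\v\beta$ is unchanged by the tweak, $\mu$ changes, and the "watched" quantity $\mu+\alpha_1$ does not.

```python
import numpy as np

X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)

beta = np.array([5.0, 1.0, -2.0])       # arbitrary made-up (mu, alpha_1, alpha_2)
tweak = np.array([1.0, -1.0, -1.0])     # add 1 to mu, subtract 1 from each alpha_i
beta_tweaked = beta + tweak

same_mean = np.allclose(X @ beta, X @ beta_tweaked)   # X beta does not notice the tweak
mu_changed = bool(beta[0] != beta_tweaked[0])         # mu itself did change

ell = np.array([1.0, 1.0, 0.0])                       # mu + alpha_1
watched_unchanged = bool(np.isclose(ell @ beta, ell @ beta_tweaked))

print(same_mean, mu_changed, watched_unchanged)
```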

Theorem Gauss-Markov set up. Let $\v\ell\in\row(X),$ so that $\v\ell'\v \beta $ is estimable. Let $\hv \beta$ be any least square estimator. Then $\v\ell'\hv \beta$ (which is the same for all choices of the least square estimator) is unbiased for $\v\ell' \v\beta.$

Proof: Here $\v\ell' = \v b' X$ for some $\v b.$

Hence $E(\v\ell' \hv \beta) = E(\v b'X\hv \beta) = E(\v b'\hv y) = \v b'E(\hv y) = \v b' X \v \beta = \v\ell' \v\beta,$ as required. [QED]

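Definition: BLUE Let $\v\ell' \v\beta $ be estimable. Let $\v b\in{\mathbb R}^n$ be any fixed vector. We say that $\v b'\v y$ is a best linear unbiased estimator (BLUE) for $\v\ell' \v\beta $ if
  1. $E(\v b'\v y) = \v\ell'\v\beta$ for all values of $\v\beta$ (unbiasedness), and
  2. $V(\v b'\v y)\le V(\v c'\v y)$ for every fixed $\v c\in{\mathbb R}^n$ such that $\v c'\v y$ is also unbiased for $\v\ell'\v\beta.$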
Gauss-Markov theorem If $\v\ell' \v\beta $ is estimable then $\v\ell' \hv\beta $ is its BLUE, and is unique with probability 1.

Proof: Step 1: Shall show unbiasedness.

Estimable, hence $\v\ell \in\row( X).$

Now $\row(X) = \row(X'X).$ So $\v\ell' = \v b'X'X$ for some $\v b\in{\mathbb R}^p.$

Hence $E(\v\ell'\hv \beta) = E(\v b'X'X\hv \beta) = \v b'E(X'\v y) = \v b'X'X \v\beta = \v\ell' \v\beta,$ as required. (The second equality uses the normal equations $X'X\hv \beta = X'\v y.$)

Step 2: Shall show that for any unbiased $\v c'\v y$ we have $V(\v c'\v y) \ge V(\v\ell' \hv \beta).$ $$\begin{eqnarray*} V(\v c'\v y) & = & V(\v c'\v y - \v\ell'\hv \beta+\v\ell'\hv \beta)\\ & = & V(\v c'\v y - \v\ell'\hv \beta)+V(\v\ell'\hv \beta) + 2cov(\v c'\v y - \v\ell'\hv \beta,\v\ell'\hv \beta). \end{eqnarray*}$$ Enough to show that the covariance vanishes.

Now $\v\ell'\hv \beta = \v b' X'X\hv \beta = \v b'X'\v y.$ So the covariance is $$ cov(\v c'\v y - \v b'X'\v y,\v b'X'\v y) = \sigma^2 (\v c'-\v b'X')X\v b = \sigma^2 (\v c'X - \v b'X'X) \v b. $$ Now since $\v c'\v y$ and $\v\ell'\hv \beta = \v b'X'\v y$ are both unbiased, hence $E(\v c'\v y) = E(\v b'X'\v y)$, i.e., $\v c'X \v\beta = \v b'X'X \v\beta.$

Since this holds for all values of $\v\beta,$ hence $\v c'X = \v b'X'X.$

Hence the covariance is zero, as required.

Step 3: Shall show uniqueness.

From above we have $V(\v c'\v y) = V(\v\ell'\hv \beta)+ V(\v c'\v y-\v\ell'\hv \beta).$ Hence if $V(\v c'\v y) = V(\v\ell'\hv \beta),$ then we see that $V(\v c'\v y-\v\ell'\hv \beta)=0.$ Hence $\v c'\v y-\v\ell'\hv \beta = 0$ with probability 1 (since both are unbiased). [QED]
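To see the theorem at work on the ANOVA example, here is a sketch assuming NumPy. It compares the BLUE $\b y_{1.}$ of $\mu+\alpha_1$ with a competing linear unbiased estimator, $y_{11}$ alone (a competitor chosen here purely for illustration). Under $V(\v\epsilon)=\sigma^2 I$ we have $V(\v c'\v y)=\sigma^2\v c'\v c,$ so variances are reported in units of $\sigma^2.$

```python
import numpy as np

X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
ell = np.array([1.0, 1.0, 0.0])                      # target: mu + alpha_1

c_blue = np.array([1 / 3, 1 / 3, 1 / 3, 0, 0, 0])    # coefficients of ybar_1.
c_other = np.array([1.0, 0, 0, 0, 0, 0])             # y_11 alone (illustrative competitor)

# Both are unbiased: c'X = ell' (the condition from the estimability theorem).
both_unbiased = np.allclose(c_blue @ X, ell) and np.allclose(c_other @ X, ell)

# Variances divided by sigma^2.
var_blue = c_blue @ c_blue
var_other = c_other @ c_other

# Step 2 of the proof: the cross term vanishes, so the variances add up.
gap = c_other - c_blue
cross_term = gap @ c_blue                            # covariance / sigma^2
var_gap = gap @ gap

print(both_unbiased, var_blue, var_other, cross_term, var_gap)
```

The output shows $V(y_{11}) = V(\b y_{1.}) + V(y_{11}-\b y_{1.}),$ which is the decomposition used in Steps 2 and 3.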

Variance and covariance

EXAMPLE: We consider the 1-way ANOVA model once again: $$ y_{ij} = \mu + \alpha_i + \epsilon_{ij}, $$ for $i=1,2$ and $j=1,2,3.$ We have seen that the BLUE for $\mu + \alpha_i$ is $\b y_{i.}.$ Its variance is $\frac{\sigma^2}{3}.$ Also the covariance between $\b y_{1.}$ and $\b y_{2.}$ is 0, since they are based on disjoint sets of uncorrelated observations.

Theorem Consider the linear model $y = X \v\beta + \epsilon $ under the Gauss-Markov set up. Let $L' \v\beta $ be estimable with $L'=B'(X'X).$ Then the variance-covariance matrix of its BLUE $L' \hv \beta $ is $\sigma^2 B'(X'X)B.$

Proof: $L'\hv \beta = B'(X'X)\hv \beta = B'X'y.$

Its variance-covariance matrix is $$ B'X' (\sigma^2 I) X B = \sigma^2 B'X'X B, $$ as required. [QED]

In the special case where $X$ is full column rank, we have $V(\hv \beta) = \sigma^2 (X'X) ^{-1}.$
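The formula can be checked against the 1-way ANOVA example above. The sketch below assumes NumPy and picks $B=(X'X)^+L$ (pseudoinverse), which is just one valid choice of $B$ satisfying $L'=B'(X'X)$ when the columns of $L$ lie in $\row(X).$ The result is reported in units of $\sigma^2.$

```python
import numpy as np

X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
XtX = X.T @ X

# Columns of L give the estimable functions mu + alpha_1 and mu + alpha_2.
L = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# One choice of B with L' = B'(X'X); using the pseudoinverse is an assumption
# of this sketch, since X'X is singular here.
B = np.linalg.pinv(XtX) @ L
representation_ok = np.allclose(B.T @ XtX, L.T)

cov_over_sigma2 = B.T @ XtX @ B     # variance-covariance matrix divided by sigma^2
print(representation_ok)
print(cov_over_sigma2)
```

The diagonal entries are $1/3$ and the off-diagonal entries vanish, matching the variance $\sigma^2/3$ and zero covariance stated in the example.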

Estimating $\sigma^2 $

One can guess that the residual $y-\h y$ should help us to estimate $\sigma^2.$ The following example is our first attempt to turn this guess into an estimator.

EXAMPLE: Consider the measurement model $$ y_i = \mu + \epsilon_i, $$ where $\epsilon_i$'s are uncorrelated with zero mean and variance $\sigma^2 < \infty.$

Here we know that an unbiased estimator for $\sigma^2 $ is $$ \h \sigma^2 = \frac{1}{n-1} \sum (y_i-\b y_.)^2. $$

Here the denominator $n-1$ may be naively thought of as $n-$number of parameters. The following example sharpens this naive understanding.

EXAMPLE: Consider the measurement model $$ y_i = \mu_1+\mu_2 + \epsilon_i, $$ where $\epsilon_i$'s are uncorrelated with zero mean and variance $\sigma^2 < \infty.$

Actually this is the same model as before. So still an unbiased estimator for $\sigma^2 $ is $$ \h \sigma^2 = \frac{1}{n-1} \sum (y_i-\b y_.)^2. $$

So the denominator is more correctly thought of as $n-$number of estimable parameters. Still this is not perfect because if $\mu_1+\mu_2$ is estimable, so is $2(\mu_1+\mu_2).$ Hence we sharpen this further to $n-$number of independent estimable parameters.
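In matrix terms, the "number of independent estimable parameters" is the rank of the design matrix. A tiny sketch, assuming NumPy and an arbitrary sample size $n=5$: the two measurement models above have different numbers of parameters but the same rank, and hence the same denominator.

```python
import numpy as np

n = 5                                  # arbitrary sample size for illustration
X_one = np.ones((n, 1))                # y_i = mu + eps_i
X_two = np.ones((n, 2))                # y_i = mu_1 + mu_2 + eps_i

rank_one = int(np.linalg.matrix_rank(X_one))
rank_two = int(np.linalg.matrix_rank(X_two))

# Same column space, so the same projection and the same residuals.
same_projection = np.allclose(X_one @ np.linalg.pinv(X_one),
                              X_two @ np.linalg.pinv(X_two))

print(rank_one, rank_two, n - rank_one, n - rank_two, same_projection)
```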

A smarter formulation is as follows.
Theorem In the linear model $y = X \v\beta + \epsilon $ under the Gauss-Markov set up $$ \h \sigma^2 = \frac{\|y-\h y\|^2 }{n-r(X)} $$ is an unbiased estimator of $\sigma^2.$

Proof: Since $y-\h y = (I-P_X)y$ and $I-P_X$ is symmetric and idempotent, $$\begin{eqnarray*} E(\|y-\h y\|^2) & = & E[ (y-\h y)' (y-\h y) ]\\ & = & E[ y'(I-P_X)y ]\\ & = & E[ tr(y'(I-P_X)y) ]\\ & = & E[ tr((I-P_X)yy') ]\\ & = & tr((I-P_X)E(yy')). \end{eqnarray*}$$ Now, the cross terms vanishing because $E(\epsilon)=0,$ $$ E(yy') = E[(X \v\beta + \epsilon)(X \v\beta + \epsilon)'] = X\v\beta\v\beta' X' + E(\epsilon \epsilon') = X\v\beta\v\beta' X' + \sigma^2 I. $$ So, using $(I-P_X)X = O$ and $tr(P_X) = r(P_X) = r(X)$ (the trace of an idempotent matrix equals its rank), $$ E(\|y-\h y\|^2) = tr((I-P_X)(X\v\beta\v\beta' X' + \sigma^2 I)) = \sigma^2 tr(I-P_X) = \sigma^2 (n-r(X)). $$ [QED]
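As a concrete instance, the estimator can be computed for the ANOVA design used earlier. This is a sketch assuming NumPy, a made-up data vector, and the projector built as $XX^+$ (one convenient construction, not something the theorem requires).

```python
import numpy as np

X = np.array([[1, 1, 0]] * 3 + [[1, 0, 1]] * 3, dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0, 6.0, 8.0])    # made-up data

P = X @ np.linalg.pinv(X)                        # projector onto col(X)
n = len(y)
r = int(np.linalg.matrix_rank(X))                # r(X) = 2 here, although X has 3 columns

residual = y - P @ y
rss = float(residual @ residual)                 # ||y - y_hat||^2
sigma2_hat = rss / (n - r)                       # unbiased estimator from the theorem

# The key step of the proof: tr(I - P_X) = n - r(X).
trace_I_minus_P = float(np.trace(np.eye(n) - P))

print(r, rss, sigma2_hat, trace_I_minus_P)
```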

Normal errors

Here we assume that the errors are IID $N(0,\sigma^2).$

MLE

Theorem $\hv \beta $ is MLE of $\v\beta $ if and only if it is a least square estimator. Also MLE of $\sigma^2 $ is $$ \h \sigma^2 = \frac{RSS}{n}. $$ Here $RSS$ means Residual Sum of Squares or $\|\v y-\hv y\|^2.$
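A brief sketch of why this holds, using only the IID $N(0,\sigma^2)$ assumption (the symbol $\mathcal L$ for the log-likelihood is notation introduced here, and we assume $RSS>0$): $$ \mathcal L(\v\beta,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|\v y - X\v\beta\|^2. $$ For every fixed $\sigma^2$ this is maximised over $\v\beta$ exactly when $\|\v y - X\v\beta\|^2$ is minimised, i.e., exactly at the least square estimators, where the minimum value is $RSS.$ Substituting this value and differentiating with respect to $\sigma^2$ gives $$ \frac{\partial}{\partial \sigma^2}\left[-\frac{n}{2}\log(2\pi\sigma^2) - \frac{RSS}{2\sigma^2}\right] = -\frac{n}{2\sigma^2} + \frac{RSS}{2\sigma^4} = 0 \quad\Longleftrightarrow\quad \sigma^2 = \frac{RSS}{n}. $$ Note that the divisor is $n$ here and not $n-r(X),$ so the MLE of $\sigma^2$ is biased whenever $r(X)>0.$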

Exercises

  1. Consider the 2-way layout without interaction: $y_{ij} = \mu+\alpha_i+\beta_j+\epsilon_{ij},$ where $i=1,...,I$ and $j=1,...,J.$ Assume the Gauss-Markov set up. Find an unbiased estimator of $\sigma^2.$
  2. Suppose that you are asked to repeat the above exercise using the model with interaction: $y_{ij} = \mu+\alpha_i+\beta_j+\gamma_{ij}+\epsilon_{ij},$ where $i=1,...,I$ and $j=1,...,J.$ Do you suspect a problem? Actually try to solve the problem to confirm your suspicion (if any).
  3. The Gauss-Markov theorem made assumptions on only the first two moments of the error, and concluded that the least square estimator of any estimable parametric function is its BLUE (i.e., minimum variance among all linear unbiased estimators). Show that if we also assume a Gaussian distribution for the errors, the least square estimator of an estimable parametric function is its UMVUE (i.e., minimum variance among all unbiased estimators).
  4. Let $\v\ell\in{\mathbb R}^n.$ Call $\v\ell'\v y$ a Linear Zero Function (LZF) if $\forall \v \beta ~~E(\v\ell'\v y) = 0.$ Show that this happens iff $\v\ell\in\col(X)^\perp.$
  5. Continuation of the above exercise. Show that $\v\ell'\v y$ is an LZF iff there is a vector $\v v$ such that $X'\v v = \v 0$ and $\v\ell'\v y = \v v'\v y$ with probability 1.
  6. Why are LZF's useful? We know that "ideally" they should be 0. So their deviation from 0 gives us an idea about $\sigma^2.$ It is easy to see that the set of all LZF's is a vector space. The larger it is (i.e., the bigger its dimension) we should expect $\h \sigma^2$ to be more reliable. Come up with a mathematical result that captures this idea.
  7. This exercise revisits the very first problem encountered on this page: measurements by instruments of different precision levels. A fixed unknown length $\ell$ is measured 10 times independently by a precise instrument, and then again 10 more times independently by a less precise instrument. The model is $$ y_i = \ell + \epsilon_i, $$ where $\epsilon_i$'s are uncorrelated, and have mean 0. Also $$ V(\epsilon_i) = \left\{\begin{array}{ll}\sigma^2 &\text{if }i=1,...,10\\2 \sigma^2 &\text{if }i=11,...,20\\ \end{array}\right. $$ for some unknown $\sigma^2>0.$ Reduce this to a Gauss-Markov set up, and estimate $\ell$ and $\sigma^2.$
  8. Let $\Sigma $ be any known PD matrix (i.e., $\Sigma = S S'$ for some nonsingular matrix $S$). Consider the linear model $\v y = X \v\beta + \v \epsilon, $ where $E(\v \epsilon)=\v 0$ and $V(\v \epsilon)=\sigma^2 \Sigma$ for some unknown $\sigma^2>0.$ Reduce this to a Gauss-Markov set up, and estimate $\v \beta$ and $\sigma^2.$
