We learn to solve a linear system of equations during our school days. Here is
a familiar example: $3a+4b=10,$ $4a+b=9$
and $2a+3b=7.$
The solution turns out to be unique: $a=2$ and $b=1.$
Simple! But why would we ever want to solve such a system in real
life? Here is one situation. We are finding the mass of two
different atoms, $A$ and $B.$ We cannot weigh them separately. However, we can
weigh different compounds like $A_3B_4$ and $A_4B$
and $A_2B_3.$ Hence the three equations.
I admit it is a contrived example. But even this example will
show a difficulty that is present all the time in real life. When
we measure anything we make random error. Repeatedly measuring
the same quantity leads to slightly differing results. So we may
get measurements like 9.8, 9.1 and 7.0 in place of the exact
values 10, 9 and 7. So the system is now
$$
3a+4b\approx 9.8\\
4a+b\approx 9.1\\
2a+3b\approx 7.0
$$
Our first impulse may be just to pretend that
the $\approx$'s are $=$'s, solve the system as before,
then report the solution as approximate values for $x$
and $y.$ Just try this out, and you'll see that it's
impossible. The system is inconsistent!
It might seem strange, but a real life measurement scenario
mostly produces inconsistent systems. In mathematics an
inconsistent system is just absurd. But from the view point of
real life applications, we should better call them almost
consistent. It should be possible to get values for the unknowns
such that the LHS is pretty close to the RHS. Possibly, the
simplest such example is when we measure the same
length, $\ell,$ twice
and get two values 10.50 and 10.49. This is actually an
inconsistent system:
$$
\ell = 10.50\\
\ell = 10.49
$$
Mathematically, this is impossible!
However, when we write this more correctly as
$$
\ell \approx 10.50\\
\ell \approx 10.49
$$
there is nothing to make us feel uncomfortable. In fact, we may
even feel happy about the precision of the measurements.
Such an approximate system of linear equations is called
a linear model. The traditional notation is
$$
\v y = X \v \beta + \v \epsilon.
$$
Here $\v y$ is the measurement vector, $X$ is the fixed
known matrix, $\v \beta $ is the vector of unknowns
and $\v \epsilon $ is the random error vector.
We call $\v y$ the response, $X$ the design
matrix, $\v \beta $ the parameter vector.
For example, in our weighing example, we have
$$
\v y = \left[\begin{array}{ccccccccccc}9.8\\9.1\\7.0
\end{array}\right],\quad X = \left[\begin{array}{ccccccccccc}3 & 4\\4 & 1\\2 & 3
\end{array}\right]\text{ and } \beta = \left[\begin{array}{ccccccccccc}a\\b
\end{array}\right].
$$
This may be done by
choosing that value of $\v \beta$ for which $\v y$ is
as close as possible to $X\v \beta.$ In other words,
$$
\hv \beta = \argmin_{\v \beta} \|\v y-X\v \beta\|.
$$
Here $\|\cdots\|$ denotes the Euclidean distance. So we are
using our familiar least squares method. Pictorially, this means
projecting $\v y$ on $\col(X)$ and expressing the
projection ($\hv y,$ the foot of the perpendicular) in
terms of the columns of $X.$
For example, in the weighing case, we are looking for $a,b$
such that $X \beta $ is as close as possible to $y.$
Now
$$
X \beta = \left[\begin{array}{ccccccccccc}3 & 4\\4 & 1\\2 & 3
\end{array}\right]\left[\begin{array}{ccccccccccc}a\\b
\end{array}\right] =
a \left[\begin{array}{ccccccccccc}3\\4\\2
\end{array}\right] + b \left[\begin{array}{ccccccccccc}4\\1\\3
\end{array}\right],
$$
a linear combination of the two columns of $X.$ So our
interest lies in finding the linear combination of the columns of
X that is closest to $\v y.$
Mathematically, this amounts to solving the normal equations
$$
(X'X)\hv \beta = X'\v y.
$$
(Why?)
We want$ (\v y-X\hv \beta)$ to be perpendicular
to $\col(X).$ So we need $X' (\v y-X\hv \beta) =
0.$
Do the normal equations always produce unique
solution? Not necessarily. For example, if the linear model is
$$
2.9 = \beta_1 + \beta_2 + \epsilon_1\\
3.0 = \beta_1 + \beta_2 + \epsilon_2\\
2.5 = \beta_3 + \epsilon_3,
$$
then you can never hope to get $\h\beta_1$ and $\h\beta_2$
uniquely. But $\h\beta_3$ may be found uniquely.
Also, $\h\beta_1+\h\beta_2$ is unique, i.e.,
whatever least squares solutions $\h\beta_1$ and $\h\beta_2$ you take, their sum
will always be the same.
The foot of the perpendicular (from $\v y$
to $\col(X)$) is $\hv y$, and is unique. Since this is
in $\col(X),$ so it can be expressed as a linear combination
of the columns of $X.$ However, there may be many ways to do
so. It will be unique if and only if the columns of $X$ are
all independent.
Thus, the projection of $\v y$ onto $\col(X)$ is
$$\hv y = X\hv \beta = \underbrace{X(X'X)^{-1} X'}_{P_X} \v y.$$
So $P_X = X(X'X)^{-1} X' $ is the (orthogonal) projection
matrix for $\col(X).$
Recall from linear algebra that a real matrix is an orthogonal
projection matrix if and only if it is symmetric and idempotent.
EXERCISE: Quickly check that $P_X$ is indeed symmetric and
orthogonal.
Solve the following approximate system using R:
$$\begin{eqnarray*}
3a + 4b + c & \approx & 3.4\\
3a + 4b + c & \approx & 3.5\\
4a + 3b + 2c & \approx & 10.1\\
4a + 3b + 2c & \approx & 9.8\\
6a + 5b + 2c & \approx & 5.6
\end{eqnarray*}$$
Let's see how R tackles a linear model where the design matrix that is not full
column rank:
$$
X = \left[\begin{array}{ccccccccccc}
1 & 1 & 0\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 0 & 1\\
\end{array}\right],\quad
\v y = \left[\begin{array}{ccccccccccc}3.4\\3.5\\10.5\\10.3
\end{array}\right].
$$
Here the first column of $X$ is a column of $1$'s. So
you may just type the last two columns in R, and omit the -1.
In the problem above R produced one least squares
solution. But we know that there are infinitely many. Write down
two more solutions. Can you write a general form for all least
squares solutions here?
R automatically stores various qunatities computed
by lm. We shall explore some of them here. Let's
work with the linear model from the last exercise. Create the
full design matrix (including its first column) and type:
myfit = lm(y~X-1)
The variable myfit now contains lots of the
information about the fit. You may extract the computed least
squares solution $\hv \beta $ as
myfit$coef
This may be used in future computations. Compute $\h y = X\hv
\beta.$ Remember that %*% is the R notation for
matrix multiplication. This $\h y$ is the foot of the
perpendicular dropped from $\v y$ to $\col(X).$
Usually $\hv y$ is called the fitted vector. R already
computes them:
myfit$fitted
The vector $\v y - \hv y$ is called the residual
vector:
myfit$resid
There are many other pieces of information packed
in myfit:
names(myfit)
The closer $\v y$ is to $\col(X),$ the better is
the fit of the linear model. Here are various ways to measure
this "closeness":
How close $\|\v y-\hv y\|^2$ is to 0.
How close $\frac{\|\hv y\|^2 }{ \|\v y\|^2 }$ is to 1.
Comment why these are not "good" measures of fit. Come up with
pairs of examples where the fit is equally good (according to
commonsense) yet these measures say otherwise.
Consider a linear model $\v y = X \beta +\epsilon,$
where $X$ is not full col rank. Pick any basis
of $\col(X).$ Stack these vectors side by side a columns to
get a matrix $B.$ Let $\v w = B(B'B) ^{-1} B' \v y.$
Show that $\v w = \hv y$ irrespective of the choice
of $B.$
Consider a linear model with design matrix
$$
X = \left[\begin{array}{ccccccccccc}
1 & 1 & 0\\
1 & 1 & 0\\
1 & 0 & 1\\
1 & 0 & 1\\
\end{array}\right].
$$
If $\v \beta = (\beta_1, \beta_2, \beta_3)',$ then show that
whatever least squares solution $\hv \beta $ you take, $\h
\beta_2-\h \beta_3$ is always the same. Characterise all
vectors $\v \ell\in{\mathbb R}^3$ such that $\v \ell' \hv \beta$
does not depend on the choice of the least squares solution.
Generalise the characterisation from the last problem to
arbitrary design matrix.
Redo the above problem with the extra condition: $\beta_0-\alpha_0 = (\beta_1-a_1) x_0.$
Comments
To post an anonymous comment, click on the "Name" field. This
will bring up an option saying "I'd rather post as a guest."