EXAMPLE:
Sometimes we make repeated measurements under the same input
combination to get an idea of the effect of the random errors. If
these repeated measurements are made quickly in succession, then
these measurements may be correlated. A typical model could be
like
$$
y_{ij} = \mu + \alpha_i + \epsilon_{ij},
$$
where $i=1,...,3$ and $j=1,2.$ Here
$\epsilon_{i1}, \epsilon_{i2}$ could have some unknown
correlation $\rho.$ Thus the covariance matrix of $\v
\epsilon $ is $\sigma^2\\Sigma(\rho),$
where $\Sigma(\rho)$ is the correlation matrix of the
following form:
$$
\Sigma(\rho) = \left[\begin{array}{ccccccccccc}
1 & \rho & 0 & 0 & 0& 0\\
\rho & 1 & 0 & 0 & 0& 0\\
0 & 0 & 1 & \rho & 0& 0\\
0 & 0 & \rho & 1 & 0& 0\\
0 & 0 & 0 & 0 & 1 & \rho\\
0 & 0 & 0 & 0 & \rho & 1
\end{array}\right].
$$

gls function of the R package nlme
performs linear model fitting in presence of correlated,
heterscedastic errors. Notice the slight mismatch in
terminology. The standard definition of GLS assumes
a known correlation matrix, while R
function gls allows the correlation matrix to have
unknown parameters. The exact form is
$$
\sigma^2 D \Sigma D,
$$
where $D$ is a diagonal matrix involving some
parameters, $\Sigma$ is a correlation matrix involving a
different set of parameters, and $\sigma^2$ is a scalar parameter.
There are are some standard form of
correlation matrices already built-in:
corAR1 : autoregressive process of order 1.
corARMA : autoregressive moving average process, with arbitrary orders
for the autoregressive and moving average components.
corCAR1 : continuous autoregressive process (AR(1) process for a
continuous time covariate).
corCompSymm : compound symmetry structure corresponding to a constant
correlation.
corExp : exponential spatial correlation.
corGaus : Gaussian spatial correlation.
corLin : linear spatial correlation.
corRatio : Rational quadratics spatial correlation.
corSpher : spherical spatial correlation.
corSymm : general correlation matrix, with no additional structure.
Here is an example taken from the gls documentation.
The data set is Ovary that counts the number of
follicles in the ovaries of 11 mares during a menstrual
cycle. The response variable is a count, but here we treat it as
a continuous variable. The input is time.
gls(follicles ~ sin(2*pi*Time) + cos(2*pi*Time), Ovary,
weights = varPower(), #Specifying the structure of the D matrix
correlation = corAR1(form = ~ 1 | Mare)) #Specifying the structure of the Sigma matrix
EXAMPLE: Suppose our data consist of $X_1,X_2$ IID $N(\mu,1).$ Then we know how to form the likelihood: just use the joint density. However, if you are also given $X_3 = X_1+X_2,$ then the joint distribution of $(X_1,X_2,X_3)$ is singular (it resides in a lower dimensional space in ${\mathbb R}^3$), and hence has no density. How do form the likelihood here?
SOLUTION: Consider the situation intuitively: $X_3$ does not bring any extra information. So just drop it, and stick to $(X_1,X_2)$ as our data. So the likelihood of $(X_1,X_2)$ is still the likelihood of $(X_1,X_2,X_3).$ Of course, we could as well have kept $(X_1,X_3),$ and thrown away $X_2.$ That would also give us a likelihood. A little thinking should convince you that this also leads to the same likelihood. For example, suppose the data are $X_1=3.4,$ $X_2 = 2.3$ and $X_3 = 5.7.$ Clearly, $\{X_1=3.4,\,X_2=2.3\}$ and $\{X_2=2.3,\,X_3=5.7\}$ and $\{X_1=3.4,\,X_3=5.7\}$ are the same event!
So when we are faced with singular data distribution, we throw
away the redundancy (any way is as good as any other way), and
then report the likelihood based on the remaining data.
Now we are in a position to explain REML. We shall start with an
example.
EXAMPLE: Consider $y_1, y_2, y_3$ IID $N(\mu, \sigma^2).$ We shall cast it as a linear model. Then the residuals are $e_i = y_i-\b y$'s. The REML method is to apply ML based on the $e_i$'s. Notice that $(e_1, e_2, e_3)$ has a singular distribution, since $\sum e_i = 0.$ We shall throw away $e_3$. Then the joint distribution of $(e_1,e_2)$ is normal with mean 0 and $$ cov(e_1,e_2) = cov(y_1,y_2) + var(\b y) - cov(y_1,\b y) - cov(y_2,\b y) = = - \frac{\sigma^2}{3}. $$ Also, $V(e_1)= V(e_2) = \frac{2 \sigma^2}{3}.$
Thus the covariance matrix is $\Sigma = \sigma^2\cdot A$, where $A=\frac 13\left[\begin{array}{ccccccccccc}2 & -1\\-1&2 \end{array}\right].$ The log-likelihood is $$ \ell(\sigma^2) = \mbox{constant} - \frac 12\log \sigma^2 - \frac{1}{2\sigma^2} (e_1\quad e_2)A^{-1} \left[\begin{array}{ccccccccccc}e_1\\e_2 \end{array}\right]. $$ A little computation would show that $(e_1\quad e_2)A^{-1} \left[\begin{array}{ccccccccccc}e_1\\e_2 \end{array}\right] = e_1^2+e_2^2+e_3^2.$ So solving $\ell'(\sigma^2) = 0,$ we shall get $$ \h\sigma^2 = \frac{e_1^2+e_2^2+e_3^2}{2}, $$ which is our familiar formula.
Even this simple example was not entirely straightforward,
especially that "little computation" part. The following example
takes a more careful approach so that we can deal with any
linear model.
EXAMPLE: This time we start with a linear model in the general form $$ y = X \beta + \epsilon. $$ Obtain REML of $\sigma^2$ by applying ML on $ e.$
SOLUTION: We know that $ e = (I-P_X) y\in \col(I-P_X).$ We shall remove redundancy from $ e$ cleverly this time. We shall take any ONB of $\col(I-P_X),$ and stack the vectors as columns to create a matrix $B_{n\times (n-r)}.$ Though $ e$ consists of $n$ numbers, yet expressed w.r.t. this basis there are only $n-r$ numbers, $w_1,...,w_{n-r}.$ If we write $w = (w_1,...,w_{n-r})',$ then $w = B' e.$ Since $B'P = O,$ hence $w = B' e = B'(I-P_X) y = B'y.$ So $w\sim N(0,\sigma^2 B'B) = N(0,\sigma^2 I_{n-r}).$ See the advantage of being clever while removing the redundancy? The log-likelihood based on $w$ is $$ \ell(\sigma^2) = \log \sigma^2 + \frac{w'w}{\sigma^2}. $$ Solving $\ell'(\sigma^2) = 0,$ we immediately get $$ \h \sigma^2 = \frac{w'w}{n-r} = \frac{e'e}{n-r}, $$ as expected.
Such an nice analytical expression was possible because we
assumed the dispersion matrix of $\epsilon $ to
be $\sigma^2 I.$ Had we assumed it to be
some $\Sigma(\theta),$ where $\theta$ is unknown (but
the form of $\Sigma(\cdot)$ being known), we would face a
bit more trouble.
First, $\h \beta $ may not have a closed-form expression any
more. The maximisation w.r.t. $\beta$ and $\theta$ are
entangled together. We shall proceed step by step: Let $\h
\beta(\theta)$ be the likelihood maximiser for a given $\theta.$
We shall apply ML on $e(\theta) = y-X \beta(\theta).$
$$
\ell(\theta) = \log |\Sigma(\theta)| + w' \Sigma(\theta) ^{-1} w.
$$
After this, REML has to proceed numerically.
weight length 1 23.1 1 23.0 2 25.3 2 25.5 3 26.9 3 27.1 4 29.7 4 29.8 5 31.9 5 31.9We want to fit the regression model $y_i = \alpha + \beta x_i + \epsilon_i,$ where $x$ is the weight hung from a spring, and $y$ is the resulting length of the spring. We assume that $\epsilon_i$'s have $N(0, \sigma^2)$ distribution for some unknown $\sigma^2>0,$ and that measurements involving the same weights are correlated with unknown correlation $\rho\in[-1,1].$ Estimate all the parameters ($\alpha,\beta,\sigma^2,\rho$) using any of the techniques discussed in this page.