[Home]

Analysis
Generalising derivatives to higher dimensions
Generalisation of the chain rule
Second derivative and curvature
Integration by substitution in higher dimension
Probabality
Markov chain
Statistics
PCA
Least squares
Partial correlation
Covariance matrix and Mahalanobis Distance
MLE
Part 1
Part 2
Determinant trick
Splitting a projection

$ \newcommand{\v}{\vec} \newcommand{\col}{{\mathcal C}} \newcommand{\vb}[1]{\vec{\bar{#1}}} \newcommand{\bc}{\because} \newcommand{\ip}[1]{\langle #1 \rangle}$ Applications Here we shall briefly touch upon various applications of linear algebra that you will encounter during the B.Stat course.

Analysis

Generalising derivatives to higher dimensions

Let $f:{\mathbb R}\rightarrow{\mathbb R}$ be a differentiable function. Then its derivative $f'(x)$ is another function from ${\mathbb R}$ to ${\mathbb R}.$ From this once may naively expect that the higher dimensional analogs of these concepts would be such that the "derivative" of a "differentiable" function from ${\mathbb R}^n$ to ${\mathbb R}^m$ would be another function from ${\mathbb R}^n$ to ${\mathbb R}^m.$

Unfortunately, that is not the case. The actual generalisation, though quite intuitive, is somewhat indirect, and uses linear transformations.

Why do we care about differentiability? Because, certain curves may be nicely approximated by a straight line locally, while certain other curves may not. When we say $f(x)$ has derivative $5$ at $x=1$ we mean that for $x\approx 1$ $$ f(x)-f(1) \approx 5(x-1). $$ This is made rigorous as $$ \lim_{x\rightarrow 1}\frac{|f(x)-f(1)-5(x-1)|}{|x-1|} = 0. $$ This may look different from the familiar definition of differentiation, but it is the same thing.

Well, "multiplication by 5" is a linear transformation. So the intuitive way to define differentiability of a function $F:{\mathbb R}^n\rightarrow{\mathbb R}^m$ at a point $\v p\in{\mathbb R}^n$ is that for $\v x\approx\v p$ we should have $$ F(\v x)-F(\v p) \approx D(\v x-\v p), $$ for some linear transformation $D:{\mathbb R}^n\rightarrow{\mathbb R}^m.$

Of course, this may be made rigorous much as in the 1-dim case, except that we use norm instead of absolute value: $$ \lim_{\v x\rightarrow \v p}\frac{\|F(\v x)-F(\v p)-D(\v x-\v p)\|}{\|\v x-\v p\|} = 0. $$ Careful here: the norm in the numerator is Euclidean norm in ${\mathbb R}^m$ while the denominator is the Euclidean norm in ${\mathbb R}^n.$

Thus the analog of the "derivative at a point" is a linear transformation! So if $F:{\mathbb R}^n\rightarrow{\mathbb R}^m$ is differentiable everywhere then the "derivative" is a function from ${\mathbb R}^n$ to the sapce of all linear transformations from ${\mathbb R}^n\rightarrow{\mathbb R}^m$!

It is cuustomary to express a linear transformation from ${\mathbb R}^n$ to ${\mathbb R}^m$ by an $m\times n$ matrix w.r.t. the Euclidean bases of ${\mathbb R}^n$ and ${\mathbb R}^m.$ If you do that it will be seen that the matrix representation of $D$ is just what we call the Jacobian matrix with $(i,j)$-th entry $$ \frac{\partial F_i}{\partial x_j}. $$ So it is not uncommon to refer to this matrix (instead of the linear transformation $D$) as the derivative. If you do so, something funny happens. It is quite possible that the "derivative" exists even though $F(\v x)$ is not differentiable! The function $F:{\mathbb R}^2\rightarrow{\mathbb R}$ defined by $$ F(x,y) = \left\{\begin{array}{ll}1 &\text{if }x=0\mbox{ or } y=0\\\mbox{any damn formula} &\text{otherwise.}\end{array}\right. $$ is one such example. Just visualise the function as a surface, and mentally compute the Jacobian at $(0,0)$ to see what I mean.

However, if you (correctly) stick to defining the derivative as the linear transformation (not just a matrix), then no such funny thing happens.

Generalisation of the chain rule

We are all familiar with the chain rule: if $f(x) = g(h(x))$ then $$ f'(1) = g'(h(1)) \times h'(1). $$ You may guess its generalisation for higher dimensions. If $H:{\mathbb R}^n\rightarrow{\mathbb R}^m$ and $G:{\mathbb R}^m\rightarrow{\mathbb R}^k$ and $F = G\circ H,$ then for a point $\v x\in{\mathbb R}^m$ we have $$ D_{F,\v x} = D_{G,H(\v x)} \circ D_{H,\v x}. $$ Here $D_{F,\v x}$ means the linear transformation that the denotes the derivative of $F$ at $\v x.$

Of course, you represent all the derivatives using the Jacobian matrices, then you get matrix multiplication: $$ J_{F,\v x} = J_{G,H(\v x)} J_{H,\v x}. $$ It is instructive to check that the sizes of the matrices allow you to perform the multiplication.

Second derivative and curvature

The first derivative of a function ${\mathbb R}^n$ to ${\mathbb R}^m$ turned out to be a linear transformation, the second derivative of a function from ${\mathbb R}^n$ to ${\mathbb R}$ turns out to be a quadratic form (represented by a real symmetric matrix called the Hessian).

To understand this consider a smooth function $f:{\mathbb R}\rightarrow{\mathbb R}.$ Take any $a\in{\mathbb R}.$ Then near $a$ the function may be well-approximated by its tangent: $y=f(a)+f'(a)(x-a).$ Of course, $f(x)$ need not be exactly equal to this if $x\neq a.$, since $f(x)$ may have a curvature, while the tangent (being of the form $y=c + m x$) is just straight. If we allow a quadartic approximation, then we hope to come closer to $f(x).$ A good approximation is $$ y = f(a)+f'(a)(x-a) + \frac{f''(a)}{2}(x-a)^2. $$ A similar thing happens even if $f:{\mathbb R}^n\rightarrow{\mathbb R}.$ Then the role of $f'(\v a)$ is played by the Jacobian matrix (which is $1\times n$). In this context (ie, when the domain is ${\mathbb R}^n$ and codomain is ${\mathbb R}$) the Jacobian has another name: the grad and denoted by $\nabla f(\v a).$ Thus $$ \nabla f(\v a) = \left[\begin{array}{ccccccccccc} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n}. \end{array}\right] $$ In this case the quadratic approximation is $$ y = f(\v a) + \nabla f(\v a) (\v x-\v a) + \frac12 (\v x-\v a)' H (\v x-\v a), $$ where $H$ is the $n\times n$ Hessian matrix with $(i,j)$-th entry $$ \frac{\partial^2 f}{\partial x_i\partial x_j}. $$ Under moderate restriction on $f$ (which we assume) $H$ is a symmetric matrix.

When we are trying to find max/min of $f(\v x)$ we naturally equate $\nabla f(\v a)$ to $0.$ All the values of $\v a$ where this holds are the candidate points where max/min may occur.

At any such $\v a$ we have $$ f(\v x) - f(\v a) \approx \frac12 (\v x-\v a)' H (\v x-\v a), $$ which is a quadratic form in $(\v x-\v a).$ Think of $f(\v x)$ as equal to this quadratic form over some neighbourhood of $\v a$ in ${\mathbb R}^n.$ Then it is not difficult to see that

$f$ has a strict local min at $\v a$ if $H$ is PD.
$f$ has a strict local max at $\v a$ if $H$ is PD.
$f$ has a saddle point at $\v a$ if $H$ is ID.
nothing conclusive can be said otherwise.

Integration by substitution in higher dimension

We are all familiar with the integration by substitution for function from ${\mathbb R}$ to ${\mathbb R}.$ If we substitute $x = u(t)$, then $$ \int_a^b f(x) dx = \int_{u ^{-1} (a)}^{u ^{-1} (b)} f(u(t)) u'(t) dt. $$ While we have been using this since our high school days, we are not often aware that for this $u$ needs to be a bijection (else $u ^{-1} $ would not make sense). Also it needs to be differentiable.

If we give $u ^{-1} $ a name, say $v.$ Then the result is $$ \int_a^b f(x) dx = \int_{v (a)}^{v (b)} f(v ^{-1}(t)) \frac{dv ^{-1}(t)}{dt} dt. $$ In this form it is more commonly called the change-of-variable formula.

Here the new variable is expressed as a function of the existing variable $t = v(x).$ This function needs to be a bijection (else $v ^{-1} $ cannot be defined), and the inverse needs to be differentiable.

The multi-dimensional generalisation of this formula is remarkably similar, except for a strange occurence of determinant:

Suppose that we are integrating $f:{\mathbb R}^n\rightarrow {\mathbb R}$ over some $S\subseteq{\mathbb R}^n,$ ie, we are computing $$ \int_S f(\v x) d\v x. $$ That single integral sign now represents an $n$-dimensional integration. We are also given a function $v:{\mathbb R}^n\rightarrow{\mathbb R}^n$ that transforms $S$ to $v(S)$ and is a bijection between $S$ and $v(S).$ We also assume that $v ^{-1} $ is differentiable over $v(S).$ Let $J(\v t)$ be the Jacobian matrix of $v ^{-1} $ computed at some point $\v t\in v(S).$ Then the change-of-variable formula becomes $$ \int_S f(\v x) d\v x = \int_{v(S)} f(v ^{-1} (\v t)) det(J(\v t)) d\v t. $$ A couple of points are worth noting:

the Jacobian is present only as its determinant.
there is no absolute value around determinant.

Probabality

Markov chain

Markov chain is a fancy term for a game of ludo played with a single counter. There is a board (called the state space) and the counter moves from a position to another following a die (or some other randomisation rule). Each position is called a state. A real ludo has different marks (arrows, snakes, ladders etc) on the board to guide the counter's motion (or transition) from state to state. This role is played by a matrix (called the transition matrix) in a Markov chain. If there are $n$ states, then it is an $n\times n$ matrix. If the counter is at state $i$ now, then $(i,j)$-th entry is the probability that in the next step it will be at state $j.$

If $X_k$ denotes the state of the counter at step $k,$ then the $(i,j)$-th entry of the transition matrix $P$ is $$ p_{ij} = P(X_{k+1}=j|X_k=i). $$ We are assuming that this is free of $k$ (ie, the same die is rolled at each step).

An example will help here.

For the following board


Ludo

w2 the state space is $\{1,2,...,23\}.$ Note that the unstable squares (mouth on snake and bottom of ladder) are not considered as states. Here the $1$-st row of the transition matrix contains $\frac16$ in columns $2,3,4,5,6$ and $16.$ The remaining entries in the row are zero.

Notice that $P^2$ is the 2-step transition matrix, ie, $$ ((P^2))_{ij} = P(X_{k+2}=j|X_k=i). $$ Similarly, $P^k$ is the $k$-step transition matrix.

$X_1$ is the initial position of the counter. Its distribution depends on the game starting rule. For example, if the counter always starts at $1,$ then $X_1$ is degenerate at $1.$ If the counter is kept outside the board, and enters according to the roll of a fair die, then $P(X_1=i) = \frac16$ if $i=1,2,3,4,5,16$ and $0$ else.

Whatever, the initial distribution, it can be specified by a $1\times n$ probability vector. Call it $\v p_1'.$ Note the transpose sign, since it is a row vector. Convince yourself that the distribution of $X_2$ is $$ \v p_2' = \v p_1' P. $$ In general, the distribution of $X_k$ is $$ \v p_k' = \v p_{k-1}' P. $$ For many important Markov chains the vector sequence $(\v p_1,\v p_2,\v p_3,...)$ converges to some probability vector $\v\pi.$ Then, taking limit of the last equality, we have $$ \v \pi' = \v\pi' P. $$ If only the equation were $P\v\pi=\v\pi,$ we would have said $\v\pi$ is an eigenvector of $P$ corresponding to eigenvalue 1. But here have $\v\pi$ to the left of $P.$ So we say that $\v\pi$ is a left eigenvector of $P$ corresponding to (left) eigenvalue 1. Of course, we could have also said: $\v\pi$ is an eigenvector of $P'$ corresponding to eigenvalue 1.

Anyway, we do not need the adjective "left" in front of eigenvalue, because of the following result.

EXERCISE:Show that for any matrix $A$ the eigenvalues of $A$ and $A'$ are the same.

However, left eigenvectors are indeed different from the usual (or right) eigenvectors. We said that the equality $\v \pi' = \v\pi' P$ results from the fact that for certain Markov chains the distributions of $X_k$'s converge. This gives rise to certain mathematical questions, which are presented in the problems below.

EXERCISE:Show that for any stochastic matrix, $1$ is an eigenvalue.

EXERCISE:Show that for every stochastic matrix there is a left eignvector which is a probability vector.

EXERCISE:Show that for a stochastic matrix, all the eigenvalues $\lambda$ must satisfy $|\lambda|\leq 1.$ Also if $|\lambda|=1,$ then $\lambda=1.$

Statistics

PCA

Consider the faces below:

If you want to describe any one of them then there are two ways to do it:

You can do it like a camera, take a small enough unit, and then start describing every square unit of area.

This technique is very precise, but generates too much data to describe a single face.
The second method is to pick a few important features of the faces (eg, spectacles, moustache), and describe only those features. Thus

In this technique we do lose some information, but still we are happy, because we have retained enough information.

To translate this idea to linear algebra, think of each key feature as a long line, eg, one line for moustache, another for spectacles. Each point along the moustache line represents one partixular type of moustache. Similarly each point along the spectacles line denotes a spectacle. In terms of linear algebra, each key feature is a 1-dim subspace. Each vector in the moustache subspace is one type of moustache. Similarly, each vector in the spectacles subspace is a type of spectacles.

PCA is a method to identify these subspaces.

The most desirable property of a key feature is that the individuals should vary a lot w.r.t. that feature. "The murderer has one head, two hands and two legs" is not a very useful description of the criminal, because there is hardly any variation in the numbers of heads, hands and legs among humans. But height, hairstyle and fingerprint are very informative, because there is a lot of variability there. So PCA actually looks for subspaces along which the data points vary the most.

An example will help here. The file pca.dat has a data set with 3 columns, X,Y and $Z.$ The interactive 3D scatterplot is shown here. Play with it to convince yourself that all the data points are actually lying almost on a plane. This means that the data vary in only two directions, not three. If we set up a new coordinate system with a point in this plane as the origin, and two vectors along the plane as the $x$- and $y$-axes, then each point may be expressed as just two numbers, instead of three.

Here we say that the intrinsic dimension of the data is $2,$ even though the apparent dimension is $3.$ The terms "intrinsic" and "apparent" are not standard, by the way.

There are are many real life data sets where the intrinsic dimension is substantially smaller than the apparent dimension. Actually, the intrinsic dimension usually depends on the complexity of the underlying system, while the apparent dimension depends on the elaborateness of the measurement. For example, if we photograph human faces, then the intrinsic dimension depends on the complexity of human faces, while the apparent dimension depends on the resolution of the camera.

Given a high dimensional data, we usually want to know the intrinsic dimension, and are happy if it is much smaller than the apparent dimension. This is precisely what PCA is useful for. is a very popular tool for this.

Suppose we $p$ measurements made on each of $n$ subects to produce a data matrix $X_{n\times p},$ where $p$ is large. Let the $p\times 1$ vector $\v x_i$ denote the measurements made for the $i$-th subject. PCA looks for the directions along which the spread of the data is the maximum. Let a typical direction be given by a unit vector $\v \ell_{p\times 1}.$ If the covariance matrix of the data is $S_{p\times p},$ then the amount of spread aong this direction is given by $$ \v\ell' S\v \ell. $$ We know that is maximum iff $\v\ell$ is an eigenvector (say $\v\ell_1$) corresponding to the largest eigenvalue of $S.$ Then the direction of second maximum spread is given by a unit vector $\v\ell$ which is an eigenvector corresponding to the second largest eigenvalue, and so on. Also the variance of the data along thoe directions are the corresponding eigenvalues.

Here is a real life example from image processing.

Faces of 5 men and 5 women are photographed, 5 photos per person. Here are two typical photos:

Each photo is of size $200\times 180.$ Thus there are $36,000$ pixels in a single photo, and each pixel contains a single number (the brightness). So the $5\times5\times5=125$ photos are like $125$ points in ${\mathbb R}^{36000}.$ As PCA reveals, these points actually live in a $8$-dim subspace (approximately).

The eigenvectors of $S$ corresponding the 8 largest eigenvalues span this subspace. These eigenvectors are the "key features" as identified by the system. If we wrap back these 8 vectors from ${\mathbb R}^{36000}$ into images, they look like these:

Admittedly these look confusing. They do not correspond to our intuitive sense of "key features". But if fact, they are much more powerful than what we could have achieved by intuition alone. Also here is the mean face:

If we denote this by $\v\mu$ and the $8$ eigenvectors shown earlier by $\v v_1,...,\v v_8,$ then each of $125$ photos $\v x$ can be well-approximated by $$ \v x\approx \v\mu+\alpha_1\v v_i+\cdots+ \alpha_8\v v_8, $$ Here the coefficients $\alpha_1,...,\alpha_8$ depend on the particular photo $\v x.$ To give you an idea of the accuracy of this approximation, we show a sequence of successive approximations: $$ \v\mu+\sum_{i=1}^k \alpha_i \v v_i,~~k=0,...,8. $$

All these data are freely available on the web. Do a google search for "eigenface" to find these. If you are interested carrying out the analysis yourself in R, please drop me a line.

The basic steps are:

Load the data as a $125\times 36000$ matrix, each row being a photo, each column devoted to one pixel position.
Average the rows to get $\v \mu.$
From each column subtract its mean (ie, centre each column). Call the new matrix $X_{125\times 36000}.$
Since we have already done centering, so the covariance matrix is simply $X'X,$ which is a huge matrix $36000\times 36000.$ We want its eigenvalues and eigenvectors. We know that SVD of $A$ gives eigenvalues and eigenvectors of $A'A.$ So we just compute SVD of $X.$

Least squares

If we are given a (possibly inconsistent) system of linear equations $A\v x = \v b,$ then one way of solving it is via the least squares method. This means finding a $\v x_0$ such that $A\v x_0$ is the orthogonal projection of $\v b$ onto $\col(A).$

The entire course of Linear Models revolves around this idea.

Consider a typical situation where we have some system (treated as a blackbox) into which a number of inputs are fed, and out of which a single output is coming out. In all real life situations, some random issues also enter the system that we cannot eliminate or detect. So the output is only an approximate function of the inputs. Many such cases lead to an equation of the form $$ \v y_{n\times 1} = X_{n\times p}\v \beta_{p\times 1} + \v \epsilon_{n\times 1}. $$ Here we have reeated the "apply-inputs-and-see-output" process $n$ times (possibly with different input combinations). The output of the $i$-th case is $y_i.$ $\epsilon_i$ is the error input for that case. The $i$-th row of $X$ is made of the inputs applied in the $i$-th case. The $\v \beta $ vector are the unknown coefficients to be estimated that relate the inputs to the output (and thus is a property of the system).

A typical example could be height-weight relation for men and women. So the blackbox diagram looks like:

Here height $(h)$ and gender $(g)$ are the inputs (along with the inevitable random error), and weight $(w)$ is the output. If it is believed that the height-weight relation is like a striaght line for both genders, and both the straight lines the slope is the same but the intercept may possibly differ, then ideally $$ w = \left\{\begin{array}{ll}a_1 + b w&\text{if }\mbox{ male}\\a_2 + b w&\text{if }\mbox{ female}\\\end{array}\right.. $$ If we have collected height-weight data for 4 men and 3 women, and have denoted the heights and weights by $(h_i, w_i)$ where $i=1,...,4$ for men and $i=5,6,7$ for the women, then in matrix notation we have $\v y = X\v \beta + \v \epsilon,$ where $\v y = (h_1,...,h_7)'$ and $\v \epsilon = (\epsilon_1,...,\epsilon_7)'.$ Also $$ X = \left[\begin{array}{ccccccccccc} 1 & 0 & h_1\\ 1 & 0 & h_2\\ 1 & 0 & h_3\\ 1 & 0 & h_4\\ 0 & 1 & h_5\\ 0 & 1 & h_6\\ 0 & 1 & h_7 \end{array}\right] \mbox{ and } \v \beta = \left[\begin{array}{ccccccccccc}a_1\\a_2\\b \end{array}\right]. $$ This is an example of a linear model.

Notice that there are two inputs here (other than the random error). One input (height) is continuous, while the other (gender) is discrete. So this linear model is called an ANCOVA model. If there were no continuous input, then we would have called it an ANOVA model. If, on the other hand, only continuous inputs were present, then it would have been a regression model. But whatever the name, the same concept of orthogonal projection is used.

Partial correlation

The Schur complement of $D$ in $\left[\begin{array}{ccccccccccc}A & B \\ C & D \end{array}\right]$ is defined as $A-BD ^{-1} C.$ Of course, $D$ must be nonsingular for this to make sense.

This has an interesting application in statistics. Suppose that a sales department collects data on its sales persons. For each sales person two variables are measured: the age ($X_1$) and the average amount of daily sale ($X_2$). It is found that correlation of $X_1$ and $X_2$ is quite high and positive. From this the sales manager concludes that older sales persons are better, and hence orders the recruiting department to recruit only elderly persons.

Is this a wise decision?

No. The high positive correlation between $X_1$ and $X_2$ may not indicate that high $X_1$ causes $X_2$ to shoot up. In fact, it is more likely that experience with customer handling is main reason behind the improvement in sale. So the situation is like

more experience $\Rightarrow$ more sale
more experience means more age.

So recruiting fresh elderly persons would not help (since they have little experience).

But how to statistically validate the effect of experience? We can measure another variable for each sales person: experience in years ($X_3$). Then we can take out the effect of $X_3$ from both $X_1$ and $X_2$, and then compute the correlation between whatever is left of $X_1$ and $X_2.$

One popular way of doing this is as follows.

First orthogonally project $X_1$ onto $span\{1,X_3\}$ (ie, fit a leat squares straight line $\hat X_1 = a\cdot 1+ b\cdot X_3$) and then take the residual vector.
Similarly we extract the component of $X_2$ orthogonal to $span\{1,X_3\}.$
Now find the correlation between these orthogonal components. This correlation is called the partial correlation between $X_1$ and $X_2$ given $X_3.$ We also call $X_3$ the conditioning variable.

In general, there may be many conditioning variables.

It may sound like a rather complicated procedure, but it turns out that there is a surprising shortcut:

First compute the covariance matrix $\Sigma$ of $(X_3,X_1,X_2).$ Notice the ordering of the variables: the conditioning variables (here we have just one, $X_3$) are written first. We assume $\Sigma$ is PD.
Partition $\Sigma$ as $$ \Sigma = \left[\begin{array}{ccccccccccc}A & B\\B' & C \end{array}\right], $$ where $A$ corresponds to the conditioning variables. (Thus, in our example $A$ is $1\times 1$). Now compute the Schur complement of $A$ in $\Sigma:$ $$ C - B'A ^{-1} B. $$ This is a $2\times 2$ matrix, which is again PD. This is called the partial covariance matrix of $X_1, X_2$ given $X_3.$
Compute the correlation based on this matrix, (ie, if the entries are $d_{ij},$ then find $d_{12}/\sqrt{d_{11}d_{22}}$). This number is bound to be the required partial correlation!

If there are many conditioning variables, then $A$ will be a large matrix. So computing Schur complement directly may look daunting. Then there is another shortcut. SImply apply Gauss-Jordan elimination on $\Sigma$ and do this until $A$ reduces to $I.$ Then the matrix will look like $$ \left[\begin{array}{ccccccccccc} I & *\\ O & P \end{array}\right] $$ This $P$ will be the required Schur complement!

Covariance matrix and Mahalanobis Distance

We have learned about different inner products and norms. In general an inner product over ${\mathbb R}^n$ looks like $$ \ip{\v x, \v y} = \v x' M \v y, $$ where $M$ is a PD matrix. Euclidean matrix uses $M=I.$ Here is a situation where $M\neq I$ is a natural choice.

Consider the cartoon faces below:

All the faces are slight variations of each other. There is one face, however, which is just too much off! Which one?

The last one, of course! No normal human face has two eyes differing that much. Note the point here: eyes may differ a lot from face to face, but both the eyes on the same face are approximately mirror images of each other.

One way to quantify this in terms a norm (induced by an inner product), is via the Mahalanobis distance. To understand this, let us focus only on the horozontal lengths of the eyes (say $X$ for the left eye and $Y$ for the right). If we collect data on $(X,Y)$ from many faces, then we expect a scatterplot like this:

Notice that the green dot is further away than the blue dot from the red dot according to Euclidean norm. But intuitively, the blue face is much more different from an average face than the the green one is.

Mahalanobis proposed a new type of norm that coincides with our intuition. His proposed norm is $\|\v x\|,$ where $$ \|\v x\|^2 = \v x' \Sigma ^{-1} \v x. $$ Here $\Sigma$ is the covariance matrix (assumed PD). The presence of the $\Sigma ^{-1} $ essentially scales down the data by different amounts in different directions until all the directions have equal spread, and then computes Euclidean norm based on the scaled data. It is the multidimensional analog of stadardisation for 1-dim data: $\frac{x_i - \mu}{\sigma}.$

MLE

Part 1

In statistics we often have to estimate parameter values using a given data set. The most popular method for doing this is called maximum likelihood estimation (MLE), which requires us to maximise a particular function. In a frequently occuring set up involving the multivariate normal distribution the result $tr(AB)=tr(BA)$ proves useful as we show now.

At one place in the computation we need to minimise $$ \sum_{i=1}^n (\v x_i-\v \mu)' \Sigma ^{-1} (\v x_i -\v \mu ) $$ w.r.t. $\v\mu\in{\mathbb R}^k.$ Here $\Sigma_{k\times k} $ is the fixed PD matrix and $\v x_1,...,\v x_n$ are our data points in ${\mathbb R}^k.$

We proceed as follows. \begin{eqnarray*} \sum_{i=1}^n (\v x_i-\v \mu)' \Sigma ^{-1} (\v x_i -\v \mu ) & = & tr(\sum_{i=1}^n (\v x_i-\v \mu)' \Sigma ^{-1} (\v x_i -\v \mu ))~~\left[\mbox{$\bc tr(A_{1\times 1})=A$}\right]\\ & = & \sum_{i=1}^n tr((\v x_i-\v \mu)' \Sigma ^{-1} (\v x_i -\v \mu ))~~\left[\mbox{$\bc tr(A+B)=tr(A)+tr(B)$}\right]\\ & = & \sum_{i=1}^n tr( \Sigma ^{-1} (\v x_i -\v \mu ) (\v x_i-\v \mu)')~~\left[\mbox{$\bc tr(AB)=tr(BA)$}\right]\\ & = & tr( \sum_{i=1}^n \Sigma ^{-1} (\v x_i -\v \mu ) (\v x_i-\v \mu)')\\ & = & tr( \Sigma ^{-1} \sum_{i=1}^n (\v x_i -\v \mu ) (\v x_i-\v \mu)')\\ & = & tr( \Sigma ^{-1} \sum_{i=1}^n (\v x_i -\vb x+\vb x-\v \mu) (\v x_i-\vb x+\vb x\v \mu)')~~\left[\mbox{where $\vb x = \frac1n\sum\v x_i$}\right] \\ & = & \cdots\\ & = & tr(\Sigma ^{-1} (A+n (\vb x-\v \mu)(\vb x-\v \mu)')), \end{eqnarray*} where $A_{k\times k}$ is free of $\v \mu.$

Thus it is enough to minimise $tr(\Sigma ^{-1} (\vb x-\v \mu)(\vb x-\v \mu)')),$ which may be written as $tr((\vb x-\v \mu)'\Sigma ^{-1} (\vb x-\v \mu))) =(\vb x-\v \mu)'\Sigma ^{-1} (\vb x-\v \mu))), $ since it is a scalar.

Since $\Sigma $ is PD, so is $\Sigma ^{-1}.$ Hence the minimum value is $0$ attained iff $\v \mu = \vb x.$

Part 2

In the example above we optimised w.r.t. $\v \mu$ keeping the PD matrix $\Sigma $ fixed. In fact, we need to optimise w.r.t. $\Sigma $ as well. This eventually leads to the problem of maximising $$f(G) = n\log|G|-tr(GA), ~~G \mbox{ is PD}$$ where $A$ is a fixed PD matrix. Remember that the domain of $f$ is the set of all $k\times k$ PD matrices.

First we decompose $A$ as $A=CC',$ where $C$ is nonsignular. Then $$ f(G) = n\log|(C')^{-1} C' G C C ^{-1} |- tr(G C C') = n\log|(CC')^{-1}|+n\log| C' G C |- tr(C' G C ). $$ Let $\Delta = C'GC.$ Then (since $C$ is nonsingular) $\Delta $ takes all possible PD values when $G$ does so. Hence we can as well carry out the maximisation w.r.t. $\Delta.$ Then the function is $$ F(\Delta) = n\log|\Delta| - tr(\Delta). $$ If the eigenvalues of $\Delta_{k\times k} $ are $\lambda_1,...,\lambda_k,$ then $$ F(\Delta) = n\sum (\log \lambda_i - \lambda_i). $$ Now we can maximise w.r.t. the real variables $\lambda_i$'s (using calculus) to eventually conclude that the maximum occurs when $G = \frac1nA.$

Determinant trick

We know two formulae for determinant of a partitioned matrix $$ M=\left[\begin{array}{ccccccccccc}A & B \\ C & D \end{array}\right]. $$

If $A$ is nonsingular, then $|M| = |A|\cdot|D-CA ^{-1} B|.$
If $D$ is nonsingular, then $|M| = |D|\cdot|A-BD ^{-1} C|.$

These results may prove useful while dealing with ratios of determinants.

In a test of hypothesis problem we need to simplify the following ratio of determinants: $$ \frac{|\sum (\v x_i-\v \mu)(\v x_i-\v \mu)'|}{|\sum (\v x_i-\vb x)(\v x_i-\vb x)'|}. $$ A little manipulation (adding and subtracting $\vb x$) converts the numerator to $$ |A+n(\vb x-\v \mu)(\vb x-\v \mu)'|, $$ where $A$ is the denominator matrix. Now this is actually the determinant of $$ \left[\begin{array}{ccccccccccc} 1 & \v u\\-\v u' & A \end{array}\right], $$ where $\v u = \sqrt{n}(\vb x-\v\mu).$

Hence the ratio becomes $(1+\v u'A ^{-1} \v u),$ which is a lot more manageable.

Splitting a projection

Consider the (orthogonal) projection of ${\mathbb R}^3$ onto the $x$-$y$-plane: $(x,y,z)'\mapsto (x,y,0)'.$ You can split this projection into two parts: projecting onto $x$-axis, and projecting onto $y$-axis: $$ \left[\begin{array}{ccccccccccc}x\\y\\0 \end{array}\right] = \left[\begin{array}{ccccccccccc}x\\0\\0 \end{array}\right] + \left[\begin{array}{ccccccccccc}0\\y\\0 \end{array}\right]. $$ Since projections may be considered as idempotent matrices, hence this is actually a splitting of one idempotent matrix as the sum of two "simpler" (ie, lower rank) idempotent matrices: $$ \left[\begin{array}{ccccccccccc} 1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{array}\right] = \left[\begin{array}{ccccccccccc} 1 & 0 & 0\\ 0 & 0 & 0\\ 0 & 0 & 0 \end{array}\right] + \left[\begin{array}{ccccccccccc} 0 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 0 \end{array}\right]. $$ A couple of facts should be obvious intuitively:

for this to work the two smaller subspaces ($x$-axis and $y$-axis in our example) must not "overlap" in the sense that members of one subspace when projected to the other must vanish. In our example of orthogonal projections, this means they must be mutually orthogonal.
their dimensions should add up to the dimension of the original subspace ($x$-$y$-plane in our example).

In terms of matrices, we get the following theorems.

Theorem Let $A = A_1+\cdots A_k,$ where $A_i$'s are all idempotent, with $A_iA_j=O$ for $i\neq j.$

Then $A$ must be idempotent as well.

Proof: Direct squaring of $A$ will show that all cross-terms vanish. [QED]

Theorem Let $A = A_1+\cdots +A_k,$ where $A$ and $A_i$'s are all idempotent.

Then $r(A) = r(A_1)+\cdots+r(A_k).$

Proof: Just use the fact $r(B)=tr(B)$ for an idempotent matrix $B.$ [QED]

Theorem Let $A$ be idempotent and $A = A_1+\cdots+A_k.$ If $r(A) = r(A_1)+\cdots+r(A_k)$, then each $A_i$ must be idempotent and $A_i A_j=O$ for $i\neq j.$

Proof: Let $A_i = B_i C_i$ be a rank factorisation.

Notice that if we define $B = [B_1~~\cdots~~B_k]$ and $C = \left[\begin{array}{ccccccccccc}C_1\\\vdots\\C_k \end{array}\right],$ then $A = BC.$

Now the number of columns of $B$ equals $r(A)$ and hence this is a rank factorisation of $A.$

Since $A^2=A,$ we have $BCBC = BC,$ or $CB=I,$ since $B$ is full column rank (hence can be cancelled from left) and $C$ is full row rank (hence can be cancelled from right).

But this means $C_i B_i = I$ and $C_i B_j = O$ if $i\neq j.$

So $A_i^2 = A_i$ and $A_i A_j=O$ if $i\neq j.$ [QED]

HTML Comment Box is loading comments...

Table of contents