extended discussion of the conjugate gradient method


Suppose we wish to solve the system

A𝐱=𝐛(1)

where A is a symmetric positive definite matrix. If we define the function

Φ(𝐱) = (1/2)𝐱TA𝐱 - 𝐱T𝐛  (2)

we realise that solving (1) is equivalent to minimizing Φ. This is because if 𝐱 is a minimum of Φ then we must have

∇Φ(𝐱) = A𝐱 - 𝐛 = 𝟎  (3)
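As a quick numerical illustration (the matrix and vector below are arbitrary choices of ours, not from the text), the gradient of Φ vanishes at the solution of (1), and Φ grows as we move away from it:

```python
import numpy as np

# An arbitrary small SPD example (our choice, for illustration only).
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def phi(x):
    """Phi(x) = (1/2) x^T A x - x^T b, as in (2)."""
    return 0.5 * x @ A @ x - x @ b

x_star = np.linalg.solve(A, b)   # exact solution of A x = b
grad = A @ x_star - b            # gradient of Phi at x_star, as in (3)
```

Here `grad` is zero up to roundoff, and evaluating `phi` at perturbed points confirms that `x_star` is a minimizer, since Φ(𝐱*+𝐝) - Φ(𝐱*) = (1/2)𝐝TA𝐝 > 0 for SPD A.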

These considerations give rise to the steepest descent algorithm. Given an approximation 𝐱~ to the solution 𝐱, the idea is to improve the approximation by moving in the direction in which Φ decreases most rapidly. This direction is given by the negative gradient of Φ at 𝐱~. Therefore we formulate our algorithm as follows.

Given an initial approximation 𝐱(0), for k = 1, 2, …, set

𝐱(k)=𝐱(k-1)+αk𝐫(k)

where 𝐫(k) = 𝐛 - A𝐱(k-1) = -∇Φ(𝐱(k-1)) and αk is a scalar to be determined.

𝐫(k) is traditionally called the residual vector. We wish to choose αk so as to reduce Φ as much as possible at each iteration; in other words, we wish to minimize the function ϕ(αk) = Φ(𝐱(k-1)+αk𝐫(k)) with respect to αk:

ϕ′(αk) = ∇Φ(𝐱(k-1)+αk𝐫(k))T𝐫(k) = [A𝐱(k-1) - 𝐛 + αkA𝐫(k)]T𝐫(k) = [-𝐫(k) + αkA𝐫(k)]T𝐫(k) = 0  ⟹  αk = (𝐫(k)T𝐫(k)) / (𝐫(k)TA𝐫(k))
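The iteration above can be sketched in a few lines of NumPy (a sketch, not from the original text; the function name and tolerances are our own):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=10_000):
    """Steepest descent for an SPD system A x = b, following the
    iteration above (function name and tolerances are our own)."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        r = b - A @ x                    # residual r^(k) = b - A x^(k-1)
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))  # optimal step length along r
        x = x + alpha * r                # x^(k) = x^(k-1) + alpha_k r^(k)
    return x
```

The `max_iter` cap is needed precisely because, as noted next, the method only converges in the limit.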

It is possible to show that the steepest descent algorithm described above converges to the solution 𝐱, but in general only in the limit, i.e. after infinitely many iterations. The conjugate gradient method improves on this by finding the exact solution (in exact arithmetic) after at most n iterations, where n is the dimension of the system. Let's see how we can achieve this.

We say that 𝐱~ is optimal with respect to a direction 𝐝 if λ = 0 is a local minimum for the function Φ(𝐱~+λ𝐝).

In the steepest descent algorithm, 𝐱(k) is optimal with respect to 𝐫(k), but in general it is not optimal with respect to 𝐫(0),…,𝐫(k-1). If we could modify the algorithm so that optimality with respect to the previous search directions is preserved, we might hope that 𝐱(n) is optimal with respect to n linearly independent directions, at which point we would have found the exact solution.

Let’s make the following modification

𝐩(k) = { 𝐫(1)                if k = 1
       { 𝐫(k) - βk𝐩(k-1)    if k > 1    (4)
𝐱(k)=𝐱(k-1)+αk𝐩(k)(5)

where αk and βk are scalar multipliers to be determined.

We choose αk in the same way as before, i.e. such that 𝐱(k) is optimal with respect to 𝐩(k)

αk = (𝐫(k)T𝐩(k)) / (𝐩(k)TA𝐩(k))  (6)

Now we wish to choose βk such that 𝐱(k) is also optimal with respect to 𝐩(k-1). We require that

∂Φ(𝐱(k)+λ𝐩(k-1))/∂λ |λ=0 = [A𝐱(k) - 𝐛]T𝐩(k-1) = 0  (7)

Since 𝐱(k)=𝐱(k-1)+αk𝐩(k), and assuming that 𝐱(k-1) is optimal with respect to 𝐩(k-1) (i.e. that [A𝐱(k-1)-𝐛]T𝐩(k-1)=0) we can rewrite this condition as

[A(𝐱(k-1)+αk𝐩(k))-𝐛]T𝐩(k-1)=αk[𝐫(k)-βk𝐩(k-1)]TA𝐩(k-1)=0(8)

and therefore we obtain the required value for βk,

βk = (𝐫(k)TA𝐩(k-1)) / (𝐩(k-1)TA𝐩(k-1))  (9)
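Putting (4), (5), (6) and (9) together gives the conjugate gradient iteration. The sketch below follows the update formulas above directly (variable names are our own; the residual recurrence 𝐫(k+1) = 𝐫(k) - αkA𝐩(k), which follows from (5), is used to avoid recomputing 𝐛 - A𝐱(k)):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-12):
    """Conjugate gradient iteration following eqs. (4)-(6) and (9);
    a sketch, with names and tolerance chosen here for illustration."""
    x = np.array(x0, dtype=float)
    r = b - A @ x                        # r^(1)
    p = r.copy()                         # p^(1) = r^(1), eq. (4)
    for _ in range(len(b)):              # at most n steps in exact arithmetic
        if np.linalg.norm(r) < tol:
            break
        Ap = A @ p
        alpha = (r @ p) / (p @ Ap)       # eq. (6)
        x = x + alpha * p                # eq. (5)
        r = r - alpha * Ap               # r^(k+1) = r^(k) - alpha_k A p^(k)
        beta = (r @ Ap) / (p @ Ap)       # eq. (9), with k replaced by k+1
        p = r - beta * p                 # eq. (4)
    return x
```

In exact arithmetic the loop terminates after at most n passes; in floating point the tolerance test does the remaining work.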

Now we want to show that 𝐱(k) is also optimal with respect to 𝐩(1),,𝐩(k-2), i.e. that

[A𝐱(k) - 𝐛]T𝐩(j) = -𝐫(k+1)T𝐩(j) = 0  for j = 1, …, k-2  (10)

We do this by strong induction on k, assuming that for all ℓ ≤ k-1 and all j ≤ ℓ, 𝐱(ℓ) is optimal with respect to 𝐩(j), or equivalently that

[A𝐱(ℓ) - 𝐛]T𝐩(j) = -𝐫(ℓ+1)T𝐩(j) = 0  (11)

Noticing that 𝐫(k+1) = 𝐫(k) - αkA𝐩(k), we want to show that

𝐫(k)T𝐩(j) - αk𝐩(k)TA𝐩(j) = 0  (12)

and therefore, since 𝐫(k)T𝐩(j) = 0 by the inductive hypothesis, it suffices to prove that

𝐩(k)TA𝐩(j) = 0  for j = 1, …, k-2  (13)

But, again by the definition of 𝐩(k), this is equivalent to proving

𝐫(k)TA𝐩(j) - βk𝐩(k-1)TA𝐩(j) = 0  (14)

and since 𝐩(k-1)TA𝐩(j) = 0 by the inductive hypothesis, all we need to prove is that

𝐫(k)TA𝐩(j) = 0  (15)

To proceed we require the following lemma.

For each ℓ ≥ 1, let Vℓ = span{A^(ℓ-1)𝐫(1), …, A𝐫(1), 𝐫(1)}.

  • 𝐫(ℓ), 𝐩(ℓ) ∈ Vℓ for every ℓ.

  • If 𝐲 ∈ Vℓ then 𝐫(k) ⊥ 𝐲 for all k > ℓ.

Since 𝐩(j) ∈ Vj, it follows that A𝐩(j) ∈ Vj+1, and since j+1 < k,

𝐫(k)T(A𝐩(j)) = 0  (16)

To finish, let's prove the lemma. The first point is by induction on ℓ. The base case ℓ = 1 holds, since 𝐩(1) = 𝐫(1) ∈ V1. Assuming that 𝐫(ℓ-1), 𝐩(ℓ-1) ∈ Vℓ-1, we have that

𝐫(ℓ) = 𝐫(ℓ-1) - αℓ-1A𝐩(ℓ-1) ∈ Vℓ,  𝐩(ℓ) = 𝐫(ℓ) - βℓ𝐩(ℓ-1) ∈ Vℓ  (17)

For the second point we need an alternative characterization of Vℓ. Since 𝐩(1), …, 𝐩(ℓ) ∈ Vℓ,

span{𝐩(1), …, 𝐩(ℓ)} ⊆ Vℓ  (18)

By (11), we have that if for some s > 1

𝐩(s) = λs-1𝐩(s-1) + ⋯ + λ1𝐩(1)  (19)

then 𝐩(s)T𝐫(s) = 0, but we know that

𝐩(s)T𝐫(s) = [𝐫(s) - βs𝐩(s-1)]T𝐫(s) = 𝐫(s)T𝐫(s)  (20)

but were this zero, we would have 𝐫(s) = 𝟎 and we would already have solved the original problem. Thus we conclude that 𝐩(s) ∉ span{𝐩(1), …, 𝐩(s-1)}, so the vectors 𝐩(1), …, 𝐩(s) are linearly independent. It follows that dim span{𝐩(1), …, 𝐩(ℓ)} = ℓ, which on the other hand is the maximum possible dimension of Vℓ, and thus we must have

Vℓ = span{𝐩(1), …, 𝐩(ℓ)}  (21)

which is the alternative characterization we were looking for. Now, again by (11), if 𝐲 ∈ Vℓ and ℓ < k, then

𝐫(k)T𝐲 = 0  (22)

thus the second point is proven.
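The optimality and conjugacy properties proved above are easy to check numerically. The sketch below (the SPD system is generated arbitrarily, for illustration only) runs the iteration (4)-(5), records all residuals and search directions, and verifies that every 𝐫(k+1)T𝐩(j) with j ≤ k and every 𝐩(k)TA𝐩(j) with j < k is zero up to roundoff:

```python
import numpy as np

# Arbitrary SPD test system (our choice, for illustration only).
rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)

# Run the conjugate gradient iteration, recording every vector.
x = np.zeros(n)
r = b - A @ x                        # r^(1)
p = r.copy()                         # p^(1) = r^(1)
directions, residuals = [p.copy()], []
for _ in range(n):                   # n steps suffice in exact arithmetic
    Ap = A @ p
    alpha = (r @ p) / (p @ Ap)       # eq. (6)
    x = x + alpha * p                # eq. (5)
    r = r - alpha * Ap               # r^(k+1) = r^(k) - alpha_k A p^(k)
    residuals.append(r.copy())
    beta = (r @ Ap) / (p @ Ap)       # eq. (9)
    p = r - beta * p                 # eq. (4)
    directions.append(p.copy())

# Each residual is orthogonal to all earlier search directions (eq. 10).
orth = max(abs(residuals[k] @ directions[j])
           for k in range(n) for j in range(k + 1))
# The search directions are mutually A-conjugate (eq. 13).
conj = max(abs(directions[k] @ A @ directions[j])
           for k in range(1, n) for j in range(k))
```

Both `orth` and `conj` come out at roundoff level, and `x` solves the system after exactly n steps, as the theory predicts.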
