$$ \newcommand{\arginf}{\mathrm{arginf}} \newcommand{\argmin}{\mathrm{argmin}} \newcommand{\argmax}{\mathrm{argmax}} \newcommand{\asconv}[1]{\stackrel{#1-a.s.}{\rightarrow}} \newcommand{\Aset}{\mathsf{A}} \newcommand{\b}[1]{{\mathbf{#1}}} \newcommand{\ball}[1]{\mathsf{B}(#1)} \newcommand{\bbQ}{{\mathbb Q}} \newcommand{\bproof}{\textbf{Proof :}\quad} \newcommand{\bmuf}[2]{b_{#1,#2}} \newcommand{\card}{\mathrm{card}} \newcommand{\chunk}[3]{{#1}_{#2:#3}} \newcommand{\condtrans}[3]{p_{#1}(#2|#3)} \newcommand{\convprob}[1]{\stackrel{#1-\text{prob}}{\rightarrow}} \newcommand{\Cov}{\mathbb{C}\mathrm{ov}} \newcommand{\cro}[1]{\langle #1 \rangle} \newcommand{\CPE}[2]{\PE\lr{#1| #2}} \renewcommand{\det}{\mathrm{det}} \newcommand{\dimlabel}{\mathsf{m}} \newcommand{\dimU}{\mathsf{q}} \newcommand{\dimX}{\mathsf{d}} \newcommand{\dimY}{\mathsf{p}} \newcommand{\dlim}{\Rightarrow} \newcommand{\e}[1]{{\left\lfloor #1 \right\rfloor}} \newcommand{\eproof}{\quad \Box} \newcommand{\eremark}{</WRAP>} \newcommand{\eqdef}{:=} \newcommand{\eqlaw}{\stackrel{\mathcal{L}}{=}} \newcommand{\eqsp}{\;} \newcommand{\Eset}{ {\mathsf E}} \newcommand{\esssup}{\mathrm{essup}} \newcommand{\fr}[1]{{\left\langle #1 \right\rangle}} \newcommand{\falph}{f} \renewcommand{\geq}{\geqslant} \newcommand{\hchi}{\hat \chi} \newcommand{\Hset}{\mathsf{H}} \newcommand{\Id}{\mathrm{Id}} \newcommand{\img}{\text{Im}} \newcommand{\indi}[1]{\mathbf{1}_{#1}} \newcommand{\indiacc}[1]{\mathbf{1}_{\{#1\}}} \newcommand{\indin}[1]{\mathbf{1}\{#1\}} \newcommand{\itemm}{\quad \quad \blacktriangleright \;} \newcommand{\jointtrans}[3]{p_{#1}(#2,#3)} \newcommand{\ker}{\text{Ker}} \newcommand{\klbck}[2]{\mathrm{K}\lr{#1||#2}} \newcommand{\law}{\mathcal{L}} \newcommand{\labelinit}{\pi} \newcommand{\labelkernel}{Q} \renewcommand{\leq}{\leqslant} \newcommand{\lone}{\mathsf{L}_1} \newcommand{\lp}[1]{\mathsf{L}_{{#1}}} \newcommand{\lrav}[1]{\left|#1 \right|} \newcommand{\lr}[1]{\left(#1 \right)} \newcommand{\lrb}[1]{\left[#1 \right]} 
\newcommand{\lrc}[1]{\left\{#1 \right\}} \newcommand{\lrcb}[1]{\left\{#1 \right\}} \newcommand{\ltwo}[1]{\PE^{1/2}\lrb{\lrcb{#1}^2}} \newcommand{\Ltwo}{\mathrm{L}^2} \newcommand{\mc}[1]{\mathcal{#1}} \newcommand{\mcbb}{\mathcal B} \newcommand{\mcf}{\mathcal{F}} \newcommand{\meas}[1]{\mathrm{M}_{#1}} \newcommand{\norm}[1]{\left\|#1\right\|} \newcommand{\normmat}[1]{{\left\vert\kern-0.25ex\left\vert\kern-0.25ex\left\vert #1 \right\vert\kern-0.25ex\right\vert\kern-0.25ex\right\vert}} \newcommand{\nset}{\mathbb N} \newcommand{\N}{\mathcal{N}} \newcommand{\one}{\mathsf{1}} \newcommand{\PE}{\mathbb E} \newcommand{\pminfty}{_{-\infty}^\infty} \newcommand{\PP}{\mathbb P} \newcommand{\projorth}[1]{\mathsf{P}^\perp_{#1}} \newcommand{\Psif}{\Psi_f} \newcommand{\pscal}[2]{\langle #1,#2\rangle} \newcommand{\pscal}[2]{\langle #1,#2\rangle} \newcommand{\psconv}{\stackrel{\PP-a.s.}{\rightarrow}} \newcommand{\qset}{\mathbb Q} \newcommand{\revcondtrans}[3]{q_{#1}(#2|#3)} \newcommand{\rmd}{\mathrm d} \newcommand{\rme}{\mathrm e} \newcommand{\rmi}{\mathrm i} \newcommand{\Rset}{\mathbb{R}} \newcommand{\rset}{\mathbb{R}} \newcommand{\rti}{\sigma} \newcommand{\section}[1]{==== #1 ====} \newcommand{\seq}[2]{\lrc{#1\eqsp: \eqsp #2}} \newcommand{\set}[2]{\lrc{#1\eqsp: \eqsp #2}} \newcommand{\sg}{\mathrm{sgn}} \newcommand{\supnorm}[1]{\left\|#1\right\|_{\infty}} \newcommand{\thv}{{\theta_\star}} \newcommand{\tmu}{ {\tilde{\mu}}} \newcommand{\Tset}{ {\mathsf{T}}} \newcommand{\Tsigma}{ {\mathcal{T}}} \newcommand{\ttheta}{{\tilde \theta}} \newcommand{\tv}[1]{\left\|#1\right\|_{\mathrm{TV}}} \newcommand{\unif}{\mathrm{Unif}} \newcommand{\weaklim}[1]{\stackrel{\mathcal{L}_{#1}}{\rightsquigarrow}} \newcommand{\Xset}{{\mathsf X}} \newcommand{\Xsigma}{\mathcal X} \newcommand{\Yset}{{\mathsf Y}} \newcommand{\Ysigma}{\mathcal Y} \newcommand{\Var}{\mathbb{V}\mathrm{ar}} \newcommand{\zset}{\mathbb{Z}} \newcommand{\Zset}{\mathsf{Z}} $$
Let $\b{X}$ be an $n \times p$ matrix with real-valued entries. We define $\img(\b{X})=\set{\b{X}\b{y}}{\b{y} \in \rset^p}$ and $\ker(\b{X})=\set{\b{y} \in \rset^p}{\b{X}\b{y}=0}$. Note that $\img(\b{X})$ and $\ker(\b{X}^T)$ are both subspaces of $\rset^n$.
$\itemm$ $\img(\b{X}) \stackrel{\perp}{\oplus} \underbrace{\img(\b{X})^\perp}_{\ker(\b{X}^T)}= \rset^n$. Otherwise stated, $x$ is the orthogonal projection of $y$ on $\img(\b{X})$ if and only if $x \in \img(\b{X})$ and $(y-x) \in \ker(\b{X}^T)$. The fact that $\ker(\b{X}^T)$ is the orthogonal complement of $\img(\b{X})$ stems from the following remark: $z \in \img(\b{X})^\perp$ is equivalent to $\b{X}_i^T z=0$ for all $i \in \{1,\ldots,p\}$, where $\b{X}_1,\ldots,\b{X}_p$ are the column vectors of $\b{X}$, and this in turn is equivalent to $\b{X}^Tz=0$.
$\itemm$ Denote by $P_{\img(\b{X})}$ the matrix of the orthogonal projection on $\img(\b{X})$. By abuse of notation, we also write $P_{\b{X}}$. We have $P_\b{X}=P_\b{X}^2=P_\b{X}^T$ and we can also note that $P_\b{X}=I$ on $\img(\b{X})$.
$\itemm$ An orthogonal projection is uniquely determined by the subspace on which it projects. This implies in particular the following property. Assume that $\b{X}$ is an $n \times p$ matrix and $\b{W}$ is an $n \times k$ matrix, where possibly $k \neq p$. As soon as $\img(\b{X})=\img(\b{W})$, we have $P_\b{X}=P_{\img(\b{X})}=P_{\img(\b{W})}=P_\b{W}$.
$\itemm$ By the Pythagorean identity, for all $a\in \rset^n$, we have $\|a\|^2=\|P_\b{X}a\|^2+\|(I-P_\b{X})a\|^2$ ($\star$), showing that $\|a\|\geq \|P_\b{X}a\|$, with equality if and only if $P_\b{X}a=a$.
$\itemm$ A $p \times n$ matrix $\b{X}^g$ is called a pseudo inverse of $\b{X}$ iff $\b{X} \b{X}^g \b{X}=\b{X}$ or, equivalently, $\b{X} \b{X}^g \lambda=\lambda$ for all $\lambda \in \img(\b{X})$. We take for granted, without proof, that a pseudo inverse always exists; it need not be unique.
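To make the items above concrete, here is a small worked example; the matrix and vectors are chosen purely for illustration and do not come from the text above. Take $n=p=2$ and $$ \b{X}=\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad \img(\b{X})=\set{t\,(1,1)^T}{t\in\rset}, \quad \ker(\b{X}^T)=\set{t\,(1,-1)^T}{t\in\rset}, \quad P_\b{X}=\frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}. $$ One checks directly that $P_\b{X}=P_\b{X}^2=P_\b{X}^T$. For $a=(3,1)^T$, we get $P_\b{X}a=(2,2)^T \in \img(\b{X})$ and $(I-P_\b{X})a=(1,-1)^T \in \ker(\b{X}^T)$, so that $\|a\|^2=10=8+2=\|P_\b{X}a\|^2+\|(I-P_\b{X})a\|^2$. The $2 \times 1$ matrix $\b{W}=(1,1)^T$ has the same image as $\b{X}$, hence $P_\b{W}=P_\b{X}$ although $k=1\neq p$. Finally, writing $s$ for the sum of the entries of a $2\times 2$ matrix $\b{G}$, we have $\b{X}\b{G}\b{X}=s\,\b{X}$, so that $\b{G}$ is a pseudo inverse of $\b{X}$ iff $s=1$. For instance, $$ \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad \frac{1}{4}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} $$ are two different pseudo inverses of this $\b{X}$.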
Let $\b{y}$ be a vector of size $n$ and $\b{X}$ an $n \times p$ matrix. Since $P_\b{X}\b{y}$ is in $\img(\b{X})$, it can be written as $P_\b{X} \b{y}=\b{X} \b{\hat b}$ for some $\b{\hat b} \in \rset^p$.
The fundamental result. Theorem. The following properties are equivalent
Moreover, $P_\b{X}=\b{X}(\b{X}^T \b{X})^g\b{X}^T$ for any choice of the pseudo inverse $(\b{X}^T \b{X})^g$.
A consequence of this theorem is that $\b{X}(\b{X}^T \b{X})^g\b{X}^T$ does not depend on the choice of the pseudo inverse $(\b{X}^T \b{X})^g$.
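This invariance can be checked by hand on a small rank-deficient example; the matrix below is an illustrative choice and is not taken from the text. Let $\b{X}=\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, so that $\b{X}^T\b{X}=\begin{pmatrix} 2 & 2 \\ 2 & 2 \end{pmatrix}$ is not invertible. Both $$ \b{G}_1=\begin{pmatrix} 1/2 & 0 \\ 0 & 0 \end{pmatrix} \quad \text{and} \quad \b{G}_2=\frac{1}{8}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} $$ satisfy $(\b{X}^T\b{X})\b{G}_i(\b{X}^T\b{X})=\b{X}^T\b{X}$, hence both are pseudo inverses of $\b{X}^T\b{X}$, and a direct computation gives $$ \b{X}\b{G}_1\b{X}^T=\frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}=\b{X}\b{G}_2\b{X}^T, $$ which is the matrix of the orthogonal projection on the line spanned by $(1,1)^T$, that is, on $\img(\b{X})$.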
We can now give the general solutions of the normal equations.
Solving the normal equations. Theorem. The following equivalence holds true. $\b{\hat b}$ solves the normal equations iff there exists $\b{z}\in \rset^p$ such that $$ \b{\hat b}=(\b{X}^T \b{X})^g\b{X}^T \b{y}+ \lrb{I-(\b{X}^T \b{X})^g\b{X}^T \b{X}} \b{z} $$ for some pseudo-inverse $(\b{X}^T \b{X})^g$.
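As a sketch of how this formula works in practice, consider the following example, whose data are chosen here for illustration only; the normal equations are taken in their usual form $\b{X}^T\b{X}\b{\hat b}=\b{X}^T\b{y}$. Let $\b{X}=\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ and $\b{y}=(3,1)^T$. Then $\b{X}^T\b{y}=(4,4)^T$ and the normal equations reduce to the single constraint $\hat b_1+\hat b_2=2$. Choosing the pseudo inverse $(\b{X}^T\b{X})^g=\begin{pmatrix} 1/2 & 0 \\ 0 & 0 \end{pmatrix}$, we get $$ (\b{X}^T\b{X})^g\b{X}^T\b{y}=\begin{pmatrix} 2 \\ 0 \end{pmatrix}, \quad I-(\b{X}^T\b{X})^g\b{X}^T\b{X}=\begin{pmatrix} 0 & -1 \\ 0 & 1 \end{pmatrix}, \quad \b{\hat b}=\begin{pmatrix} 2-z_2 \\ z_2 \end{pmatrix}, \quad z_2\in\rset. $$ As $\b{z}$ ranges over $\rset^2$, this describes exactly the line $\hat b_1+\hat b_2=2$, and every such $\b{\hat b}$ gives the same fitted vector $\b{X}\b{\hat b}=(2,2)^T=P_\b{X}\b{y}$.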
We now consider $\b{y}=\b{X}\b{b}+e$ where $e$ is a zero-mean random vector with covariance matrix $\sigma^2 I_n$. We say that $\lambda^T \b{b}$ is estimable if there exists $a\in \rset^n$ such that $a^T\b{y}$ is an unbiased estimator of $\lambda^T \b{b}$, which is equivalent to $\lambda^T \b{b}=a^T\b{X}\b{b}$ for all $\b{b}\in \rset^p$ and therefore to $\lambda^T=a^T\b{X}$.
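To illustrate estimability on a rank-deficient design (an example chosen here, not taken from the text), let $\b{X}=\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ and $\b{b}=(b_1,b_2)^T$. For any $a=(a_1,a_2)^T$, $$ a^T\b{X}=(a_1+a_2)\,(1,1), \quad \text{so that} \quad \PE\lrb{a^T\b{y}}=(a_1+a_2)(b_1+b_2). $$ Hence $\lambda^T\b{b}$ is estimable iff $\lambda_1=\lambda_2$: the sum $b_1+b_2$ is estimable (take $a=(1,0)^T$), whereas $b_1$ alone is not, since no $a$ satisfies $a^T\b{X}=(1,0)$.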
The Gauss-Markov Theorem. For any linear unbiased estimator $a^T \b{y}$ of $\lambda^T \b{b}$, $$ \Var(a^T \b{y})\geq \Var(\lambda^T \b{\hat b}) $$ where $\b{\hat b}$ is the least squares estimator of $\b{b}$. We then say that $\lambda^T \b{\hat b}$ is BLUE (Best Linear Unbiased Estimator). Moreover, equality holds only if $a^T \b{y}=\lambda^T\b{\hat b}$, which means that the BLUE is unique.
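The inequality can be verified explicitly in a toy model, given here as an illustration under assumptions that are not in the text: $\b{X}=\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ and $\lambda=(1,1)^T$. A linear estimator $a^T\b{y}$ is unbiased for $\lambda^T\b{b}=b_1+b_2$ iff $a^T\b{X}=\lambda^T$, that is, iff $a_1+a_2=1$, and then $\Var(a^T\b{y})=\sigma^2(a_1^2+a_2^2)$. On the other hand, for any such $a$, $\lambda^T\b{\hat b}=a^T\b{X}\b{\hat b}=a^T P_\b{X}\b{y}=(y_1+y_2)/2$, so that $$ \Var(\lambda^T\b{\hat b})=\frac{\sigma^2}{2} \leq \sigma^2(a_1^2+a_2^2) \quad \text{whenever } a_1+a_2=1, $$ with equality iff $a_1=a_2=1/2$, in which case $a^T\b{y}=\lambda^T\b{\hat b}$. For example, the unbiased estimator $y_1$ (obtained with $a=(1,0)^T$) has variance $\sigma^2$, twice that of the BLUE.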
In the course of the proof, we have seen that if $\lambda^T \b{b}$ is estimable, then choosing $a$ such that $\lambda^T=a^T\b{X}$, $$ \Var(\lambda^T \b{\hat b})= \sigma^2 \|P_\b{X}a\|^2=\sigma^2 a^T P_\b{X}^T P_\b{X} a=\sigma^2 a^T P_\b{X} a=\sigma^2 a^T \b{X} (\b{X}^T\b{X})^g \b{X}^T a=\sigma^2 \lambda^T (\b{X}^T\b{X})^g \lambda, $$ and a consequence is that $\lambda^T (\b{X}^T\b{X})^g \lambda$ does not depend on the chosen pseudo-inverse whenever $\lambda \in \img(\b{X}^T)$.
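The condition $\lambda \in \img(\b{X}^T)$ matters, as the following hand computation suggests; the matrices are illustrative choices and are not part of the text. Let $\b{X}=\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$, for which $\b{G}_1=\begin{pmatrix} 1/2 & 0 \\ 0 & 0 \end{pmatrix}$ and $\b{G}_2=\frac{1}{8}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$ are two pseudo inverses of $\b{X}^T\b{X}$. For $\lambda=(1,1)^T \in \img(\b{X}^T)$, $$ \lambda^T\b{G}_1\lambda=\frac{1}{2}=\lambda^T\b{G}_2\lambda, \quad \text{so that} \quad \Var(\lambda^T\b{\hat b})=\frac{\sigma^2}{2}, $$ whereas for $\lambda=(1,0)^T \notin \img(\b{X}^T)$, $$ \lambda^T\b{G}_1\lambda=\frac{1}{2} \neq \frac{1}{8}=\lambda^T\b{G}_2\lambda, $$ so the quadratic form does depend on the chosen pseudo-inverse when $\lambda^T\b{b}$ is not estimable.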