The Least Squares Method
The least squares method finds the best-fitting line through data by minimizing squared errors: the foundation of linear regression and machine learning. Instead of memorizing formulas, we'll discover the solution by working backwards from what we want to achieve.
Begin with the End in Mind: A Mathematical Approach
Stephen Covey wrote about "beginning with the end in mind" as a principle for life planning. The same idea applies to solving mathematical problems: start with what you want to achieve and ask "What conditions would make this true?"
For simple calculus problems, everyone does this automatically. To find the minimum of f(x), solve f'(x) = 0. Easy.
But when problems get complex and no textbook covers your exact setup, we often forget this approach. Imagine designing a communication system to maximize signal-to-noise ratio with constraints on power and bandwidth. No formula exists. So you ask: "What would make SNR maximal? What must be true about my filter coefficients at that maximum?" Work through those conditions, and you've derived your answer.
The backward-working method is most powerful precisely when problems are unfamiliar. That's exactly what we'll do here: start with what we want (minimize the error) and work backwards to discover the normal equations.
Why Do We Need This?
Imagine measuring temperature vs. ice cream sales. You have 100 data points but only 2 parameters to estimate: slope and intercept. That means 100 equations and just 2 unknowns. No single line passes through all points perfectly.
Least squares finds the line that gets as close as possible to all points by minimizing the total distance.
A Note on Norms: Why Our Choice Matters
Before we dive into the mathematics, we need to address something fundamental: what does "minimize the distance" actually mean?
Distance can be measured in different ways. Each choice leads to a completely different problem with different solutions:
\(\|\mathbf{r}\|_1\) (L1 norm) = \(|r_1| + |r_2| + \ldots + |r_n|\) Sum of absolute errors. Leads to linear programming. Solutions tend to be sparse (many zeros). Robust to outliers. Used in LASSO regression and compressed sensing.
\(\|\mathbf{r}\|_2\) (L2 norm) = \(\sqrt{r_1^2 + r_2^2 + \ldots + r_n^2}\) Euclidean distance. Leads to least squares. Solutions are smooth and unique. Has closed-form solution. Optimal when errors are Gaussian. This is what we'll use.
\(\|\mathbf{r}\|_\infty\) (L∞ norm) = \(\max(|r_1|, |r_2|, \ldots, |r_n|)\) Maximum absolute error. Leads to Chebyshev approximation. Minimizes worst-case error. Used when you need to bound peak deviations.
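To see the difference numerically, here is a tiny NumPy sketch with a made-up residual vector:

```python
import numpy as np

# A made-up residual vector, just to compare the three norms
r = np.array([0.5, -2.0, 0.1, 3.0])

print(np.linalg.norm(r, 1))        # L1:  |0.5| + |-2.0| + |0.1| + |3.0| = 5.6
print(np.linalg.norm(r, 2))        # L2:  sqrt(0.25 + 4 + 0.01 + 9) ≈ 3.64
print(np.linalg.norm(r, np.inf))   # L∞:  max absolute entry = 3.0
```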
We choose the 2-norm because:
- Itâs differentiable everywhere (enables calculus-based optimization)
- Leads to a linear system with closed-form solution
- Has strong statistical justification (maximum likelihood under Gaussian noise)
- Results in orthogonal projection (elegant geometry)
Everything that follows depends on this choice. With L1, you'd need linear programming. With L∞, you'd use minimax optimization. The 2-norm gives us the cleanest path.
The Problem Setup
Now letâs translate our intuition into mathematics. We have:
- A matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) representing our system
- A measured vector \(\mathbf{b} \in \mathbb{R}^m\) containing our observations
- \(m\) = number of measurements (equations)
- \(n\) = number of parameters to estimate (unknowns)
- The key: \(m > n\) (more equations than unknowns)
Think of \(\mathbf{A}\) as encoding the relationship between parameters and measurements. Each row of \(\mathbf{A}\) corresponds to one measurement. The vector \(\mathbf{b}\) holds what we actually observed.
Our goal: find \(\mathbf{x}^*\) such that:
\[\| \mathbf{A}\mathbf{x}^* - \mathbf{b} \|_2 = \min_{\mathbf{x}\in\mathbb{R}^n} \| \mathbf{A}\mathbf{x} - \mathbf{b}\|_2 \tag{1}\]The vector \(\mathbf{r} = \mathbf{A}\mathbf{x} - \mathbf{b}\) is the residual: how far off our approximation is from reality. We want to make this as small as possible.
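To make the setup concrete, here is a small NumPy sketch with invented temperature and sales numbers (the ice cream example from above): each row of \(\mathbf{A}\) contains one temperature and a 1 for the intercept, and \(\mathbf{b}\) holds the observed sales.

```python
import numpy as np

# Hypothetical data: temperatures (°C) and ice cream sales
temp  = np.array([18.0, 22.0, 25.0, 30.0, 33.0])
sales = np.array([120., 155., 180., 240., 265.])

# Model: sales ≈ slope * temp + intercept, so x = (slope, intercept)
A = np.column_stack([temp, np.ones_like(temp)])   # shape (m, n) = (5, 2)
b = sales                                         # shape (m,)

# The residual for an arbitrary guess of x; least squares minimizes its 2-norm
x_guess = np.array([8.0, -20.0])
r = A @ x_guess - b
print(np.linalg.norm(r, 2))
```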
Deriving the Normal Equations
Here's where the backward-working method pays off. We want to minimize equation (1). What condition must hold at the minimum?
The gradient must equal zero. That single insight unlocks everything.
Let's work through it. Minimizing the 2-norm is the same as minimizing its square (since the square root is monotonic):
\[f(\mathbf{x}) = \| \mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 = (\mathbf{A}\mathbf{x} - \mathbf{b})^T(\mathbf{A}\mathbf{x} - \mathbf{b})\]Expanding this product:
\[f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} - 2\mathbf{b}^T\mathbf{A}\mathbf{x} + \mathbf{b}^T\mathbf{b}\]Now take the gradient with respect to x:
\[\nabla f(\mathbf{x}) = 2\mathbf{A}^T\mathbf{A}\mathbf{x} - 2\mathbf{A}^T\mathbf{b}\]Set it to zero:
\[2\mathbf{A}^T\mathbf{A}\mathbf{x} - 2\mathbf{A}^T\mathbf{b} = 0\]Divide by 2 and we get the normal equations:
\[\mathbf{A}^T\mathbf{A}\mathbf{x} = \mathbf{A}^T\mathbf{b} \tag{2}\]We discovered this by asking "what must be true at the minimum?" The residual \(\mathbf{r} = \mathbf{A}\mathbf{x} - \mathbf{b}\) is orthogonal to the column space of \(\mathbf{A}\) at the optimal solution. We're projecting \(\mathbf{b}\) onto the space spanned by \(\mathbf{A}\)'s columns.
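A quick numerical check (reusing the hypothetical temperature/sales data) confirms the orthogonality: solving the normal equations leaves a residual with \(\mathbf{A}^T\mathbf{r} \approx \mathbf{0}\).

```python
import numpy as np

# Same hypothetical temperature / sales data as before
temp  = np.array([18.0, 22.0, 25.0, 30.0, 33.0])
sales = np.array([120., 155., 180., 240., 265.])
A = np.column_stack([temp, np.ones_like(temp)])
b = sales

# Solve the normal equations A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(x_star)                 # fitted slope and intercept

# At the optimum, the residual is orthogonal to the column space of A
r = A @ x_star - b
print(A.T @ r)                # ≈ [0, 0] up to floating-point round-off
```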
When Does a Solution Exist?
Great, we have the normal equations. But can we always solve them? And if so, is the solution unique?
The good news: the normal equations always have at least one solution. But the character of that solution depends entirely on the rank of \(\mathbf{A}\).
Case 1: Full Rank (rank(A) = n)
When \(\mathbf{A}\) has full column rank, something nice happens. The matrix \(\mathbf{A}^T\mathbf{A}\) is:
- Symmetric (obviously: \((\mathbf{A}^T\mathbf{A})^T = \mathbf{A}^T\mathbf{A}\))
- Positive definite (since \(\mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} = \|\mathbf{A}\mathbf{x}\|^2 > 0\) for \(\mathbf{x} \neq 0\))
This means \((\mathbf{A}^T\mathbf{A})^{-1}\) exists and is unique. We can directly solve for \(\mathbf{x}\):
\[\mathbf{x}^* = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b} \tag{3}\]One matrix, one solution. Clean and simple.
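In practice you would typically solve equation (3) without forming the inverse explicitly. Because \(\mathbf{A}^T\mathbf{A}\) is symmetric positive definite here, one common way is a Cholesky factorization; a minimal sketch with a random full-rank matrix, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3))    # tall matrix, full column rank
b = rng.standard_normal(50)

AtA = A.T @ A
Atb = A.T @ b

# A^T A = L L^T with L lower triangular (exists because A^T A is SPD)
L = np.linalg.cholesky(AtA)
y = np.linalg.solve(L, Atb)     # solve L y = A^T b
x = np.linalg.solve(L.T, y)     # solve L^T x = y

print(np.allclose(x, np.linalg.lstsq(A, b, rcond=None)[0]))  # True
```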
Case 2: Rank Deficient (rank(A) < min(m,n))
Now things get interesting. When \(\mathbf{A}\) lacks full rank, \(\mathbf{A}^T\mathbf{A}\) becomes singular. It has no inverse. Multiple solutions exist that all minimize the residual equally well.
Which one do we choose? The convention is to pick the solution with smallest norm:
\[\mathbf{x}^* = \underset{\mathbf{x} \in X}{\text{argmin}} \|\mathbf{x}\|\]where \(X\) is the set of all solutions to the normal equations.
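A small sketch with a deliberately rank-deficient \(\mathbf{A}\) (one column duplicating the other, purely for illustration) shows the situation: \(\mathbf{A}^T\mathbf{A}\) is singular, yet np.linalg.lstsq still returns the minimum-norm minimizer.

```python
import numpy as np

# Deliberately rank-deficient: the second column duplicates the first
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
b = np.array([1.0, 2.0, 3.1])

print(np.linalg.matrix_rank(A))        # 1 < n = 2, so A^T A is singular
# np.linalg.solve(A.T @ A, A.T @ b)    # would fail: A^T A is not invertible

# np.linalg.lstsq returns the minimum-norm solution in the rank-deficient case
x, residual, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)        # both entries equal: the smallest-norm way to share the fit
```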
This is where we need a more sophisticated tool: the pseudoinverse.
The Pseudoinverse
Look at equation (3) again. That term \((\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\) shows up constantly in statistics, signal processing, and machine learning. We give it a special name: the pseudoinverse of \(\mathbf{A}\), written \(\mathbf{A}^\dagger\):
\[\mathbf{A}^\dagger = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\]Think of the pseudoinverse as "the closest thing to an inverse" for rectangular or singular matrices. Regular inverses only exist for square, full-rank matrices. The pseudoinverse extends this concept.
To be a proper pseudoinverse, \(\mathbf{A}^\dagger\) must satisfy four conditions (the Penrose conditions):
- \[\mathbf{A} \mathbf{A}^\dagger \mathbf{A} = \mathbf{A}\]
- \[\mathbf{A}^\dagger \mathbf{A} \mathbf{A}^\dagger = \mathbf{A}^\dagger\]
- \[(\mathbf{A} \mathbf{A}^\dagger)^H = \mathbf{A} \mathbf{A}^\dagger\]
- \[(\mathbf{A}^\dagger \mathbf{A})^H = \mathbf{A}^\dagger \mathbf{A}\]
These ensure \(\mathbf{A}^\dagger\) gives the minimum norm solution when multiple solutions exist. But how do we actually compute it when \(\mathbf{A}\) is rank deficient?
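As a quick numerical sanity check (using NumPy's built-in pinv and a randomly generated rank-deficient matrix, chosen just for illustration), all four conditions hold up to floating-point round-off:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
A[:, 2] = A[:, 0] + A[:, 1]          # force rank deficiency: rank(A) = 2
Ap = np.linalg.pinv(A)

# The four Penrose conditions, each verified up to floating-point round-off
print(np.allclose(A @ Ap @ A, A))        # A A+ A  = A
print(np.allclose(Ap @ A @ Ap, Ap))      # A+ A A+ = A+
print(np.allclose((A @ Ap).T, A @ Ap))   # A A+  is symmetric
print(np.allclose((Ap @ A).T, Ap @ A))   # A+ A  is symmetric
```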
Computing the Pseudoinverse via SVD
Enter the Singular Value Decomposition (SVD). It's one of the most powerful tools in numerical linear algebra, and it gives us a direct way to compute the pseudoinverse even when \(\mathbf{A}\) is rank deficient.
Here's the idea: every matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) can be decomposed as:
\[\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T\]where:
- \(\mathbf{U} \in \mathbb{R}^{m \times m}\) is orthogonal (its columns are orthonormal)
- \(\mathbf{V} \in \mathbb{R}^{n \times n}\) is orthogonal (its columns are orthonormal)
- \(\Sigma \in \mathbb{R}^{m \times n}\) is diagonal with structure:
\[\Sigma = \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix}\]where \(\Sigma_r = \text{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)\) with \(\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0\).
The values \(\sigma_i\) are the singular values. The number of non-zero singular values equals the rank of \(\mathbf{A}\). They tell you how much the matrix stretches space in each direction.
Now here's the elegant part. Using SVD, the pseudoinverse is simply:
\[\mathbf{A}^\dagger = \mathbf{V}\Sigma^\dagger\mathbf{U}^T\]where
\[\Sigma^\dagger = \begin{bmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{bmatrix}\]We invert only the non-zero singular values. The zeros stay zero. This elegantly handles rank deficiency: we can't divide by zero, so we don't.
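A minimal NumPy sketch of this construction, inverting only the singular values above a small relative tolerance (this mirrors what np.linalg.pinv does with its rcond cutoff):

```python
import numpy as np

def pinv_via_svd(A, rtol=1e-12):
    """Pseudoinverse via SVD: invert only singular values above a relative tolerance."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    keep = s > rtol * s[0]            # s[0] is the largest singular value
    s_inv[keep] = 1.0 / s[keep]       # the "zero" singular values stay zero
    return Vt.T @ np.diag(s_inv) @ U.T

# Rank-deficient example: both columns are identical
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])

print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))   # True
```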
Special case: When \(\mathbf{A}\) is square and full rank (\(\text{rank}(\mathbf{A}) = m = n\)), then \(\mathbf{A}^\dagger = \mathbf{A}^{-1}\). The pseudoinverse becomes the regular inverse.
The Condition Number: When Can You Trust Your Solution?
Here's an uncomfortable truth: just because you can compute a solution doesn't mean you should trust it.
Computing \(\mathbf{A}^T\mathbf{A}\) for the normal equations squares the condition number of \(\mathbf{A}\). This amplifies numerical errors. Small uncertainties in your measurements can explode into huge errors in your solution. SVD avoids forming \(\mathbf{A}^T\mathbf{A}\), which is why itâs often more reliable.
But what is this "condition number" anyway?
The condition number measures how sensitive your solution is to tiny changes in the input:
\[\kappa(\mathbf{A}) = \frac{\max_{\|\mathbf{x}\| = 1} \|\mathbf{A}\mathbf{x}\|}{\min_{\|\mathbf{x}\| = 1} \|\mathbf{A}\mathbf{x}\|}\]For the 2-norm, this has a beautiful interpretation using singular values:
\[\kappa_2(\mathbf{A}) = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_r}\]It's the ratio of the largest to smallest singular value. A large condition number means some directions get stretched way more than others. Your matrix is nearly singular in some direction.
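As a quick illustration (with a randomly generated, nearly rank-deficient matrix), the singular-value ratio matches NumPy's built-in np.linalg.cond, and forming \(\mathbf{A}^T\mathbf{A}\) squares it, as claimed above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
A[:, 2] = A[:, 0] + 1e-6 * rng.standard_normal(100)   # a nearly dependent column

s = np.linalg.svd(A, compute_uv=False)
print(s[0] / s[-1])            # sigma_max / sigma_min
print(np.linalg.cond(A))       # same number from the built-in routine
print(np.linalg.cond(A.T @ A)) # the square of the above: the normal-equations penalty
```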
What this means in practice:
- \(\kappa(\mathbf{A}) \approx 10^k\) → lose about k digits of accuracy in your answer
- \(\kappa(\mathbf{A}) < 100\) → well-conditioned, trust your solution
- \(\kappa(\mathbf{A}) > 10^{10}\) → ill-conditioned, be very suspicious
Imagine you're estimating channel coefficients in a communication system. Your measurement matrix has \(\kappa(\mathbf{A}) = 10^8\). Even if your sensor is accurate to 12 decimal places, your estimated coefficients are only good to about 4 decimal places. The remaining 8 digits? Amplified noise.
What do you do? Use truncated SVD: throw away the smallest singular values when computing \(\mathbf{A}^\dagger\). You're deliberately ignoring the directions where \(\mathbf{A}\) barely stretches space. Those directions just amplify noise anyway.
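Here is a minimal sketch of that idea (the relative cutoff 1e-3 and the data are arbitrary illustrative choices; in practice you would pick the cutoff based on your noise level):

```python
import numpy as np

def truncated_svd_solve(A, b, rel_cutoff=1e-3):
    """Least squares via truncated SVD: drop directions with tiny singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    keep = s > rel_cutoff * s[0]                  # keep only well-resolved directions
    return Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])

# Hypothetical ill-conditioned system with noisy measurements
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
A[:, 4] = A[:, 3] + 1e-8 * rng.standard_normal(200)   # two nearly dependent columns
x_true = np.array([1.0, -2.0, 0.5, 3.0, 3.0])
b = A @ x_true + 1e-3 * rng.standard_normal(200)

print(np.linalg.lstsq(A, b, rcond=None)[0])   # the weak direction amplifies the noise
print(truncated_svd_solve(A, b))              # small bias, far less noise amplification
```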
Putting It All Together
So which method should you use? It depends on your matrix and how much you trust your data.
If your matrix is well-conditioned and has full rank, use the normal equations:
\[\mathbf{x} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}\]This is fast to compute. Form \(\mathbf{A}^T\mathbf{A}\), invert it, done. But remember: you're squaring the condition number. Only do this when \(\kappa(\mathbf{A})\) is small.
If your matrix is rank deficient or poorly conditioned, use SVD-based pseudoinverse:
\[\mathbf{x} = \mathbf{A}^\dagger\mathbf{b} = \mathbf{V}\Sigma^\dagger\mathbf{U}^T\mathbf{b}\]This costs more computationally but is much more numerically stable. The SVD reveals the true rank and handles singularities gracefully.
If you suspect your measurements are noisy, use truncated SVD. Drop the smallest singular values when forming \(\mathbf{A}^\dagger\). You're trading some bias for dramatically reduced variance. The smallest singular values amplify noise anyway, so throwing them out often improves your actual solution quality even though mathematically it's "less optimal."
The least squares method underpins regression, machine learning, signal processing, and control systems. But the devil is in the numerical details. Understanding when to use which variant makes the difference between results you can trust and numerical nonsense.