The least squares method finds the best-fitting line through data by minimizing squared errors, the foundation of linear regression and a workhorse of machine learning. Instead of memorizing formulas, we'll discover the solution by working backwards from what we want to achieve.

Begin with the End in Mind: A Mathematical Approach

Stephen Covey wrote about “beginning with the end in mind” as a principle for life planning. The same idea applies to solving mathematical problems: start with what you want to achieve and ask “What conditions would make this true?”

For simple calculus problems, everyone does this automatically. To find the minimum of f(x), solve f’(x) = 0. Easy.

But when problems get complex, where no textbook has your exact setup, we often forget this approach. Imagine designing a communication system to maximize signal-to-noise ratio with constraints on power and bandwidth. No formula exists. So you ask: “What would make SNR maximal? What must be true about my filter coefficients at that maximum?” Work through those conditions, and you’ve derived your answer.

The backward-working method is most valuable precisely when problems are unfamiliar. That's exactly what we'll do here: start with what we want (minimize the error) and work backwards to discover the normal equations.

Why Do We Need This?

Imagine measuring temperature vs. ice cream sales. You have 100 data points, but your model has only 2 parameters: slope and intercept. That means 100 equations and only 2 unknowns. No single line passes through all points perfectly.

Least squares finds the line that gets as close as possible to all points by minimizing the total error, in a sense we'll make precise next.
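To make this concrete, here is a minimal sketch of exactly that situation in NumPy. The data is synthetic and the numbers are invented; `np.polyfit` solves the very least squares problem we are about to derive:

```python
import numpy as np

# Illustrative only: synthetic temperature (°C) and ice cream sales data.
rng = np.random.default_rng(0)
temperature = rng.uniform(15, 35, size=100)              # 100 measurements
sales = 3.2 * temperature + 10 + rng.normal(0, 5, 100)   # assumed "true" line + noise

# 100 equations, 2 unknowns: no line fits exactly, but one fits best.
slope, intercept = np.polyfit(temperature, sales, deg=1)
print(slope, intercept)   # close to 3.2 and 10, but not exact
```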

A Note on Norms: Why Our Choice Matters

Before we dive into the mathematics, we need to address something fundamental: what does “minimize the distance” actually mean?

Distance can be measured in different ways. Each choice leads to a completely different problem with different solutions:

  • \(\|\mathbf{r}\|_1 = |r_1| + |r_2| + \ldots + |r_n|\) (L1 norm): the sum of absolute errors. Leads to linear programming. Solutions tend to be sparse (many zeros) and robust to outliers. Used in LASSO regression and compressed sensing.
  • \(\|\mathbf{r}\|_2 = \sqrt{r_1^2 + r_2^2 + \ldots + r_n^2}\) (L2 norm): the Euclidean distance. Leads to least squares. Solutions are smooth and unique, with a closed-form expression. Optimal when errors are Gaussian. This is what we'll use.
  • \(\|\mathbf{r}\|_\infty = \max(|r_1|, |r_2|, \ldots, |r_n|)\) (L∞ norm): the maximum absolute error. Leads to Chebyshev approximation. Minimizes worst-case error. Used when you need to bound peak deviations.

We choose the 2-norm because:

  1. It’s differentiable everywhere (enables calculus-based optimization)
  2. Leads to a linear system with closed-form solution
  3. Has strong statistical justification (maximum likelihood under Gaussian noise)
  4. Results in orthogonal projection (elegant geometry)

Everything that follows depends on this choice. With L1, you’d need linear programming. With L∞, you’d use minimax optimization. The 2-norm gives us the cleanest path.
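For a concrete feel of how the three choices differ, here is a short NumPy sketch computing all three norms of the same residual vector (the numbers are arbitrary):

```python
import numpy as np

r = np.array([0.5, -1.0, 0.1, 3.0])   # arbitrary residual vector with one large entry

l1   = np.linalg.norm(r, 1)        # |0.5| + |-1.0| + |0.1| + |3.0| = 4.6
l2   = np.linalg.norm(r, 2)        # sqrt(0.25 + 1.0 + 0.01 + 9.0) ≈ 3.20
linf = np.linalg.norm(r, np.inf)   # largest absolute entry = 3.0

print(l1, l2, linf)   # note how the large entry 3.0 dominates L2 and L∞
```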

The Problem Setup

Now let’s translate our intuition into mathematics. We have:

  • A matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) representing our system
  • A measured vector \(\mathbf{b} \in \mathbb{R}^m\) containing our observations
  • \(m\) = number of measurements (equations)
  • \(n\) = number of parameters to estimate (unknowns)
  • The key: \(m > n\) (more equations than unknowns)

Think of \(\mathbf{A}\) as encoding the relationship between parameters and measurements. Each row of \(\mathbf{A}\) corresponds to one measurement. The vector \(\mathbf{b}\) holds what we actually observed.

Our goal: find \(\mathbf{x}^*\) such that:

\[\| \mathbf{A}\mathbf{x}^* - \mathbf{b} \|_2 = \min_{\mathbf{x}\in\mathbb{R}^n} \| \mathbf{A}\mathbf{x} - \mathbf{b}\|_2 \tag{1}\]

The vector \(\mathbf{r} = \mathbf{A}\mathbf{x} - \mathbf{b}\) is the residual: how far off our approximation is from reality. We want to make this as small as possible.

Deriving the Normal Equations

Here’s where the backward-working method pays off. We want to minimize equation (1). What condition must hold at the minimum?

The gradient must equal zero. That single insight unlocks everything.

Let’s work through it. Minimizing the 2-norm is the same as minimizing its square (since square root is monotonic):

\[f(\mathbf{x}) = \| \mathbf{A}\mathbf{x} - \mathbf{b}\|_2^2 = (\mathbf{A}\mathbf{x} - \mathbf{b})^T(\mathbf{A}\mathbf{x} - \mathbf{b})\]

Expanding this product:

\[f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} - 2\mathbf{b}^T\mathbf{A}\mathbf{x} + \mathbf{b}^T\mathbf{b}\]

Now take the gradient with respect to x:

\[\nabla f(\mathbf{x}) = 2\mathbf{A}^T\mathbf{A}\mathbf{x} - 2\mathbf{A}^T\mathbf{b}\]

Set it to zero:

\[2\mathbf{A}^T\mathbf{A}\mathbf{x} - 2\mathbf{A}^T\mathbf{b} = 0\]

Divide by 2 and we get the normal equations:

\[\mathbf{A}^T\mathbf{A}\mathbf{x} = \mathbf{A}^T\mathbf{b} \tag{2}\]

We discovered this by asking "what must be true at the minimum?" Geometrically, equation (2) says \(\mathbf{A}^T\mathbf{r} = \mathbf{0}\): at the optimal solution, the residual \(\mathbf{r} = \mathbf{A}\mathbf{x} - \mathbf{b}\) is orthogonal to the column space of \(\mathbf{A}\). In other words, we're projecting \(\mathbf{b}\) onto the space spanned by \(\mathbf{A}\)'s columns.
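A quick numerical check of that orthogonality, using a small made-up system: solve the normal equations and confirm that \(\mathbf{A}^T\mathbf{r} \approx \mathbf{0}\).

```python
import numpy as np

# Small example system (m = 4 equations, n = 2 unknowns); values are arbitrary.
A = np.array([[1.0, 1.0],
              [2.0, 1.0],
              [3.0, 1.0],
              [4.0, 1.0]])
b = np.array([2.1, 3.9, 6.2, 7.8])

x = np.linalg.solve(A.T @ A, A.T @ b)   # solve the normal equations (2)
r = A @ x - b                           # residual at the optimum

print(A.T @ r)   # ≈ [0, 0]: the residual is orthogonal to the columns of A
```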

When Does a Solution Exist?

Great, we have the normal equations. But can we always solve them? And if so, is the solution unique?

The good news: the normal equations always have at least one solution. But the character of that solution depends entirely on the rank of \(\mathbf{A}\).

Case 1: Full Rank (rank(A) = n)

When \(\mathbf{A}\) has full column rank, something nice happens. The matrix \(\mathbf{A}^T\mathbf{A}\) is:

  • Symmetric (obviously: \((\mathbf{A}^T\mathbf{A})^T = \mathbf{A}^T\mathbf{A}\))
  • Positive definite (since \(\mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} = \|\mathbf{A}\mathbf{x}\|^2 > 0\) for \(\mathbf{x} \neq 0\))

This means \((\mathbf{A}^T\mathbf{A})^{-1}\) exists, so the normal equations have exactly one solution. We can solve for \(\mathbf{x}\) directly:

\[\mathbf{x}^* = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b} \tag{3}\]

One matrix, one solution. Clean and simple.
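In code, equation (3) is a one-liner. In practice, solving the linear system with `np.linalg.solve` is preferred over forming the explicit inverse, but the result is the same; the matrix and vector below are placeholders for any full-column-rank system.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # full column rank, m = 3 > n = 2
b = np.array([1.0, 2.0, 2.5])

x_inv   = np.linalg.inv(A.T @ A) @ A.T @ b    # literal equation (3)
x_solve = np.linalg.solve(A.T @ A, A.T @ b)   # numerically preferable

print(np.allclose(x_inv, x_solve))   # True
```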

Case 2: Rank Deficient (rank(A) < n)

Now things get interesting. When \(\mathbf{A}\) lacks full rank, \(\mathbf{A}^T\mathbf{A}\) becomes singular. It has no inverse. Multiple solutions exist that all minimize the residual equally well.

Which one do we choose? The convention is to pick the solution with smallest norm:

\[\mathbf{x}^* = \underset{\mathbf{x} \in X}{\text{argmin}} \|\mathbf{x}\|\]

where \(X\) is the set of all solutions to the normal equations.
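As a sketch of what this means numerically: for a rank-deficient \(\mathbf{A}\), infinitely many \(\mathbf{x}\) give the same residual, and `np.linalg.lstsq` (which works via an SVD internally) returns the one with the smallest norm. The matrix below is a made-up example with two identical columns.

```python
import numpy as np

# Rank-deficient: the two columns are identical, so rank(A) = 1 < n = 2.
A = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])

x_min, *_ = np.linalg.lstsq(A, b, rcond=None)
print(x_min)                           # ≈ [0.5, 0.5]: the minimum-norm solution
print(np.linalg.norm(A @ x_min - b))   # ≈ 0, but x = [1, 0] achieves this too
```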

This is where we need a more sophisticated tool: the pseudoinverse.

The Pseudoinverse

Look at equation (3) again. That term \((\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\) shows up constantly in statistics, signal processing, and machine learning. We give it a special name: the pseudoinverse of \(\mathbf{A}\), written \(\mathbf{A}^\dagger\). When \(\mathbf{A}\) has full column rank, it is simply:

\[\mathbf{A}^\dagger = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\]

Think of the pseudoinverse as “the closest thing to an inverse” for rectangular or singular matrices. Regular inverses only exist for square, full-rank matrices. The pseudoinverse extends this concept.

To be a proper pseudoinverse, \(\mathbf{A}^\dagger\) must satisfy four conditions (the Penrose conditions), stated here for real matrices (for complex matrices, replace \(^T\) with the conjugate transpose \(^H\)):

  1. \[\mathbf{A} \mathbf{A}^\dagger \mathbf{A} = \mathbf{A}\]
  2. \[\mathbf{A}^\dagger \mathbf{A} \mathbf{A}^\dagger = \mathbf{A}^\dagger\]
  3. \[(\mathbf{A} \mathbf{A}^\dagger)^T = \mathbf{A} \mathbf{A}^\dagger\]
  4. \[(\mathbf{A}^\dagger \mathbf{A})^T = \mathbf{A}^\dagger \mathbf{A}\]

These ensure \(\mathbf{A}^\dagger\) gives the minimum norm solution when multiple solutions exist. But how do we actually compute it when \(\mathbf{A}\) is rank deficient?
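As a quick numerical sanity check, here is a sketch verifying the four conditions with NumPy's built-in `np.linalg.pinv` (which, as the next section shows, computes the pseudoinverse via the SVD), using an arbitrary rank-deficient matrix:

```python
import numpy as np

A = np.array([[1.0, 2.0, 1.0],
              [2.0, 4.0, 2.0],     # row 2 = 2 × row 1 → rank deficient
              [0.0, 1.0, 1.0]])
A_dag = np.linalg.pinv(A)

print(np.allclose(A @ A_dag @ A, A))             # condition 1
print(np.allclose(A_dag @ A @ A_dag, A_dag))     # condition 2
print(np.allclose((A @ A_dag).T, A @ A_dag))     # condition 3 (symmetry)
print(np.allclose((A_dag @ A).T, A_dag @ A))     # condition 4 (symmetry)
```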

Computing the Pseudoinverse via SVD

Enter the Singular Value Decomposition (SVD). It’s one of the most powerful tools in numerical linear algebra, and it gives us a direct way to compute the pseudoinverse even when \(\mathbf{A}\) is rank deficient.

Here’s the idea: every matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) can be decomposed as:

\[\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T\]

where:

  • \(\mathbf{U} \in \mathbb{R}^{m \times m}\) is orthogonal (its columns are orthonormal)
  • \(\mathbf{V} \in \mathbb{R}^{n \times n}\) is orthogonal (its columns are orthonormal)
  • \(\Sigma \in \mathbb{R}^{m \times n}\) is diagonal with structure:
\[\Sigma = \begin{bmatrix} \Sigma_r & 0 \\ 0 & 0 \end{bmatrix}\]

where \(\Sigma_r = \text{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)\) with \(\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0\).

The values \(\sigma_i\) are the singular values. The number of non-zero singular values equals the rank of \(\mathbf{A}\). They tell you how much the matrix stretches space in each direction.

Now here’s the elegant part. Using SVD, the pseudoinverse is simply:

\[\mathbf{A}^\dagger = \mathbf{V}\Sigma^\dagger\mathbf{U}^T\]

where

\[\Sigma^\dagger = \begin{bmatrix} \Sigma_r^{-1} & 0 \\ 0 & 0 \end{bmatrix}\]

We invert only the non-zero singular values. The zeros stay zero. This elegantly handles rank deficiency: we can’t divide by zero, so we don’t.

Special case: When \(\mathbf{A}\) is square and full rank (\(\text{rank}(\mathbf{A}) = m = n\)), then \(\mathbf{A}^\dagger = \mathbf{A}^{-1}\). The pseudoinverse becomes the regular inverse.
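Here is a sketch of this recipe, building \(\mathbf{A}^\dagger\) directly from the SVD and comparing it with NumPy's built-in `pinv` (which does essentially the same thing, using a tolerance to decide which singular values count as zero). The matrix and tolerance are illustrative choices:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],          # second column = 2 × first → rank 1
              [3.0, 6.0]])

U, s, Vt = np.linalg.svd(A)        # full SVD: U is 3×3, s has 2 entries, Vt is 2×2

tol = max(A.shape) * np.finfo(float).eps * s.max()
s_inv = np.where(s > tol, 1.0 / s, 0.0)   # invert only the non-zero singular values

Sigma_dag = np.zeros((A.shape[1], A.shape[0]))   # n × m
np.fill_diagonal(Sigma_dag, s_inv)

A_dag = Vt.T @ Sigma_dag @ U.T
print(np.allclose(A_dag, np.linalg.pinv(A)))     # should print True
```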

The Condition Number: When Can You Trust Your Solution?

Here’s an uncomfortable truth: just because you can compute a solution doesn’t mean you should trust it.

Computing \(\mathbf{A}^T\mathbf{A}\) for the normal equations squares the condition number of \(\mathbf{A}\). This amplifies numerical errors. Small uncertainties in your measurements can explode into huge errors in your solution. SVD avoids forming \(\mathbf{A}^T\mathbf{A}\), which is why it’s often more reliable.

But what is this “condition number” anyway?

The condition number measures how sensitive your solution is to tiny changes in the input:

\[\kappa(\mathbf{A}) = \frac{\max_{\|\mathbf{x}\| = 1} \|\mathbf{A}\mathbf{x}\|}{\min_{\|\mathbf{x}\| = 1} \|\mathbf{A}\mathbf{x}\|}\]

For the 2-norm, this has a beautiful interpretation using singular values:

\[\kappa_2(\mathbf{A}) = \frac{\sigma_{\max}}{\sigma_{\min}} = \frac{\sigma_1}{\sigma_r}\]

It’s the ratio of the largest to smallest singular value. A large condition number means some directions get stretched way more than others. Your matrix is nearly singular in some direction.

What this means in practice:

  • \(\kappa(\mathbf{A}) \approx 10^k\) → lose about k digits of accuracy in your answer
  • \(\kappa(\mathbf{A}) < 100\) → well-conditioned, trust your solution
  • \(\kappa(\mathbf{A}) > 10^{10}\) → ill-conditioned, be very suspicious

Imagine you’re estimating channel coefficients in a communication system. Your measurement matrix has \(\kappa(\mathbf{A}) = 10^8\). Even if your sensor is accurate to 12 decimal places, your estimated coefficients are only good to about 4 decimal places. The remaining 8 digits? Amplified noise.

What do you do? Use truncated SVD: throw away the smallest singular values when computing \(\mathbf{A}^\dagger\). You’re deliberately ignoring the directions where \(\mathbf{A}\) barely stretches space. Those directions just amplify noise anyway.
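A sketch of both ideas: compute \(\kappa_2\) from the singular values of an ill-conditioned (made-up) matrix, then solve with a truncated SVD that drops singular values below a chosen threshold. The cutoff `1e-6` here is an arbitrary illustration, not a universal rule.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 3))
A[:, 2] = A[:, 0] + 1e-8 * rng.normal(size=50)   # third column nearly equals the first
b = rng.normal(size=50)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("condition number:", s[0] / s[-1])         # huge: A is nearly rank deficient

# Truncated SVD: keep only singular values above a threshold (arbitrary: 1e-6 * s[0]).
keep = s > 1e-6 * s[0]
x_trunc = Vt[keep].T @ ((U[:, keep].T @ b) / s[keep])
print("truncated solution:", x_trunc)
```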

Putting It All Together

So which method should you use? It depends on your matrix and how much you trust your data.

If your matrix is well-conditioned and has full rank, use the normal equations:

\[\mathbf{x} = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b}\]

This is fast to compute. Form \(\mathbf{A}^T\mathbf{A}\), invert it, done. But remember: you’re squaring the condition number. Only do this when \(\kappa(\mathbf{A})\) is small.

If your matrix is rank deficient or poorly conditioned, use SVD-based pseudoinverse:

\[\mathbf{x} = \mathbf{A}^\dagger\mathbf{b} = \mathbf{V}\Sigma^\dagger\mathbf{U}^T\mathbf{b}\]

This costs more computationally but is much more numerically stable. The SVD reveals the true rank and handles singularities gracefully.

If you suspect your measurements are noisy, use truncated SVD. Drop the smallest singular values when forming \(\mathbf{A}^\dagger\). You’re trading some bias for dramatically reduced variance. The smallest singular values amplify noise anyway, so throwing them out often improves your actual solution quality even though mathematically it’s “less optimal.”
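As a rough decision sketch (the condition-number threshold and helper name are illustrative, not canonical), the choice can be wrapped in a few lines:

```python
import numpy as np

def solve_least_squares(A, b, cond_limit=1e6):
    """Illustrative dispatcher: normal equations for benign problems, SVD otherwise."""
    s = np.linalg.svd(A, compute_uv=False)
    rank = np.sum(s > max(A.shape) * np.finfo(float).eps * s[0])
    cond = s[0] / s[-1] if s[-1] > 0 else np.inf

    if rank == A.shape[1] and cond < cond_limit:
        return np.linalg.solve(A.T @ A, A.T @ b)   # fast path: normal equations
    return np.linalg.pinv(A) @ b                   # robust path: SVD pseudoinverse
```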

The least squares method underpins regression, machine learning, signal processing, and control systems. But the devil is in the numerical details. Understanding when to use which variant makes the difference between results you can trust and numerical nonsense.
