Theoretical Deep Dive into Linear Regression | by Dr. Robert Kübler | Jun, 2023

You can use any other prior distribution for your parameters to create more interesting regularizations. You can even say that your parameters w are normally distributed but correlated with some correlation matrix Σ.

Let us assume that Σ is positive-definite, i.e. we are in the non-degenerate case. Otherwise, there is no density p(w).

If you do the math, you will find out that we then have to optimize

Image by the author.

for some matrix Γ. Note: Γ is invertible and we have Σ⁻¹ = ΓᵀΓ. This is also called Tikhonov regularization.

Hint: start with the fact that

Image by the author.

and remember that positive-definite matrices can be decomposed into a product of some invertible matrix and its transpose.

Great, so we defined our model and know what we want to optimize. But how can we optimize it, i.e. learn the best parameters that minimize the loss function? And when is there a unique solution? Let’s find out.

Ordinary Least Squares

Let us assume that we do not regularize and don’t use sample weights. Then, the MSE can be written as

Image by the author.

This is quite abstract, so let us write it differently as

Image by the author.

Using matrix calculus, you can take the derivative of this function with respect to w (we assume that the bias term b is included there).

Image by the author.

If you set this gradient to zero, you end up with

Image by the author.

If the (n × k)-matrix X has a rank of k, so does the (k × k)-matrix XX, i.e. it is invertible. Why? It follows from rank(X) = rank(XX).

In this case, we get the unique solution

Image by the author.

Note: Software packages do not optimize like this but instead use gradient descent or other iterative techniques because it is faster. Still, the formula is nice and gives us some high-level insights about the problem.

But is this really a minimum? We can find out by computing the Hessian, which is XX. The matrix is positive-semidefinite since wXXw = |Xw|² ≥ 0 for any w. It is even strictly positive-definite since XX is invertible, i.e. 0 is not an eigenvector, so our optimal w is indeed minimizing our problem.

Perfect Multicollinearity

That was the friendly case. But what happens if X has a rank smaller than k? This might happen if we have two features in our dataset where one is a multiple of the other, e.g. we use the features height (in m) and height (in cm) in our dataset. Then we have height (in cm) = 100 * height (in m).

It can also happen if we one-hot encode categorical data and do not drop one of the columns. For example, if we have a feature color in our dataset that can be red, green, or blue, then we can one-hot encode and end up with three columns color_red, color_green, and color_blue. For these features, we have color_red + color_green + color_blue = 1, which induces perfect multicollinearity as well.

In these cases, the rank of XX is also smaller than k, so this matrix is not invertible.

End of story.

Or not? Actually, no, because it can mean two things: (XX)w = Xy has

  1. no solution or
  2. infinitely many solutions.

It turns out that in our case, we can obtain one solution using the Moore-Penrose inverse. This means that we are in the case of infinitely many solutions, all of them giving us the same (training) mean squared error loss.

If we denote the Moore-Penrose inverse of A by A⁺, we can solve the linear system of equations as

Image by the author.

To get the other infinitely many solutions, just add the null space of XX to this specific solution.

Minimization With Tikhonov Regularization

Recall that we could add a prior distribution to our weights. We then had to minimize

Image by the author.

for some invertible matrix Γ. Following the same steps as in ordinary least squares, i.e. taking the derivative with respect to w and setting the result to zero, the solution is

Image by the author.

The neat part:

XᵀX + ΓᵀΓ is always invertible!

Let us find out why. It suffices to show that the null space of XX + ΓᵀΓ is only {0}. So, let us take a w with (XX + ΓᵀΓ)w = 0. Now, our goal is to show that w = 0.

From (XX + ΓᵀΓ)w = 0 it follows that

Image by the author.

which in turn implies |Γw| = 0 → Γw = 0. Since Γ is invertible, w has to be 0. Using the same calculation, we can see that the Hessian is also positive-definite.

Source link

Leave a Comment