You can use any other prior distribution for your parameters to create more interesting regularizations. You can even say that your parameters *w* are normally distributed but **correlated** with some correlation matrix Σ*.*

Let us assume that Σ ispositive-definite, i.e. we are in the non-degenerate case. Otherwise, there is no densityp(w).

If you do the math, you will find out that we then have to optimize

for some matrix Γ. **Note: Γ is invertible and we have Σ⁻¹ = ΓᵀΓ. **This is also called **Tikhonov regularization**.

**Hint:** start with the fact that

and remember that positive-definite matrices can be decomposed into a product of some invertible matrix and its transpose.

Great, so we defined our model and know what we want to optimize. But how can we optimize it, i.e. learn the best parameters that minimize the loss function? And when is there a unique solution? Let’s find out.

## Ordinary Least Squares

Let us assume that we do not regularize and don’t use sample weights. Then, the MSE can be written as

This is quite abstract, so let us write it differently as

Using matrix calculus, you can take the derivative of this function with respect to *w *(we assume that the bias term *b* is included there)*.*

If you set this gradient to zero, you end up with

If the (*n *× *k*)-matrix *X* has a rank of *k*, so does the (*k *× *k*)-matrix *X*ᵀ*X, *i.e. it is invertible*. Why? *It follows from rank(*X*)* = *rank(*X*ᵀ*X*)*.*

In this case, we get the **unique solution**

Note:Software packages do not optimize like this but instead use gradient descent or other iterative techniques because it is faster. Still, the formula is nice and gives us some high-level insights about the problem.

But is this really a minimum? We can find out by computing the Hessian, which is *X*ᵀ*X. *The matrix is positive-semidefinite since *w*ᵀ*X*ᵀ*Xw = |Xw|² *≥ 0 for any *w*. It is even **strictly** positive-definite since *X*ᵀ*X* is invertible, i.e. 0 is not an eigenvector, so our optimal *w* is indeed minimizing our problem.

## Perfect Multicollinearity

That was the friendly case. But what happens if *X* has a rank smaller than *k*? This might happen if we have two features in our dataset where one is a multiple of the other, e.g. we use the features *height (in m)* and *height (in cm)* in our dataset. Then we have *height (in cm) = 100 * height (in m).*

It can also happen if we one-hot encode categorical data and do not drop one of the columns. For example, if we have a feature *color* in our dataset that can be red, green, or blue, then we can one-hot encode and end up with three columns *color_red, color_green,* and *color_blue*. For these features, we have *color_red + color_green + color_blue = *1, which induces perfect multicollinearity as well.

In these cases, the rank of *X*ᵀ*X *is also smaller than *k*, so this matrix is not invertible.

End of story.

Or not? Actually, no, because it can mean two things: (*X*ᵀ*X*)*w = X*ᵀ*y *has

- no solution or
- infinitely many solutions.

It turns out that in our case, we can obtain one solution using the Moore-Penrose inverse. This means that we are in the case of infinitely many solutions, all of them giving us the same (training) mean squared error loss.

If we denote the Moore-Penrose inverse of *A* by *A*⁺, we can solve the linear system of equations as

To get the other infinitely many solutions, just add the null space of *X*ᵀ*X *to this specific solution.

## Minimization With Tikhonov Regularization

Recall that we could add a prior distribution to our weights. We then had to minimize

for some invertible matrix Γ. Following the same steps as in ordinary least squares, i.e. taking the derivative with respect to *w* and setting the result to zero, the solution is

The neat part:

XᵀX + ΓᵀΓ is always invertible!

Let us find out why. It suffices to show that the null space of *X*ᵀ*X* + ΓᵀΓ is only {0}. So, let us take a *w *with (*X*ᵀ*X* + ΓᵀΓ)*w* = 0. Now, our goal is to show that *w *= 0.

From (*X*ᵀ*X* + ΓᵀΓ)*w* = 0 it follows that

which in turn implies |Γ*w*| = 0 → Γ*w = *0*. *Since* *Γ is invertible, *w* has to be 0. Using the same calculation, we can see that the Hessian is also positive-definite.