Welcome back to the fourth post of ‘Courage to Learn ML: Unraveling L1 & L2 Regularization.’ Last time, our mentor-learner pair explored the properties of L1 and L2 regularization through the lens of Lagrange multipliers.
In this concluding segment on L1 and L2 regularization, the duo will look at these topics from a fresh angle: Bayesian priors. We’ll also summarize how L1 and L2 regularization are applied across different algorithms.
In this article, we’ll address several intriguing questions. If any of these topics spark your curiosity, you’ve come to the right place!
- How MAP priors relate to L1 and L2 regularizations
- An intuitive breakdown of using Laplace and normal distributions as priors
- Understanding the sparsity induced by L1 regularization with a Laplace prior
- Algorithms that are compatible with L1 and L2 regularization
- Why L2 regularization is often referred to as ‘weight decay’ in neural network training
- The reasons behind the less frequent use of L1 norm in neural networks
Let’s dive into how different priors in the MAP formula shape our approach to L1 and L2 regularization (for a detailed walkthrough on formulating this equation, check out this post).
When considering priors for model weights, our initial intuition often leads us to choose a normal distribution. Typically, we use a zero-mean normal distribution for each weight w_i, with all weights sharing the same standard deviation 𝜎. Plugging this belief into the prior term log p(w) in MAP (where p(w) is the prior over the weights) leads naturally to a sum of squared weights, because the log of each normal density is −w_i²/(2𝜎²) plus a constant that does not depend on the weights. This term is precisely the L2…
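To make the step from a normal prior to a sum of squared weights concrete, here is a minimal Python sketch. It assumes independent zero-mean normal priors with a shared standard deviation, as described above; the function names and the example weights are hypothetical and chosen only for illustration.

```python
import math

def gaussian_log_prior(weights, sigma):
    """Log density of independent N(0, sigma^2) priors, summed over all weights."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - w ** 2 / (2 * sigma ** 2)
        for w in weights
    )

def l2_penalty(weights, sigma):
    """Sum of squared weights, scaled by 1 / (2 * sigma^2)."""
    return sum(w ** 2 for w in weights) / (2 * sigma ** 2)

weights = [0.5, -1.2, 3.0]  # illustrative values, not from the article
sigma = 2.0                 # assumed shared standard deviation

# The part of the log prior that does not depend on the weights.
constant = -len(weights) * 0.5 * math.log(2 * math.pi * sigma ** 2)

# Negative log prior = scaled sum of squares minus a weight-independent constant.
neg_log_prior = -gaussian_log_prior(weights, sigma)
reconstructed = l2_penalty(weights, sigma) - constant

print(math.isclose(neg_log_prior, reconstructed))
```

Because the constant does not change with the weights, it has no effect on which weights maximize the posterior, so only the squared-weight term matters during optimization. A smaller 𝜎 (a tighter prior around zero) gives a larger factor 1/(2𝜎²), which corresponds to a stronger penalty.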