One of my references in the Data Science field is Julia Silge. In her Tidy Tuesday videos she always codes along while teaching a given technique, helping other analysts upskill and add it to their repertoire.
Last Tuesday, the topic was Empirical Bayes (her blog post), which caught my attention.
But what is that?
Empirical Bayes is a statistical method used when we work with ratios like [success]/[total tries]. With such variables, we often face cases like 1/2, which translates to a 50% success rate, or 3/4 (75%), or 0/1 (0%).
Those extreme percentages do not represent the long-term reality: there were so few tries that it is very hard to tell whether there is a trend, and most of the time these cases are simply ignored or deleted. It takes more tries to tell what the real success rate is, like 30/60, 500/1000, or whatever makes sense for a business.
Using Empirical Bayes, though, we can use the distribution of the data itself to estimate what each observation's ratio would look like at earlier or later stages, as we will see next in this post.
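To make the idea concrete, here is a minimal sketch of the shrinkage that Empirical Bayes typically produces, assuming the usual Beta prior over success rates. The prior parameters `PRIOR_ALPHA` and `PRIOR_BETA` and the observations are made up for illustration; in a real analysis the prior would be fitted from the data.

```python
# Hypothetical Beta prior (assumption for illustration only).
# Its mean is 4 / (4 + 6) = 0.40.
PRIOR_ALPHA = 4.0
PRIOR_BETA = 6.0


def raw_rate(successes, tries):
    """Plain success ratio, unreliable when tries is small."""
    return successes / tries


def eb_estimate(successes, tries, alpha=PRIOR_ALPHA, beta=PRIOR_BETA):
    """Posterior mean of the success rate under a Beta(alpha, beta) prior."""
    return (successes + alpha) / (tries + alpha + beta)


# Made-up (successes, tries) pairs, from very few tries to many.
observations = [(1, 2), (3, 4), (0, 1), (30, 60)]

for s, n in observations:
    print(f"{s}/{n}: raw={raw_rate(s, n):.3f}  shrunk={eb_estimate(s, n):.3f}")
```

Observations with few tries are pulled strongly toward the prior mean, while 30/60 barely moves, because the data outweigh the prior as the number of tries grows.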
Let’s jump to the analysis. The steps to follow are:
- Load the data
- Define success and calculate the success ratio
- Determine the distribution’s parameters
- Calculate Bayes estimates
- Calculate the Credible Interval
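As a rough preview, the steps above can be sketched end to end with only the standard library. This is an illustrative sketch, not the method used later in the post: the data are made up, the minimum of 20 tries for fitting the prior is an arbitrary assumption, the Beta parameters are estimated with the method of moments, and the credible interval is approximated by sampling (an exact quantile function such as `scipy.stats.beta.ppf` would be the more usual choice).

```python
import random
from statistics import mean, pvariance


def fit_beta_moments(rates):
    """Method-of-moments estimates of Beta(alpha, beta) from observed rates."""
    m = mean(rates)
    v = pvariance(rates)
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def credible_interval(alpha, beta, level=0.95, draws=20_000, seed=0):
    """Approximate central credible interval by sampling from Beta(alpha, beta)."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    lo_idx = int((1 - level) / 2 * draws)
    hi_idx = int((1 + level) / 2 * draws) - 1
    return samples[lo_idx], samples[hi_idx]


# 1. Load the data: made-up (successes, tries) pairs.
data = [(12, 40), (30, 60), (45, 100), (8, 25), (55, 90), (20, 50), (1, 2), (0, 1)]

# 2. Define success and calculate the success ratio.
#    Only observations with enough tries are used to fit the prior
#    (the threshold of 20 is a hypothetical choice).
MIN_TRIES = 20
reliable_rates = [s / n for s, n in data if n >= MIN_TRIES]

# 3. Determine the distribution's parameters.
alpha0, beta0 = fit_beta_moments(reliable_rates)

# 4 and 5. Calculate the Bayes estimate and the credible interval per observation.
results = []
for s, n in data:
    post_alpha, post_beta = alpha0 + s, beta0 + n - s
    estimate = post_alpha / (post_alpha + post_beta)
    low, high = credible_interval(post_alpha, post_beta)
    results.append((s, n, estimate, low, high))
    print(f"{s}/{n}: estimate={estimate:.3f}  95% interval=({low:.3f}, {high:.3f})")
```

The observations with very few tries, such as 0/1, get an estimate close to the prior mean and a wide interval, which reflects how little they tell us on their own.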
Let’s move on.
import pandas as pd
import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from distfit import distfit