The binomial distribution is a well-known distribution in and outside of data science. However, have you heard about its less popular cousin, the hypergeometric distribution? If not, this post will give you a detailed explanation of what it is and why it is useful for us data scientists.
The hypergeometric distribution measures the probability of k successes in n trials (a sample) drawn without replacement, given some information about the population. This is very similar to the binomial distribution, bar one key difference: sampling is done without replacement. As a result, the probability of each success (or outcome) changes with every draw/trial, whereas in the binomial distribution the probability of a success (and failure) is fixed.
An easy-to-understand example is determining the probability of drawing all 4 kings in a random sample of 20 cards from a standard deck of cards. If we draw a king, the probability of drawing the subsequent king will be different from the first as the population composition has changed. Thus, the probability of a success is dynamic.
The probability mass function (PMF) of the hypergeometric distribution looks like this:

$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

where:
- n is the number of trials
- k is the number of successes
- N is the population size
- K is the total number of successes in the population
- X is a random variable from the hypergeometric distribution
The interested reader can find a derivation of the PMF here.
The bracket-like notation refers to the binomial coefficient:

$$\binom{n}{k} = \frac{n!}{k!(n-k)!}$$
The factorials indicate we are dealing with combinations and permutations. You can read more about them in my previous blog here:
The mean of the distribution is given by:

$$E[X] = \frac{nK}{N}$$
Let’s go back to our previous example of drawing 4 kings in a random 20 card sample from a regular deck of cards. The information we have is:
- N = 52: Number of cards in the deck
- n = 20: Number of cards we sample
- k = 4: Number of kings we want (successes)
- K = 4: Number of kings in the deck
Plugging these numbers into the PMF:

$$P(X = 4) = \frac{\binom{4}{4}\binom{48}{16}}{\binom{52}{20}} \approx 0.0179$$
Therefore, the probability is very low. This makes sense: the probability that any single drawn card is a king is 4/52 = 1/13 ≈ 0.077, so requiring all four kings to land in one 20-card sample is rarer still, as we have shown above.
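We can sanity-check this value with a quick sketch using Python's standard library, where `math.comb` computes the binomial coefficient:

```python
from math import comb

# N: deck size, K: kings in deck, n: sample size, k: kings we want
N, K, n, k = 52, 4, 20, 4

# Hypergeometric PMF: C(K, k) * C(N-K, n-k) / C(N, n)
p = comb(K, k) * comb(N - K, n - k) / comb(N, n)
print(round(p, 4))  # 0.0179
```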
If you want to play around with some numbers and different scenarios, I have linked here a hypergeometric distribution calculator.
The above example is a useful demonstration of the application of the hypergeometric distribution. However, we can get a fuller picture by plotting the PMF as a function of the number of successes k.
Below is a plot, in Python, for our above example where we vary the number of kings, k, we desire:
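A sketch of how such a plot can be generated with `scipy.stats.hypergeom` and `matplotlib` (note SciPy's parameter convention: `M` is the population size, `n` the number of successes in the population, and `N` the sample size):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import hypergeom

# Deck of 52 cards, 4 kings, sample of 20 cards
rv = hypergeom(M=52, n=4, N=20)

k = np.arange(0, 6)  # number of kings we desire: 0..5
pmf = rv.pmf(k)

plt.bar(k, pmf)
plt.xlabel("Number of kings, k")
plt.ylabel("Probability")
plt.title("Hypergeometric PMF: kings in a 20-card sample")
plt.show()
```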
As we can see, the probability of getting 5 kings from the 20-card sample is 0, as there aren’t five kings in the deck! The most likely number of kings we will get is 1.
Let’s now consider a new problem. What is the hypergeometric distribution of the number of spade-suited cards in a random 30-card sample?
The most likely number of spades we will get in the 30-card sample is 8 (close to the mean, nK/N = 30 × 13/52 = 7.5). It is also virtually impossible for us to get no spades in the sample, as shown by the plot.
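The same sketch adapted for the spades example (13 spades in a 52-card deck, a 30-card sample):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import hypergeom

# Deck of 52 cards, 13 spades, sample of 30 cards
rv = hypergeom(M=52, n=13, N=30)

k = np.arange(0, 14)  # possible number of spades: 0..13
pmf = rv.pmf(k)

plt.bar(k, pmf)
plt.xlabel("Number of spades, k")
plt.ylabel("Probability")
plt.title("Hypergeometric PMF: spades in a 30-card sample")
plt.show()
```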
The hypergeometric distribution touches many fields including:
- Probability of winning a hand in poker
- Voting populations analysis
- Quality control in manufacturing
- Genetic variations within a population
Therefore, the hypergeometric distribution is something you will most likely come across in your data science career and is thus worth knowing about.
In this article, we have discussed the hypergeometric distribution. It is very similar to the binomial distribution, but the probability of success changes with each draw because we are sampling without replacement. This distribution is very useful within data science, with applications in quality control and the gambling industry. Therefore, it is well worth knowing as a data scientist.
The full code is available at my GitHub here: