Understanding The Hypergeometric Distribution | by Egor Howell | Jun, 2023


Breaking down one of the less well-known distributions in data science


The binomial distribution is well known both inside and outside data science. But have you heard of its less popular cousin, the hypergeometric distribution? If not, this post gives a detailed explanation of what it is and why it is useful for us data scientists.

The hypergeometric distribution gives the probability of k successes in a sample of n trials, drawn without replacement, from a population whose size and total number of successes are known. This is very similar to the binomial distribution, with one key difference: the sampling is done without replacement. As a result, the probability of a success changes with every draw, whereas in the binomial distribution the probability of a success (and of a failure) is fixed.

An easy-to-understand example is determining the probability of drawing all 4 kings in a random sample of 20 cards from a standard deck of cards. If we draw a king, the probability that the next card is also a king differs from the first draw (3/51 instead of 4/52) because the composition of the remaining deck has changed. Thus, the probability of a success is dynamic.
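To make the changing probability concrete, here is a small sketch using Python's standard-library `fractions` module. The helper name `king_probability` is made up for illustration and is not from the original post.

```python
from fractions import Fraction

def king_probability(kings_left: int, cards_left: int) -> Fraction:
    """Chance that the next card drawn is a king, given what remains in the deck."""
    return Fraction(kings_left, cards_left)

# First draw from a full deck: 4 kings among 52 cards.
first = king_probability(4, 52)

# Second draw after a king was removed: 3 kings among 51 cards.
second_after_king = king_probability(3, 51)

# Second draw after a non-king was removed: still 4 kings, but only 51 cards.
second_after_other = king_probability(4, 51)

print(first, second_after_king, second_after_other)  # 1/13 1/17 4/51
```

With replacement, all three values would stay at 1/13, which is the binomial setting.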

The probability mass function (PMF) of the hypergeometric distribution looks like this:

$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$$

Where:

  • n is the number of trials
  • k is the number of successes
  • N is the population size
  • K is the total number of successes in the population
  • X is a random variable from the hypergeometric distribution
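As a rough sketch, the PMF can be written directly with the standard library's `math.comb` (Python 3.8+): it counts the ways to pick k of the K successes and n - k of the N - K failures, divided by the ways to pick any n items from N. The function name `hypergeom_pmf` and the urn numbers in the usage line are illustrative assumptions, not taken from the original post.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k): k successes in n draws without replacement
    from a population of N items containing K successes."""
    # Outside the feasible range the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Hypothetical urn: 10 balls, 4 of them red, draw 3 without replacement.
print(hypergeom_pmf(k=2, N=10, K=4, n=3))  # 0.3
```

The range check matters because k can never exceed the sample size or the number of successes available in the population.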

The interested reader can find a derivation of the PMF here.

The bracket-like notation refers to the binomial coefficient:

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$$
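For readers who prefer code, a minimal sketch of the binomial coefficient built from factorials is shown here and checked against Python's built-in `math.comb`. The name `binomial_coefficient` is an illustrative choice.

```python
from math import comb, factorial

def binomial_coefficient(n: int, k: int) -> int:
    """Number of ways to choose k items from n, ignoring order."""
    if k < 0 or k > n:
        return 0
    # Integer division is exact here because the denominator always divides n!.
    return factorial(n) // (factorial(k) * factorial(n - k))

# Number of distinct 5-card hands from a 52-card deck.
print(binomial_coefficient(52, 5))  # 2598960
```

In practice `math.comb(52, 5)` gives the same value without computing the large intermediate factorials.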

The factorials indicate we are dealing with combinations and permutations. You can read more about them in my previous blog here:

The mean of the distribution is given by:

$$E[X] = n\,\frac{K}{N}$$
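One way to convince yourself of the mean, n multiplied by K/N, is to sum k times P(X = k) over every feasible k and compare it with the closed form. The sketch below does this with exact fractions for the card example (N = 52, K = 4, n = 20); the function name `mean_by_summation` is an assumption made for illustration.

```python
from fractions import Fraction
from math import comb

def mean_by_summation(N: int, K: int, n: int) -> Fraction:
    """Expected number of successes, computed as the sum of k * P(X = k)."""
    total = comb(N, n)
    k_min, k_max = max(0, n - (N - K)), min(n, K)
    return sum(
        Fraction(k * comb(K, k) * comb(N - K, n - k), total)
        for k in range(k_min, k_max + 1)
    )

by_sum = mean_by_summation(N=52, K=4, n=20)
closed_form = Fraction(20 * 4, 52)  # n * K / N
print(by_sum, closed_form)  # 20/13 20/13
```

So a 20-card sample contains about 1.54 kings on average.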

Let’s go back to our previous example of drawing 4 kings in a random 20 card sample from a regular deck of cards. The information we have is:

  • N = 52: Number of cards in the deck
  • n = 20: Number of cards we sample
  • k = 4: Number of kings we want (successes)
  • K = 4: Number of kings in the deck

Plugging these numbers into the PMF:

$$P(X = 4) = \frac{\binom{4}{4}\binom{48}{16}}{\binom{52}{20}} = \frac{57}{3185} \approx 0.0179$$

Therefore, the probability is very low, at just under 2%. This makes sense: any single card is a king with probability ~0.077 (1/13), so a 20-card sample contains only about 1.5 kings on average, and collecting all four of them is rare.
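The arithmetic for this example can be checked exactly with a few lines of standard-library Python. This is only a sketch: the variable names are illustrative, and the second expression is the same ratio after cancelling the common factorials.

```python
from fractions import Fraction
from math import comb

N, n, k, K = 52, 20, 4, 4  # deck size, sample size, kings wanted, kings in deck

# PMF written with binomial coefficients.
p_all_kings = Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))

# Equivalent product form after the factorials cancel.
p_product = Fraction(20 * 19 * 18 * 17, 52 * 51 * 50 * 49)

print(p_all_kings)                  # 57/3185
print(f"{float(p_all_kings):.4f}")  # 0.0179
```

Both forms agree, which is a useful sanity check when plugging numbers in by hand.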

If you want to play around with some numbers and different scenarios, I have linked here a hypergeometric distribution calculator.

The above example is a useful demonstration of the application of the hypergeometric distribution. However, we can get a fuller picture by plotting the PMF as a function of the number of successes k.

Below is a plot, in Python, for our above example where we vary the number of kings, k, we desire:

GitHub Gist by author.
Plot generated by author in Python.
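The original gist is not reproduced here, so the following is a minimal sketch of how such a plot could be produced, not the author's code. It assumes the third-party `matplotlib` package for the chart; the function names `pmf_table` and `plot_pmf` are made up for illustration.

```python
from math import comb

def pmf_table(N, K, n, k_values):
    """Map each k to P(X = k) for n draws without replacement
    from N items, K of which count as successes."""
    total = comb(N, n)
    # math.comb returns 0 when asked to choose more items than exist,
    # so impossible counts such as k = 5 kings come out as probability 0.
    return {k: comb(K, k) * comb(N - K, n - k) / total for k in k_values}

def plot_pmf(table, title):
    """Draw the PMF as a bar chart (needs the third-party matplotlib package)."""
    import matplotlib.pyplot as plt  # imported lazily so pmf_table works without it

    plt.bar(list(table.keys()), list(table.values()))
    plt.xticks(list(table.keys()))
    plt.xlabel("Number of kings, k")
    plt.ylabel("P(X = k)")
    plt.title(title)
    plt.show()

# Kings example: N = 52 cards, K = 4 kings, n = 20 cards sampled, k from 0 to 5.
kings = pmf_table(N=52, K=4, n=20, k_values=range(6))
for k, p in kings.items():
    print(f"k = {k}: {p:.4f}")
```

Calling `plot_pmf(kings, "Kings in a 20-card sample")` then draws the bar chart. If SciPy is available, `scipy.stats.hypergeom` offers the same PMF, though note that its parameter naming differs from the notation used in this post.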

As we can see, the probability of getting 5 kings from the 20-card sample is 0, as there aren’t five kings in the deck! The most likely number of kings we will get is 1.

Let’s now consider a new problem. How is the number of spades in a random 30-card sample distributed? Here N = 52, K = 13 (the number of spades in the deck) and n = 30.

GitHub Gist by author.
Plot generated by author in Python.
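As a quick stand-in for the plot, the sketch below computes the spades distribution with the standard library only and prints a rough text histogram, along with the most likely count and the chance of seeing no spades at all. The variable names and the bar scaling are illustrative assumptions.

```python
from math import comb

N, K, n = 52, 13, 30  # deck size, spades in the deck, cards sampled

# P(X = k) for every possible number of spades, k = 0..13.
pmf = {k: comb(K, k) * comb(N - K, n - k) / comb(N, n) for k in range(K + 1)}

# One '#' per percentage point, purely for a quick visual impression.
for k, p in pmf.items():
    print(f"{k:2d} {p:.4f} {'#' * round(p * 100)}")

mode = max(pmf, key=pmf.get)
print("Most likely number of spades:", mode)  # 8
print(f"P(no spades) = {pmf[0]:.2e}")
```

The mean here is 30 × 13/52 = 7.5, so the probabilities of 7 and 8 spades are close, with 8 slightly ahead.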

The most likely number of spades we will get is 8 in the 30-card sample. It is also virtually impossible for us to get no spades in the sample as shown by the plot.

The hypergeometric distribution touches many fields including:

  • Probability of winning a hand in poker
  • Voting populations analysis
  • Quality control in manufacturing
  • Genetic variations within a population

Therefore, the hypergeometric distribution is something you will most likely come across in your data science career, so it is worth knowing about.

In this article, we have discussed the hypergeometric distribution. It is very similar to the binomial distribution, but the probability of success changes from draw to draw because we are sampling without replacement. The distribution has applications in areas such as quality control and gambling, so it is well worth knowing as a data scientist.

The full code is available at my GitHub here:



