## p-value

Enter the infamous p-value. It’s a number that answers the question: what’s the probability of observing the chi-2 value we got, or an even more extreme one, given that the null hypothesis is true? Or, using some notation, the p-value represents the probability of observing the data assuming the null hypothesis is true: P(data|H₀). (To be precise, the p-value is defined as P(test_statistic(data) > T | H₀), where T is the value of the test statistic we actually observed.) Notice how this is different from what we are actually interested in, which is the probability that our hypothesis is true given the data we have observed: P(H₀|data).

**What the p-value represents:** P(data|H₀)

**What we usually want:** P(H₀|data)

Graphically speaking, the p-value is the area under the blue probability density curve to the right of the red line marking the observed statistic. The easiest way to compute it is to take one minus the cumulative distribution function evaluated at the observed value, that is, one minus the probability mass to its left.

```python
from scipy.stats import chi2

p_value = 1 - chi2.cdf(chisq, df=1)  # equivalently: chi2.sf(chisq, df=1)
```

This gives us 0.0396. If there was no data drift, we would get the test statistic we’ve got or an even larger one in roughly 4% of the cases. Not that seldom, after all. In most use cases, the p-value is conventionally compared to the significance level of 1% or 5%. If it’s lower than that, one rejects the null. Let’s be conservative and follow the 1% significance threshold. In our case with a p-value of almost 4%, there is not enough evidence to reject it. Hence, no data drift was detected.
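As a minimal sketch, the decision rule described above boils down to a single comparison (the numbers are the ones from our example):

```python
alpha = 0.01          # conservative significance level
p_value = 0.0396      # the p-value we computed above
drift_detected = p_value < alpha
print(drift_detected)  # False: not enough evidence to reject the null
```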

To ensure that our test was correct, let’s confirm it with scipy’s built-in test function.

```python
from scipy.stats import chi2_contingency

chisq, pvalue, df, expected = chi2_contingency(cont_table)
print(chisq, pvalue)
```

```
4.232914541135393 0.03964730311588313
```

This is how hypothesis testing works. But how relevant is it for data drift detection in a production machine learning system?

Statistics, in its broadest sense, is the science of making inferences about entire populations based on small samples. When the famous t-test was first published at the beginning of the 20th century, all calculations were made with pen and paper. Even today, students in STATS101 courses will learn that a “large sample” starts from 30 observations.

Back in the days when data was hard to collect and store, and manual calculations were tedious, statistically rigorous tests were a great way to answer questions about the broader populations. Nowadays, however, with often abundant data, many tests diminish in usefulness.

The catch is that many statistical tests treat the amount of data as evidence. With less data, the observed effect is more prone to random variation due to sampling error; with more data, its variance decreases. Consequently, the exact same observed effect constitutes stronger evidence against the null hypothesis when backed by more data.

To illustrate this phenomenon, consider comparing two companies, A and B, in terms of the gender ratio among their employees. Let’s imagine two scenarios. First, let’s take random samples of 10 employees from each company. At company A, 6 out of 10 are women, while at company B, 4 out of 10 are women. Second, let’s increase our sample size to 1000. At company A, 600 out of 1000 are women, and at B, it’s 400. In both scenarios, the gender ratios were the same. However, more data seems to offer stronger evidence that company A employs proportionally more women than company B, doesn’t it?
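We can sketch this with the same chi-2 test used above (the employee counts are the made-up ones from the example): identical 60/40 ratios, wildly different p-values.

```python
from scipy.stats import chi2_contingency

# Same 60%/40% gender split observed at two different sample sizes
small_sample = [[6, 4], [4, 6]]          # 10 employees sampled per company
large_sample = [[600, 400], [400, 600]]  # 1000 employees sampled per company

_, p_small, _, _ = chi2_contingency(small_sample)
_, p_large, _, _ = chi2_contingency(large_sample)

print(p_small)  # large p-value: no evidence of a difference
print(p_large)  # tiny p-value: very strong evidence of a difference
```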

This phenomenon often manifests in hypothesis testing with large data samples. The more data, the lower the p-value, and so the more likely we are to reject the null hypothesis and declare the detection of some kind of statistical effect, such as data drift.

Let’s see whether this holds for our chi-2 test for the difference in frequencies of a categorical variable. In the original example, the serving set was roughly ten times smaller than the training set. Let’s multiply the frequencies in the serving set by a set of scaling factors between 1/100 and 10 and calculate the chi-2 statistic and the test’s p-value each time. Notice that multiplying all frequencies in the serving set by the same constant does not impact their distribution: the only thing we are changing is the size of one of the sets.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

training_freqs = np.array([10_322, 24_930, 30_299])
serving_freqs = np.array([1_015, 2_501, 3_187])

multipliers = [0.01, 0.03, 0.05, 0.07, 0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10]
p_values, chi_sqs = [], []

for serving_size_multiplier in multipliers:
    augmented_serving_freqs = serving_freqs * serving_size_multiplier
    cont_table = pd.DataFrame([
        training_freqs,
        augmented_serving_freqs,
    ])
    chi_sq, pvalue, _, _ = chi2_contingency(cont_table)
    p_values.append(pvalue)
    chi_sqs.append(chi_sq)
```

The values at the multiplier equal to one are the ones we’ve calculated before. Notice how with a serving size just 3 times larger (marked with a vertical dashed line) our conclusion changes completely: we get the chi-2 statistic of 11 and the p-value of almost zero, which in our case corresponds to indicating data drift.

The consequence of this is an increasing number of false alarms. Even though these effects are statistically significant, they are not necessarily significant from the performance monitoring point of view. With a large enough data set, even the tiniest data drift will be flagged, even if it is too weak to deteriorate the model’s performance.

Having learned this, you might be tempted to suggest dividing the serving data into a number of chunks and running multiple tests with smaller data sets. Unfortunately, this is not a good idea either. To understand why, we need to deeply understand what the p-value really means.

We have already defined the p-value as the probability of observing the test statistic at least as unlikely as the one we have actually observed, given that the null hypothesis is true. Let’s try to unpack this mouthful.

The null hypothesis means no effect, in our case: no data drift. This means that whatever differences there are between the training and serving data, they have emerged as a consequence of random sampling. The p-value can therefore be seen as the probability of getting the differences we got, given that they only come from randomness.

Hence, our p-value of roughly 0.04 means that in the complete absence of data drift, about 4% of tests will erroneously signal data drift due to random chance. This stays consistent with the notation for what the p-value represents which we introduced earlier: P(data|H₀). If this probability is 0.04, then given that H₀ is true (no drift), we have a 4% chance of observing data at least as different as what we observed (according to the test statistic).

This is the reason why running more tests on smaller data samples is not a good idea: if instead of testing the serving data from the entire day once, we split it into 10 chunks and ran 10 tests each day, we would end up with a false alarm roughly every two to three days, on average! This may lead to the so-called alert fatigue, a situation in which you are bombarded by alerts to the extent that you stop paying attention to them. And when data drift really does happen, you might miss it.
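A quick simulation illustrates the effect (a sketch with made-up frequencies; the false-alarm rate per test is governed by the chosen significance level). We repeatedly draw serving chunks from exactly the same distribution as the training data, so every alarm raised is, by construction, false:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
probs = [0.16, 0.38, 0.46]                      # true category shares: no drift, ever
training_freqs = rng.multinomial(65_000, probs)

alpha = 0.05
n_days, chunks_per_day = 100, 10
false_alarms = 0
for _ in range(n_days * chunks_per_day):
    chunk_freqs = rng.multinomial(670, probs)   # a small serving chunk, same distribution
    _, pvalue, _, _ = chi2_contingency([training_freqs, chunk_freqs])
    false_alarms += pvalue < alpha

print(false_alarms / n_days)  # false alarms per day, close to alpha * chunks_per_day
```

With ten tests per day at a 5% significance level, the alarms pile up even though the serving data never drifts.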

We have seen that detecting data drift based on a test’s p-value can be unreliable, leading to many false alarms. How can we do better? One solution is to turn 180 degrees and resort to Bayesian testing, which allows us to directly estimate what we actually need, P(H₀|data), rather than the p-value, P(data|H₀).