# Statistical Experiments With Resampling | Towards Data Science

## Introduction

Most people working with data make observations and then wonder whether these observations are statistically significant. Unless one has some formal training in statistical inference and past experience in running significance tests, the first thought that comes to mind is to find a statistician who can advise on how to conduct the test, or at least confirm that the test has been executed correctly and that the results are valid.

There are many reasons for this. For a start, it is often not immediately obvious which test is needed, which formulas underpin the test, how to use those formulas, and whether the test can be used in the first place, e.g. because the data do not fulfil necessary conditions such as normality. There are comprehensive R and Python packages for estimating a wealth of statistical models and for conducting statistical tests, such as statsmodels in Python.

Still, without a full appreciation of the statistical theory, using a package by replicating an example from the user guide often leaves a lingering sense of insecurity, in anticipation of severe criticism once the approach is scrutinised by a seasoned statistician. Personally, I am an engineer who turned into a data analyst over time. I took statistics courses during my undergraduate and postgraduate studies, but I did not use statistics extensively because this is not typically what an engineer does for a living. I believe the same applies to many other data analysts and data scientists, particularly if their formal training is in, for example, engineering, computer science or chemistry.

I decided to write this article because I recently came to the realisation that simulation can readily be used in place of more classical formula-based statistical methods. Most people would probably think immediately of bootstrapping to estimate the uncertainty of the mean. But it is not only about bootstrapping. Using resampling within random permutation tests can provide answers to many statistical inference problems. Such tests are generally not very difficult to write and execute. They apply to both continuous and binary data, regardless of sample size and without making assumptions about the data distribution. In this sense, permutation tests are non-parametric and the only requirement is exchangeability, i.e. the probability of observing a certain sequence of values is the same for any permutation of the sequence. This is really not much to ask.
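To make the bootstrapping idea concrete, here is a minimal sketch of how the uncertainty of a mean could be estimated by resampling with replacement. The data are simulated and purely hypothetical, and names such as `bootstrap_means` are my own illustration rather than part of any package.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the sketch is reproducible

# Hypothetical sample: 50 measurements from a skewed (non-normal) distribution
sample = rng.exponential(scale=2.0, size=50)


def bootstrap_means(data, n_resamples=10_000, rng=rng):
    """Resample the data with replacement and return the mean of each resample."""
    data = np.asarray(data)
    # Each row holds the indices of one resample of the same size as the data
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    return data[idx].mean(axis=1)


boot = bootstrap_means(sample)

# Spread of the resampled means approximates the uncertainty of the sample mean
std_error = boot.std(ddof=1)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])  # simple percentile interval

print(f"sample mean: {sample.mean():.3f}")
print(f"bootstrap standard error: {std_error:.3f}")
print(f"95% percentile interval: ({ci_low:.3f}, {ci_high:.3f})")
```

The percentile interval is only one of several ways to turn bootstrap draws into an interval. I use it here because it is the simplest to read.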

The unavailability of computing resources was perhaps one of the reasons for the impressive advancement of formula-based statistical inference tests in the past. Resampling a data sample with tens or thousands of records thousands of times was prohibitive back then, but it is not prohibitive anymore. Does this mean that classical statistical inference methods are not needed anymore? Of course not. But being able to run a permutation test and confirm the results can be reassuring when the results are similar, or can help us understand which assumptions do not hold when we observe discrepancies. Being able to run a statistical test from scratch without relying on a package also gives some sense of empowerment.

Permutation tests are of course nothing new, but I thought it would be a good idea to provide some examples and the corresponding code. This may alleviate the fear of some data experts out there and bring statistical inference using simulation closer to their everyday practice. The article uses permutation tests to answer two questions. There are many more scenarios in which a permutation test can be used, and for more complex questions the design of a permutation test may not be immediately obvious. In this sense, this article is not comprehensive. However, the principles are the same. Once the basics are understood, it is easier to look up an authoritative source on how to design a permutation test for other, more nuanced, business questions. My intention is to trigger a way of thinking where simulating the population distribution is at the centre, and where the simulated draws let us estimate the probability of an observed effect occurring by chance. This is what hypothesis tests are about.

Statistical inference starts with a hypothesis, e.g. that a new drug is more effective against a given disease than the traditional treatment. Effectiveness could be measured by checking the reduction of a given blood index (continuous variable), or by counting the number of animals in which the disease cannot be detected following treatment (discrete variable), for both the new drug and the traditional treatment (control). Such two-group comparisons, also known as A/B tests, are discussed extensively in all classical statistics texts and in popular tech blogs such as this one. Using the drug design example, we will first test whether the new drug is more effective than the traditional treatment (A/B testing). Building on this, we will estimate how many animals we need to establish that the new drug is more effective, assuming that in reality it is 1% more effective than the traditional treatment (or for another effect size). Although the two questions seem unrelated, they are not: we will be reusing code from the first to answer the second. All code can be found in my blog repository.

I welcome comments, but please be constructive. I do not pretend to be a statistician and my intention is to help others go through a similar learning process when it comes to permutation tests.

## A/B testing

Let’s come back to the first question, i.e. whether the new drug is more effective than the traditional treatment. When we run an experiment, ill animals are assigned to two groups, depending on which treatment they receive. Because the animals are assigned to the groups randomly, any observed difference in treatment efficacy is either due to the effectiveness of the drug, or due to chance, e.g. because the animals with the stronger immune systems happened to be assigned to the new drug group. These are the two situations that we need to untangle. In other words, we want to examine whether random chance alone can explain any observed benefit of using the new drug.
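Before looking at concrete numbers, the reasoning above can be sketched in code: if chance alone explains the difference, the group labels are interchangeable, so we can shuffle them many times and count how often the shuffled difference is at least as large as the observed one. This is only a minimal sketch under my own assumptions. The function name `permutation_test`, the one-sided alternative, the add-one correction and the outcome counts are all hypothetical illustrations, not results from a real experiment.

```python
import numpy as np


def permutation_test(treatment, control, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(treatment) - mean(control) > 0."""
    rng = np.random.default_rng(seed)
    treatment = np.asarray(treatment, dtype=float)
    control = np.asarray(control, dtype=float)

    observed = treatment.mean() - control.mean()
    pooled = np.concatenate([treatment, control])
    n_treat = len(treatment)

    count = 0
    for _ in range(n_permutations):
        # Shuffling the pooled outcomes mimics a fresh random group assignment
        shuffled = rng.permutation(pooled)
        diff = shuffled[:n_treat].mean() - shuffled[n_treat:].mean()
        if diff >= observed:
            count += 1

    # Add-one correction so the estimated p-value is never exactly zero
    p_value = (count + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical outcomes: 1 = disease no longer detected, 0 = still detected
new_drug = np.array([1] * 40 + [0] * 60)
traditional = np.array([1] * 25 + [0] * 75)

observed, p_value = permutation_test(new_drug, traditional)
print(f"observed difference: {observed:.3f}, p-value: {p_value:.4f}")
```

The same function works for a continuous outcome such as a blood index, because it only compares group means. A small p-value suggests that random assignment alone is unlikely to produce a difference as large as the observed one.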

Let’s come up with some imaginary numbers to make an illustration: