In this article I’ll introduce the concept of a propensity score and what it is used for before presenting 3 common methodologies. I’ll be discussing the following propensity score models:

- *Propensity Score Matching with replacement* (PSM)
- *Propensity Score Matching without replacement* (PSM w/o)
- *Inverse Propensity Score Weighting* (IPSW)

The best way to evaluate the impact of a particular intervention or treatment is to run a randomised controlled trial (RCT). In an RCT you randomly split your population into 2 cohorts and apply the intervention to just one of them; this becomes your treatment group. The cohort that did not receive the intervention is your control. Because assignment to control and treatment is random, there should be no structural differences in characteristics between the two groups. If, after treatment, the treatment group behaves differently (e.g. converts at a different rate) then we can conclude that this is the result of the intervention.

However, there are many scenarios in which it is not possible to run an RCT including but not limited to:

- Ethical reasons — e.g. product pricing needs to be kept consistent across users
- The treatment effect can’t be measured digitally, e.g. an advertising billboard
- Your tech stack means you can’t create 2 experiences

Propensity score modelling allows you to stratify your treatment and control to remove behavioural and demographic biases that might be acting as confounders

Propensity score modelling allows you to infer the causal relationship between an intervention and a response in situations where everyone has the *potential* to be exposed to your treatment. While everyone *could* be exposed to the treatment, not everyone will be; there will invariably be some people who aren’t. The behaviour of those who saw your intervention and those who did not is likely to differ, because there’s a reason the treatment group was exposed and the control group was not. If the behaviour that led to the treatment is correlated with conversion then you have a confounder: the characteristics of people in your treatment group led both to being exposed to the intervention and to the conversion. Propensity score methods allow you to stratify your treatment and control to remove behavioural and demographic biases that might be acting as confounders.

In the visual above, the teal people are more likely to have been exposed to marketing and so are over represented in the treatment group. If teal people are more likely to convert than the navy people then you will naturally expect more conversions from the treatment group. However, this is driven by the proportion of teal people and has nothing to do with whether someone saw the marketing.

A propensity score is the probability that an individual will be exposed to the treatment. If we took 100 identical customers and 70 of them were exposed to the treatment then they would all have a propensity score of 0.7.

The most common approach to calculating the propensity score is to fit a logistic regression classifier to predict the treatment group and then use the associated probabilities of being in the treatment group as the propensity score. The benefit of this approach is that it is very simple to perform and, being a simple model, is less prone to overfitting than more flexible classifiers. The drawback is that the logistic regression classifier is trained to *classify* each sample as treatment or control, and the specific probabilities are only a proxy for the true propensity score.
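As a rough illustration of the classifier approach described above, the sketch below fits a one-covariate logistic regression by plain gradient ascent and reads off the predicted probabilities as propensity scores. The `engagement` covariate, the toy data and the helper names are assumptions made for this example; in practice you would more likely reach for a library implementation such as scikit-learn's `LogisticRegression`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ts, lr=0.1, steps=5000):
    """Fit P(treated | x) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals (label minus predicted probability) drive both gradients.
        residuals = [t - sigmoid(a + b * x) for x, t in zip(xs, ts)]
        grad_a = sum(residuals) / n
        grad_b = sum(r * x for r, x in zip(residuals, xs)) / n
        a += lr * grad_a
        b += lr * grad_b
    return a, b

# Hypothetical toy data: one covariate (engagement level 0-3) and a treatment flag.
engagement = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
treated    = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]

a, b = fit_logistic(engagement, treated)

# The predicted probability of treatment is used as the propensity score.
propensity_scores = [sigmoid(a + b * x) for x in engagement]
```

In this toy data more engaged users are treated more often, so the fitted scores rise with `engagement`.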

There are 3 commonly used propensity score models and we’ll discuss the pros and cons of each in this section.

## Propensity Score matching with replacement

Each treatment sample is matched with the control sample that has the most similar characteristics as measured by the propensity score

In PSM you pair up each treatment sample with the control sample that has the most similar propensity score. Each control sample can be matched with multiple treatment samples and not every control sample ends up being matched.

In the above visual you can see that:

- The teal coloured person in the treatment gets matched with one of the teal coloured people in the control
- The 9 navy people in the treatment all get matched with the single navy person in the control. The result is that the matched control sample gets duplicated 9 times
- There are 8 people in the control that do not get matched (pale teal) and are then dropped from our analysis
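The toy example above can be sketched in a few lines. This is a brute-force illustration rather than the nearest-neighbours search you would typically use at scale, and the propensity scores of 0.1 for teal and 0.9 for navy are assumptions chosen to mirror the example (1 teal and 9 navy in the treatment, 9 teal and 1 navy in the control).

```python
from collections import Counter

def match_with_replacement(treated_scores, control_scores):
    """For each treated sample, return the index of the control sample
    with the closest propensity score. Controls may be reused."""
    control_idx = range(len(control_scores))
    return [
        min(control_idx, key=lambda j: abs(control_scores[j] - p_t))
        for p_t in treated_scores
    ]

# Assumed scores: teal = 0.1, navy = 0.9.
treated_scores = [0.1] + [0.9] * 9   # 1 teal, 9 navy
control_scores = [0.1] * 9 + [0.9]   # 9 teal, 1 navy

matches = match_with_replacement(treated_scores, control_scores)
match_counts = Counter(matches)                      # how often each control is used
unmatched = len(control_scores) - len(match_counts)  # controls dropped from the analysis
```

Here `match_counts` shows the single navy control being reused for every navy treatment sample, which is exactly the duplication described above.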

## Pros

- Computationally efficient to do the matching process — typically you’d use a standard nearest neighbours algorithm for this

## Cons

- Control samples with multiple matches can bias the treatment effect

## Propensity Score matching without replacement

Each treatment sample is matched with a control sample that has the most similar characteristics but each control sample can only be matched once.

PSM w/o differs from PSM in how the matching is performed. Each control sample can only be matched to one treatment sample. The order in which you perform the matching can influence the model performance because the treatment samples that get matched first have a larger selection of control samples to choose from. The later treatment samples have to match with whoever is left in the control, and these often do not have a very similar propensity score. To remove the variance introduced by the ordering of the matching you could bootstrap the process with different orders, although this would introduce a large computational overhead.

In the above visual we can see:

- The teal person in the treatment gets matched with one of the teal people from the control
- The first navy person in the treatment gets matched with the only navy person in the control
- The remaining navy people in the treatment get matched with a teal person from the control — poor matching
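A greedy version of one-to-one matching on the same toy example might look like the sketch below. The function name, the optional `order` argument and the scores of 0.1 (teal) and 0.9 (navy) are assumptions for illustration. Passing different orders is one way you could drive the bootstrapping mentioned above.

```python
def match_without_replacement(treated_scores, control_scores, order=None):
    """Greedy one-to-one matching: each control index is used at most once.
    `order` is the sequence in which treated samples pick their match."""
    if order is None:
        order = range(len(treated_scores))
    available = set(range(len(control_scores)))
    matches = {}
    for i in order:
        if not available:
            break  # more treated than control samples: the rest stay unmatched
        # sorted() makes tie-breaking deterministic (lowest index wins).
        j = min(sorted(available), key=lambda k: abs(control_scores[k] - treated_scores[i]))
        matches[i] = j
        available.remove(j)
    return matches

# Assumed scores: teal = 0.1, navy = 0.9.
treated_scores = [0.1] + [0.9] * 9   # 1 teal, 9 navy
control_scores = [0.1] * 9 + [0.9]   # 9 teal, 1 navy

matches = match_without_replacement(treated_scores, control_scores)

# Distance between each treated sample and its matched control.
gaps = [abs(treated_scores[i] - control_scores[j]) for i, j in matches.items()]
poor_matches = sum(gap > 0.5 for gap in gaps)
```

Only the first navy treatment sample finds a navy control; the rest are paired with teal controls and show up in `poor_matches`.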

## Pros

- Each person in the control appears at most once so no single sample can bias the treatment effect

## Cons

- Limited open source packages available for the matching process
- Less efficient than PSM due to having to keep track of which samples have already been matched
- Requires bootstrapping which increases computational overhead
- Poor matching performance could lead to residual bias in covariates

## Inverse Propensity Score Weighting

Each control sample is weighted so that the control group as a whole matches the treatment.

In inverse propensity weighting we apply a weight to each sample in the control group based on its propensity score. Each treatment sample gets a weight of 1. The weight of the *i*th control sample can be calculated using the formula:

$$w_i = \frac{p_i}{1 - p_i}$$

where $p_i$ is the propensity score of the sample. Weights of this form mean that samples with characteristics that are under-represented in the control relative to the treatment (propensity score ~ 1) get weighted as more important, with weights > 1. Characteristics that are over-represented in the control (propensity score ~ 0) get weighted as less important, with a weight < 1. In our toy example the teal control samples (propensity score 0.1) would each get a weight of 1/9 and the navy control sample (propensity score 0.9) would get a weight of 9.

In the above visual we can see:

- The 9 teal samples in the control contribute a combined weight of 1 to match the single teal sample in the treatment
- The single navy sample in the control receives 9 votes to match the 9 navy samples in the treatment
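The weights in the toy example can be checked numerically. The sketch below assumes the weight takes the form p / (1 − p), which is consistent with the behaviour described above (weights above 1 near a propensity score of 1, below 1 near 0), and assumes propensity scores of 0.1 for teal and 0.9 for navy.

```python
def control_weight(p):
    """Weight for a control sample with propensity score p.
    Treated samples keep a weight of 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("propensity score must be in [0, 1)")
    return p / (1.0 - p)

# Assumed scores: teal = 0.1 (1 treated, 9 control), navy = 0.9 (9 treated, 1 control).
teal_weight = control_weight(0.1)
navy_weight = control_weight(0.9)

# Combined weight of the 9 teal control samples.
teal_total = 9 * teal_weight
```

Because each weight depends only on that sample's own score, the same calculation applies element-wise to a whole array of scores.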

## Pros

- Highly efficient because we’re not matching but simply performing a vectorisable calculation for each sample independently

## Cons

- Samples that have a high weight can bias the treatment effect

## Covariate Bias

Covariate bias means that there is a measurable and statistically significant difference in characteristics between the treatment and control

If you analysed the characteristics of your control and treatment groups, it is highly likely that there would be differences. Maybe more engaged users are more likely to be exposed to the intervention and so end up in your treatment group. If you are trying to increase retention then it is likely that the more engaged users of your treatment group are going to retain longer than those in the control independently of the intervention; this is bias. Propensity score modelling, when applied properly, should remove bias between the control and treatment groups in all measured characteristics.

We can measure the bias associated with a particular covariate using the absolute standardised mean difference defined by

$$\mathrm{smd} = \frac{|\bar{x}_t - \bar{x}_c|}{s}$$

Here x̄ is the sample mean and the subscripts *t* and *c* refer to the treatment and control, respectively. The |.| means take the absolute value. For continuous variables *s* is defined as

$$s = \sqrt{\frac{(n_t - 1)\,\sigma_t^2 + (n_c - 1)\,\sigma_c^2}{n_t + n_c - 2}}$$

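and for binary variables it is defined as

$$s = \sqrt{\frac{(n_t - 1)\,\bar{x}_t(1 - \bar{x}_t) + (n_c - 1)\,\bar{x}_c(1 - \bar{x}_c)}{n_t + n_c - 2}}$$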

σ is the standard deviation and *n* is the number of samples. In IPSW we can replace *n* − 1 with the sum of the weights, and x̄ is replaced by the weighted mean.

The definition of *s* is essentially the square root of the weighted average of the variances of the control and treatment. The weighted average uses *n* − 1 as the weight, in a similar way to the unbiased estimate of the population variance.

If *smd* > 0.1 for one of your covariates then we conclude that there is bias between the treatment and control and this covariate could be acting as a confounder. The threshold of 0.1 is a conventional rule of thumb, playing a role similar to the α = 0.05 cut-off of a 2-sided test in an AB test.
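For a continuous covariate, the check described above can be sketched as follows. It assumes *s* is the pooled standard deviation with *n* − 1 weights, as the description of *s* suggests; the function name and the engagement figures are hypothetical.

```python
import math
from statistics import mean, variance

def smd_continuous(treated, control):
    """Absolute standardised mean difference for a continuous covariate,
    using a pooled standard deviation weighted by n - 1."""
    n_t, n_c = len(treated), len(control)
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n_t - 1) * variance(treated) + (n_c - 1) * variance(control)) / (n_t + n_c - 2)
    return abs(mean(treated) - mean(control)) / math.sqrt(pooled_var)

# Hypothetical engagement scores for each group.
treated_engagement = [5, 6, 7, 8]
control_engagement = [4, 5, 6, 7]

smd = smd_continuous(treated_engagement, control_engagement)
biased = smd > 0.1  # flag the covariate as a potential confounder
```

You would run this once per covariate, before and after matching or weighting, to see whether the bias has been removed.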

## Challenges with Propensity Scores

There are 2 issues with using a classifier to estimate the propensity scores:

- *Accuracy*: If we took 100 very similar customers and 70 of them were in the treatment then they should all have a propensity score of 0.7
- *Self-consistency*: If we took 100 customers with a propensity score of 0.7 then 70 of them should be in the treatment

In order to fully remove bias from propensity score modelling you should have accurate and self-consistent propensity scores.

## Solving the Self-consistency Problem

Self-consistency is quite easy to achieve by rescaling the propensity scores. In the plot below I have rounded the propensity scores to the nearest 0.01 to place people into discrete buckets (*x*-axis). The proportion of samples in each bucket that are in the treatment is then shown on the *y*-axis.

In general, the logistic regression under-estimates the propensity scores at higher values and over-estimates at lower values. Roughly 70% of people with a propensity score of 0.6 are in the treatment — by definition it should be 60%.

I have also included a `tanh` curve that has been fitted to the relationship. By passing the probabilities from the logistic regression through this `tanh` function we can rescale the propensity scores to achieve self-consistency. Now people with a propensity score of 0.6 would be rescaled to 0.7 and we can see that 70% of those would be in the treatment.

However, this approach assumes that people with similar propensity scores have similar underlying characteristics — although broadly speaking this is true it isn’t necessarily accurate enough to remove bias.
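The bucketing step behind the self-consistency check can be sketched as below. The data is hypothetical and chosen to mirror the example of a score of 0.6 with 70% of samples in the treatment. Fitting the `tanh` curve itself is not shown because its exact parametrisation is not given here.

```python
from collections import defaultdict

def calibration_table(scores, treated, digits=2):
    """Map each rounded propensity score to the observed share of
    treated samples in that bucket."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [treated count, total count]
    for p, t in zip(scores, treated):
        bucket = round(p, digits)
        counts[bucket][0] += t
        counts[bucket][1] += 1
    return {b: n_t / n for b, (n_t, n) in sorted(counts.items())}

# Hypothetical scores that all round to 0.6, with 7 of the 10 samples treated.
scores  = [0.598, 0.6, 0.601, 0.6, 0.602, 0.6, 0.599, 0.6, 0.6, 0.604]
treated = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

table = calibration_table(scores, treated)
```

For self-consistent scores each key in `table` would be close to its value; here the 0.6 bucket has a treated share of about 0.7, so the scores under-estimate the propensity in that range.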

## Solving the Accuracy Problem

Accuracy is a challenge because we don’t have any ground truth propensity scores to train a regressor on, which is why we trained a classifier. We also are unable to measure how accurate the propensity scores from the classifier are without ground truth values to compare them to. Accuracy is essentially a latent variable that we’re required to assume if we are unable to prove that our propensity scores are inaccurate. For example, if the propensity scores are self-consistent and they remove covariate bias then they’re probably good enough to use.

Below is a table that summarises the pros and cons of each method. I have defined three characteristics that are important for propensity modelling which are:

- **Matching independence**: the matching of each treatment sample can be performed independently of the others. This characteristic encapsulates the computational efficiency of the method.
- **No over-matching**: no control sample gets matched or weighted so heavily that there is a risk that it will dominate the control’s conversion rate.
- **Low diversity**: a control group that contains just a small number of samples that have been matched many times each. Low diversity is similar to over-matching but many control samples have been duplicated. Depending on your threshold for over-matching it’s possible to have low diversity without over-matching.

Both PSM and IPSW have independent matching and are therefore computationally efficient methods to apply. There is little open-source support for PSM w/o, and the need for a bespoke implementation on top of the sequential matching and bootstrapping makes it a challenging method to use in practice.

PSM w/o is the only method that completely avoids over-matching because it uses one-to-one matching. However, this may come at the expense of bias reduction because of poor matches. Over-matching, although present in both PSM and IPSW, arises through different pathways in each. In IPSW, over-matching only occurs at high propensity scores where the proportion of control samples is very small and each gets assigned a large weight. If the propensity scores in your full sample only range from, say, 0.2–0.8 then you will not encounter over-matching in IPSW. In PSM, over-matching can occur at any propensity score as it is more related to the exact values and distributions of the propensity scores in the control and treatment.

PSM can easily suffer from low diversity with a few control samples looking the most like the treatment and therefore being matched multiple times each. PSM w/o will retain the full control group (if the control and treatment are of equal size) but this may lead to poor matching and therefore fail to remove bias. IPSW retains the entire control group and, because it can both increase and decrease the weights of the control samples, is able to achieve a diverse control group while also effectively removing bias.

In this article I have presented and discussed 3 approaches to using propensity scores in causal inference. Propensity score matching with replacement gets the most attention in the industry due to its ease of implementation, although without proper analysis it can fail to remove bias appropriately and to model the true treatment effect. Propensity score matching without replacement is often computationally prohibitive and can fundamentally fail to remove bias, although it does offer several advantages in generating a diverse control group. Inverse propensity score weighting is an often overlooked method that combines the computational efficiency of PSM with the robustness of PSM w/o. If your propensity scores don’t get too close to 1 then it is by far the superior method of the 3.