In the previous section, we saw two examples of loss functions that we would like to compute if we could observe the Individual Treatment Effects *τᵢ*. However, in practice, even with a randomized experiment and even with a validation set, we do not observe the ITE, our object of interest. We will now cover some measures that try to evaluate uplift models given this practical constraint.

## Outcome Loss

The first and simplest approach is to switch to a different loss variable. While we cannot observe the Individual Treatment Effects *τᵢ*, we can still observe our outcome *Yᵢ*. This is not exactly our object of interest, but we might expect an uplift model that predicts *Y* well to also produce good estimates of *τ*.

One such loss function could be the **Outcome MSE loss**, which is the usual MSE loss function for prediction methods.
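In our notation, writing *μ̂(w, x)* for the model's predicted outcome given treatment status *w* and covariates *x*, this loss can be written on a validation sample of size *n* as

$$\mathcal{L}_{\mu\text{-MSE}}(\hat{\mu}) = \frac{1}{n} \sum_{i=1}^{n} \Big( Y_i - \hat{\mu}(W_i, X_i) \Big)^2$$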

The problem here is that not all models directly produce an estimate of *μ(x)*. Therefore, we skip this comparison and switch to methods that can evaluate any uplift model.

## Prediction to Prediction Loss

Another very simple approach could be to compare the predictions of the model trained on the training set with the predictions of another model trained on the validation set. While intuitive, this approach could be **extremely misleading**.

```python
def loss_pred(data, learner):
    # predictions of the model trained on the training set
    tau = learner.effect(data[X])
    # re-fit a copy of the same learner on the validation set and predict again
    learner2 = copy.deepcopy(learner).fit(data[Y], data[W], X=data[X])
    tau2 = learner2.effect(data[X])
    return np.mean((tau - tau2)**2)

results = compare_methods(learners, names, loss_pred, 'Prediction to Prediction Loss')
```

Unsurprisingly, this metric performs extremely badly, and you should **never use it**: it rewards models that are consistent, irrespective of their quality. A model that always predicts the same arbitrary constant CATE for every observation would obtain a perfect score.

## Distribution Loss

A different approach is to ask: how well can we match the distribution of potential outcomes? We can do this exercise for either the *treated* or the *untreated* potential outcome. Let’s take the latter. Suppose we take the observed `sales` for customers that did *not* receive the `mail`, and the observed `sales` *minus* the estimated CATE *τ̂(x)* for customers that did receive the `mail`. By the **unconfoundedness** assumption, these two distributions of the untreated potential outcome should be similar, conditional on covariates *X*.

Therefore, if we have correctly estimated the treatment effects, we expect the distance between the two distributions to be small.

We can also do the same exercise for the *treated* potential outcome.

We use the energy distance as the distance metric.
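As a reminder, for independent draws *U, U′* from distribution *F* and *V, V′* from distribution *G*, the (squared) energy distance can be written as

$$D^2(F, G) = 2\,\mathbb{E}\lVert U - V \rVert - \mathbb{E}\lVert U - U' \rVert - \mathbb{E}\lVert V - V' \rVert$$

and it is zero if and only if the two distributions coincide.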

```python
from dcor import energy_distance

def loss_dist(data, learner):
    tau = learner.effect(data[X])
    # subtract the estimated CATE from the treated outcomes to impute their untreated potential outcome
    data.loc[data.mail==1, 'sales'] -= tau[data.mail==1]
    # energy distance between the (outcome, covariates) distributions of the two groups
    return energy_distance(data.loc[data.mail==0, [Y] + X], data.loc[data.mail==1, [Y] + X], exponent=2)

results = compare_methods(learners, names, loss_dist, 'Distribution Loss')
```

This measure is extremely noisy and rewards the S-learner followed by the T-learner, which are actually the two worst-performing models.

## Above-below Median Difference

The above-below median loss tries to answer the question: is our uplift model detecting **any heterogeneity**? In particular, if we take the validation set and split the sample into above-median and below-median predicted uplift *τ̂(x)*, how big is the actual difference in average effect, estimated with a difference-in-means estimator? We would expect better estimators to separate high-effect from low-effect observations more sharply.

```python
from statsmodels.formula.api import ols

def loss_ab(data, learner):
    # tiny jitter breaks ties when a learner predicts a constant CATE
    tau = learner.effect(data[X]) + np.random.normal(0, 1e-8, len(data))
    data['above_median'] = tau >= np.median(tau)
    # coefficient on the mail × above_median interaction: difference in effects between the two groups
    param = ols('sales ~ mail * above_median', data=data).fit().params.iloc[-1]
    return param

results = compare_methods(learners, names, loss_ab, title='Above-below Median Difference', subtitle='higher is better')
```

Unfortunately, the above-below median difference rewards the T-learner, which is among the worst-performing models.

It’s important to note that the difference-in-means estimators in the two groups (above- and below-median *τ̂(x)*) are **not guaranteed to be unbiased**, even if the data came from a randomized experiment. In fact, we have split the sample on a variable, *τ̂(x)*, that is highly endogenous. Therefore, the method should be taken with a grain of salt.

## Uplift Curve

An extension of the above-below median test is the **uplift curve**. The idea is simple: instead of splitting the sample into two groups based on the median (0.5 quantile), why not split the data into more groups (more quantiles)?

For each group, we compute the difference-in-means estimate, and we plot its cumulative sum against the corresponding quantile. The result is called the **uplift curve**. The interpretation is simple: the higher the curve, the better we are able to separate high- from low-effect observations. However, the same **disclaimer** applies: the difference-in-means estimates are not unbiased, so they should be taken with a grain of salt.

```python
def generate_uplift_curve(df):
    Q = 20  # number of quantiles
    df_q = pd.DataFrame()
    data = dgp.generate_data(seed_data=1, seed_assignment=1, keep_po=True)
    ate = np.mean(data[Y][data[W]==1]) - np.mean(data[Y][data[W]==0])
    for learner, name in zip(learners, names):
        # rank observations by predicted uplift (jitter breaks ties)
        data['tau_hat'] = learner.effect(data[X])
        data['q'] = pd.qcut(-data.tau_hat + np.random.normal(0, 1e-8, len(data)), q=Q, labels=False)
        for q in range(Q):
            # difference-in-means estimate on the top quantiles, scaled by the share of the population
            temp = data[data.q <= q]
            uplift = (np.mean(temp[Y][temp[W]==1]) - np.mean(temp[Y][temp[W]==0])) * q / (Q-1)
            df_q = pd.concat([df_q, pd.DataFrame({'q': [q], 'uplift': [uplift], 'learner': [name]})], ignore_index=True)

    fig, ax = plt.subplots(1, 1, figsize=(8, 5))
    # dashed benchmark: treating a random share of the population yields a linear curve towards the ATE
    sns.lineplot(x=range(Q), y=ate*np.arange(Q)/(Q-1), color='k', ls='--', lw=3)
    sns.lineplot(x='q', y='uplift', hue='learner', data=df_q)
    plt.suptitle('Uplift Curve', y=1.02, fontsize=28, fontweight='bold')
    plt.title('higher is better', fontsize=14, fontweight=None, y=0.96)

generate_uplift_curve(df)
```

While probably not the best method to *evaluate* uplift models, the uplift curve is very important for **understanding** and **implementing** them. In fact, for each model, it tells us what the expected average treatment effect is (y-axis) as we increase the share of the treated population (x-axis).

## Nearest Neighbor Match

The last couple of methods we analyzed used aggregated data in order to understand whether the methods work on larger groups. The nearest neighbor match tries instead to understand how well an uplift model predicts individual treatment effects. However, since the ITEs are not observable, it tries to build a **proxy by matching** treated and control observations on observable characteristics *X*.

For example, if we take all treated observations (*i: Wᵢ=1*), and we find the nearest neighbor in the control group (*NN₀(Xᵢ)*), the corresponding MSE loss function is
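Matching the implementation below, and writing *Y<sub>NN₀(Xᵢ)</sub>* for the observed outcome of the matched control, this loss can be written as

$$\mathcal{L}_{NN}(\hat{\tau}) = \frac{1}{n_1} \sum_{i:\, W_i = 1} \Big( \hat{\tau}(X_i) - \big( Y_i - Y_{NN_0(X_i)} \big) \Big)^2$$

where *n₁* is the number of treated observations.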

```python
from scipy.spatial import KDTree

def loss_nn(data, learner):
    tau_hat = learner.effect(data[X])
    # for each treated observation, find its nearest neighbor among the controls
    nn0 = KDTree(data.loc[data[W]==0, X].values)
    control_index = nn0.query(data.loc[data[W]==1, X], k=1)[-1]
    # proxy ITE: treated outcome minus the outcome of its matched control
    tau_nn = data.loc[data[W]==1, Y].values - data.loc[data[W]==0, Y].values[control_index]
    return np.mean((tau_hat[data[W]==1] - tau_nn)**2)

results = compare_methods(learners, names, loss_nn, title='Nearest Neighbor Loss')
```

In this case, the nearest neighbor loss performs quite well, identifying the two worst-performing methods, the S- and T-learners.

## IPW Loss

The Inverse Probability Weighting (IPW) loss function was first proposed by Gutierrez, Gerardy (2017), and it is the first of three metrics we are going to see that use a **pseudo-outcome** *Y\** to evaluate the estimator. Pseudo-outcomes are variables whose expected value, conditional on the covariates, is the Conditional Average Treatment Effect, but that are too volatile to be used directly as estimates. For a more detailed explanation of pseudo-outcomes, I suggest my article on causal regression trees. The pseudo-outcome corresponding to the IPW loss is
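$$Y^{IPW}_i = Y_i \cdot \frac{W_i - e(X_i)}{e(X_i)\,\big(1 - e(X_i)\big)}$$

where *e(Xᵢ)* is the propensity score (computed as `tau_gg` in the snippet below); under unconfoundedness, its expected value conditional on *Xᵢ* equals *τ(Xᵢ)*,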

so that the corresponding loss function is
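$$\mathcal{L}_{IPW}(\hat{\tau}) = \frac{1}{n} \sum_{i=1}^{n} \Big( \hat{\tau}(X_i) - Y^{IPW}_i \Big)^2$$

where, in practice, the propensity score *e(Xᵢ)* is replaced by an estimate *ê(Xᵢ)*.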

```python
def loss_ipw(data, learner):
    tau_hat = learner.effect(data[X])
    # estimated propensity score
    e_hat = clone(model_e).fit(data[X], data[W]).predict_proba(data[X])[:,1]
    # IPW pseudo-outcome (Gutierrez, Gerardy 2017)
    tau_gg = data[Y] * (data[W] - e_hat) / (e_hat * (1 - e_hat))
    return np.mean((tau_hat - tau_gg)**2)

results = compare_methods(learners, names, loss_ipw, title='IPW Loss')
```

The IPW loss is extremely noisy. A solution is to use its more robust variations, the R-loss and the DR-loss, which we present next.

## R Loss

The R-loss was introduced together with the R-learner by Nie, Wager (2017), and it is essentially the **objective function** of the R-learner. As for the IPW loss, the idea is to try to match a pseudo-outcome whose expected value is the Conditional Average Treatment Effect.

The corresponding loss function is
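$$Y^{R}_i = \frac{Y_i - \hat{y}(X_i, W_i)}{W_i - \hat{e}(X_i)}, \qquad \mathcal{L}_{R}(\hat{\tau}) = \frac{1}{n} \sum_{i=1}^{n} \Big( \hat{\tau}(X_i) - Y^{R}_i \Big)^2$$

where *ŷ* and *ê* are the outcome and propensity models, fitted on the training data, as implemented in the snippet below.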

```python
def loss_r(data, learner):
    tau_hat = learner.effect(data[X])
    # nuisance models fitted on the training data (df), evaluated on the validation data
    y_hat = clone(model_y).fit(df[X + [W]], df[Y]).predict(data[X + [W]])
    e_hat = clone(model_e).fit(df[X], df[W]).predict_proba(data[X])[:,1]
    # R pseudo-outcome: outcome residual over treatment residual
    tau_nw = (data[Y] - y_hat) / (data[W] - e_hat)
    return np.mean((tau_hat - tau_nw)**2)

compare_methods(learners, names, loss_r, title='R Loss')
```

The R-loss is considerably less noisy than the IPW loss, and it clearly isolates the S-learner. However, it tends to favor its corresponding learner, the R-learner.

## DR Loss

The DR-loss is the **objective function** of the DR-learner, and it was first introduced by Saito, Yasui (2020). As for the IPW- and the R-loss, the idea is to try to match a pseudo-outcome whose expected value is the Conditional Average Treatment Effect. The DR pseudo-outcome is strongly related to the AIPW estimator, also known as the doubly-robust estimator, hence the DR name.

The corresponding loss function is
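$$Y^{DR}_i = \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{\big( Y_i - \hat{\mu}(X_i, W_i) \big)\big( W_i - \hat{e}(X_i) \big)}{\hat{e}(X_i)\,\big( 1 - \hat{e}(X_i) \big)}, \qquad \mathcal{L}_{DR}(\hat{\tau}) = \frac{1}{n} \sum_{i=1}^{n} \Big( \hat{\tau}(X_i) - Y^{DR}_i \Big)^2$$

where *μ̂₁* and *μ̂₀* are the outcome model’s predictions with the treatment set to 1 and 0, respectively, and *μ̂(Xᵢ, Wᵢ)* its prediction at the observed treatment, as implemented in the snippet below.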

```python
def loss_dr(data, learner):
    tau_hat = learner.effect(data[X])
    # outcome model: predictions at the observed treatment and with treatment set to 1 and 0
    y_hat = clone(model_y).fit(df[X + [W]], df[Y]).predict(data[X + [W]])
    mu1 = clone(model_y).fit(df[X + [W]], df[Y]).predict(data[X + [W]].assign(mail=1))
    mu0 = clone(model_y).fit(df[X + [W]], df[Y]).predict(data[X + [W]].assign(mail=0))
    # estimated propensity score
    e_hat = clone(model_e).fit(df[X], df[W]).predict_proba(data[X])[:,1]
    # doubly-robust (AIPW) pseudo-outcome
    tau_dr = mu1 - mu0 + (data[Y] - y_hat) * (data[W] - e_hat) / (e_hat * (1 - e_hat))
    return np.mean((tau_hat - tau_dr)**2)

results = compare_methods(learners, names, loss_dr, title='DR Loss')
```

As for the R-loss, the DR-loss tends to favor its corresponding learner, the DR-learner. However, it provides a more accurate ranking of the algorithms.

## Empirical Policy Gain

The last loss function that we are going to analyze is different from all the others we have seen so far, since it does *not* focus on how well we are able to estimate the treatment effects, but rather on how well the corresponding **optimal treatment policy** would perform. In particular, Hitsch, Misra, Zhang (2023) propose the following gain function:
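$$\Gamma(\hat{\tau}) = \sum_{i=1}^{n} \left[ d\big(\hat{\tau}(X_i)\big)\, W_i\, \frac{Y_i - c}{\hat{e}(X_i)} + \Big( 1 - d\big(\hat{\tau}(X_i)\big) \Big) \big( 1 - W_i \big)\, \frac{Y_i}{1 - \hat{e}(X_i)} \right]$$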

where *c* is the treatment cost and *d* is the optimal treatment policy given the estimated CATE *τ̂(Xᵢ)*. In our case, we assume an individual treatment cost of *c* = $0.01, so that the optimal policy is to treat every customer with an estimated CATE larger than 0.01.

The terms *Wᵢ⋅d(τ̂)* and *(1-Wᵢ)⋅(1-d(τ̂))* imply that the calculation only uses individuals for whom the actual treatment *Wᵢ* coincides with the prescribed one, *d*.

```python
def gain_policy(data, learner):
    tau_hat = learner.effect(data[X])
    # estimated propensity score
    e_hat = clone(model_e).fit(data[X], data[W]).predict_proba(data[X])[:,1]
    # treat only customers whose estimated CATE exceeds the treatment cost
    d = tau_hat > cost
    # IPW-weighted gain, using only observations where the actual treatment matches the policy
    return np.sum(d * data[W] * (data[Y] - cost) / e_hat + (1-d) * (1-data[W]) * data[Y] / (1-e_hat))

results = compare_methods(learners, names, gain_policy, title='Empirical Policy Gain', subtitle='higher is better')
```

The empirical policy gain performs very well, isolating the two worst-performing methods, the S- and T-learners.

In this article, we have introduced a wide variety of methods to evaluate uplift models, a.k.a. Conditional Average Treatment Effect estimators. We have also tested them on our simulated dataset, which is a very specific and limited example. How do these metrics **perform** in general?

Schuler, Baiocchi, Tibshirani, Shah (2018) compare the S-, T-, and R-loss on **simulated data**, for the corresponding estimators. They find that the R-loss “*is the validation set metric that, when optimized, most consistently leads to the selection of a high-performing model*”. The authors also detect the so-called **congeniality bias**: metrics such as the R- or DR-loss tend to be biased towards the corresponding learner.

Curth, van der Schaar (2023) study a broader array of learners from a **theoretical perspective**. They find that “*no existing selection criterion is globally best across all experimental conditions we consider*”.

Mahajan, Mitliagkas, Neal, Syrgkanis (2023) is the **most comprehensive** study in terms of scope. The authors compare many metrics on 144 datasets and 415 estimators. They find that “*no metric significantly dominates the rest*” but “*metrics that use DR elements seem to always be among the candidate winners*”.