This tutorial explores how covariates influence A/B testing precision in a randomized experiment. A properly randomized A/B test calculates the lift by comparing the average outcome in the treatment and control groups. However, the influence of features other than the treatment on the outcome determines the statistical properties of the A/B test. For instance, omitting influential features in the test lift calculation can lead to a highly imprecise estimate of the lift, even if it converges to the true value as the sample size increases.

You will learn what RMSE, bias, and size of a test are and understand the performance of an A/B test through generating simulated data and running Monte Carlo experiments. This kind of work is helpful to understand how the properties of the Data Generating Process (DGP) influence A/B test performance and will help you take this understanding to run A/B tests on real-world data. First, we discuss some basic statistical properties of an estimator.

## Root Mean Square Error (RMSE)

RMSE (Root Mean Square Error): RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and observed values. It’s the square root of the average squared differences between prediction and actual observation. The formula for RMSE is:

RMSE = sqrt[(1/n) * Σ(actual – prediction)²]

RMSE gives a relatively high weight to large errors because they are squared before they are averaged, which means the RMSE should be more useful when large errors are undesirable.

## Bias

In statistics, the bias of an estimator is the difference between this estimator’s expected value and the true value of the estimated parameter. An estimator or decision rule with zero bias is called unbiased; otherwise, the estimator is said to be biased. In other words, a bias occurs when an algorithm consistently learns the same incorrect thing by failing to see the accurate underlying relationship.

For instance, if you are trying to predict house prices based on features of the house, and your predictions are consistently $100,000 below the actual price, your model is biased.

## Size

In hypothesis testing in statistics, “test size” refers to the significance level of the test, often denoted by the Greek letter α (alpha). The significance level, or test size, is a threshold a test statistic must surpass to reject a hypothesis.

It represents the probability of rejecting the null hypothesis when it is, in fact, true, which is a type of error known as a Type I error or false positive.

For example, if a test is set at a 5% significance level (α = 0.05), it means that there’s a 5% risk of rejecting the null hypothesis when it is true. This level, 0.05, is a common choice for α, although other levels, such as 0.01 or 0.10, can also be used depending on the context and the field of study.

The smaller the test size, the stronger the evidence required to reject the null hypothesis, reducing the likelihood of a Type I error but potentially increasing the chance of a Type II error (failing to reject the null hypothesis when it is false). The balance between Type I and Type II errors is an important consideration in the design of any statistical test.

## Empirical size

Empirical size in the context of hypothesis testing via Monte Carlo simulations refers to the proportion of times that the null hypothesis is incorrectly rejected across the simulations when the null hypothesis is true. This is essentially a simulated version of a Type I error rate.

Here’s a general process of how you would do this:

1. Set up your null hypothesis and choose a significance level for your test (e.g., α = 0.05).

2. Generate a large number of samples under the assumption that the null hypothesis is true. The number of samples is usually quite large, such as 10,000 or 100,000, to ensure the stability of the results.

3. For each sample, perform the hypothesis test and record whether the null hypothesis is rejected (you can record this as 1 for rejection and 0 for fail to reject).

4. Calculate the empirical size as the proportion of simulations where the null hypothesis was rejected. This estimates the probability of rejecting the null hypothesis when it’s true under the given testing procedure.

The code below shows how to implement this.

`import numpy as np`

from scipy.stats import ttest_1samp

import randomrandom.seed(10)

def calculate_empirical_size(num_simulations: int, sample_size: int, true_mean: float, significance_level: float) -> float:

"""

Simulates a set of samples and conducts a hypothesis test on each, then calculates the empirical size.

Parameters:

num_simulations (int): The number of simulations to run.

sample_size (int): The size of each simulated sample.

true_mean (float): The true mean under the null hypothesis.

significance_level (float): The significance level for the hypothesis tests.

Returns:

float: The empirical size, or the proportion of tests where the null hypothesis was rejected.

"""

import numpy as np

from scipy.stats import ttest_1samp

# Initialize counter for null hypothesis rejections

rejections = 0

# Run simulations

np.random.seed(0) # for reproducibility

for _ in range(num_simulations):

sample = np.random.normal(loc=true_mean, scale=1, size=sample_size)

t_stat, p_value = ttest_1samp(sample, popmean=true_mean)

if p_value < significance_level:

rejections += 1

# Calculate empirical size

empirical_size = rejections / num_simulations

return empirical_size

calculate_empirical_size(1000, 1000, 0, 0.05)

For each of the 1000 simulations, a random sample of size 1000 is drawn from a normal distribution with mean 0 and standard deviation 1. A one-sample t-test is conducted to test whether the sample mean significantly differs from the true mean (0 in this case). If the p-value of the test is less than the significance level (0.05), the null hypothesis is rejected.

The empirical size is calculated as the number of times the null hypothesis is rejected (the number of false positives) divided by the total number of simulations. This value should be close to the nominal significance level in a well-calibrated test. In this case, the function returns the empirical size, which gives an idea of how often you might incorrectly reject the null hypothesis when it is actually true in real-world applications, assuming the same conditions as in the simulation.

Due to random variation, the empirical size might not exactly match the nominal significance level, but they should be close if the sample size is large enough and the test assumptions are met. This difference between the empirical and nominal size is why simulation studies like this are conducted, to see how well the nominal size matches reality.

In this section, you will learn how the properties of the DGP and the inclusion of covariates in the estimation of an A/B test result influence the performance of your test. Using the code below, you will simulate A/B tests that follow a DGP with or without covariates, and you will see how the performance of your test varies depending on whether you include the covariates in your estimator.

`import numpy as np`

import random

from matplotlib import pyplot as plt

from tqdm import tqdm

import statsmodels.api as smdef fn_variance(data: list, ddof: int = 0) -> float:

"""

Calculate the variance of a given list of data.

Parameters:

data (list): The list of data to calculate the variance for.

ddof (int): Delta Degrees of Freedom. The divisor used in calculations is N - ddof. Default is 0.

Returns:

float: The variance of the data.

"""

n = len(data)

mean = sum(data) / n

return sum((x - mean) ** 2 for x in data) / (n - ddof)

def fn_generate_cov(dim: int) -> np.ndarray:

"""

Generate a covariance matrix of a given dimension.

Parameters:

dim (int): The dimension of the covariance matrix.

Returns:

np.ndarray: The generated covariance matrix.

"""

acc = []

for i in range(dim):

row = np.ones((1,dim)) * corr

row[0][i] = 1

acc.append(row)

return np.concatenate(acc,axis=0)

def fn_generate_multnorm(nobs: int, corr: float, nvar: int) -> np.ndarray:

"""

Generate a multivariate normal distribution.

Parameters:

nobs (int): The number of observations in the distribution.

corr (float): The correlation coefficient.

nvar (int): The number of variables in the distribution.

Returns:

np.ndarray: The generated multivariate normal distribution.

"""

mu = np.zeros(nvar)

std = (np.abs(np.random.normal(loc = 1, scale = .5,size = (nvar,1))))**(1/2)

# generate random normal distribution

acc = []

for i in range(nvar):

acc.append(np.reshape(np.random.normal(mu[i],std[i],nobs),(nobs,-1)))

normvars = np.concatenate(acc,axis=1)

cov = fn_generate_cov(nvar)

C = np.linalg.cholesky(cov)

Y = np.transpose(np.dot(C,np.transpose(normvars)))

return Y

def fn_randomize_treatment(N: int, p: float = 0.5) -> np.ndarray:

"""

Randomize the treatment assignment for a population.

Parameters:

N (int): The total population size.

p (float): The proportion of the population to be treated. Defaults to 0.5.

Returns:

np.ndarray: A binary array where 1 indicates treatment and 0 indicates control.

"""

treated = random.sample(range(N), round(N*p))

return np.array([(1 if i in treated else 0) for i in range(N)]).reshape([N,1])

def split_columns(X: np.ndarray, a: float, b: float, p0: int) -> np.ndarray:

"""

Splits the input array into two sections based on given percentages.

Parameters:

X (np.ndarray): The input array of size (n, p).

a (float): The percentage of the first p0 columns to keep (between 0 and 1).

b (float): The percentage of the remaining columns to keep (between 0 and 1).

p0 (int): The index up to which to apply the first percentage.

Returns:

np.ndarray: The output array containing a% of the first p0 columns and b% of the remaining columns.

"""

if not (0 <= a <= 1 and 0 <= b <= 1):

raise ValueError("a and b must be between 0 and 1.")

if not (0 <= p0 <= X.shape[1]):

raise ValueError("p0 must be between 0 and number of columns in X.")

first_part = X[:, :p0]

second_part = X[:, p0:]

first_indices = np.random.choice(first_part.shape[1], int(a * first_part.shape[1]), replace=False)

second_indices = np.random.choice(second_part.shape[1], int(b * second_part.shape[1]), replace=False)

return np.concatenate((first_part[:, first_indices], second_part[:, second_indices]), axis=1)

def fn_generate_data(tau: float, N: int, p: int, p0: int, corr: float, flagX: bool = False):

"""

Generate synthetic data for experimentation.

Parameters:

tau (float): Treatment effect.

N (int): Number of observations.

p (int): Number of covariates.

p0 (int): Number of covariates with nonzero coefficients.

corr (float): Correlation for multivariate normal.

flagX (bool): If True, return covariates. Defaults to False.

Returns:

tuple: Depending on flagX, either returns (Yab,T) or (Yab,T,X).

"""

X = fn_generate_multnorm(N,corr,p)

T = fn_randomize_treatment(N) # choose treated units

err = np.random.normal(0,1,[N,1])

beta0 = np.random.normal(5,5,[p,1])

beta0[p0:p] = 0 #set the coefficient of all covariates after p0 to 0

Yab = tau*T+X@beta0+err

if flagX==False:

return (Yab,T)

else:

return (Yab,T,X)

def fn_tauhat_means(Yt: np.ndarray, Yc: np.ndarray) -> tuple:

"""

Calculate the treatment effect estimate and its standard error.

Parameters:

Yt (np.ndarray): Outcome for treated units.

Yc (np.ndarray): Outcome for control units.

Returns:

tuple: The treatment effect estimate and its standard error.

"""

nt = len(Yt)

nc = len(Yc)

tauhat = np.mean(Yt)-np.mean(Yc)

se_tauhat = (np.var(Yt,ddof=1)/nt+np.var(Yc,ddof=1)/nc)**(1/2)

return (tauhat,se_tauhat)

def fn_bias_rmse_size(theta0: float, thetahat: float, se_thetahat: float, cval: float = 1.96) -> tuple:

"""

Calculate bias, RMSE, and test size for the parameter estimate.

Parameters:

theta0 (float): The true parameter value.

thetahat (float): The estimated parameter value.

se_thetahat (float): The standard error of the estimated parameter value.

cval (float): The critical value for the hypothesis test. Defaults to 1.96.

Returns:

tuple: The bias, RMSE, and test size for the parameter estimate.

"""

b = thetahat - theta0

bias = np.mean(b)

rmse = np.sqrt(np.mean(b**2))

tval = b/se_thetahat

size = np.mean(1*(np.abs(tval)>cval))

# note size calculated at true parameter value

return (bias,rmse,size)

def fn_run_experiments(tau: float, Nrange: list, p: int, p0: int, corr: float, flagX: bool = False, a: float = None, b: float = None) -> tuple:

"""

Run experiments by generating synthetic data and estimate treatment effect.

Parameters:

tau (float): Treatment effect.

Nrange (list): Range of number of observations.

p (int): Total number of covariates.

p0 (int): Number of covariates with nonzero coefficients.

a (float, optional): Percentage of the first p0 columns to keep (between 0 and 1). Only used if flagX is True.

b (float, optional): Percentage of the remaining columns to keep (between 0 and 1). Only used if flagX is True.

corr (float): Correlation for multivariate normal.

flagX (bool): If True, return covariates. Defaults to False.

Returns:

tuple: The treatment effect estimates and their standard errors, and 95% confidence interval.

Note:

In the flagX == 2 case, the function uses the split_columns function to select a% of the first p0 columns

and b% of the remaining columns from the X data, before performing the regression and estimating the treatment effect.

"""

n_values = []

tauhats = []

sehats = []

lb = []

ub = []

for N in tqdm(Nrange):

n_values = n_values + [N]

if flagX==False:

Yexp,T = fn_generate_data(tau,N,p,p0,corr,flagX)

Yt = Yexp[np.where(T==1)[0],:]

Yc = Yexp[np.where(T==0)[0],:]

tauhat,se_tauhat = fn_tauhat_means(Yt,Yc)

elif flagX==1:

# use the correct covariates in regression

Yexp,T,X = fn_generate_data(tau,N,p,p0,corr,flagX)

covars = np.concatenate([T,X],axis = 1)

mod = sm.OLS(Yexp,covars)

res = mod.fit()

tauhat = res.params[0]

se_tauhat = res.HC1_se[0]

elif flagX==2:

# use fraction a of the correct covariates and fraction b of the remaining covariates

assert a is not None and b is not None, "Please provide valid 'a' and 'b' when flagX is 2"

Yexp,T,X = fn_generate_data(tau,N,p,p0,corr,flagX)

Xreg = split_columns(X,a,b,p0)

covars = np.concatenate([T,Xreg],axis = 1)

mod = sm.OLS(Yexp,covars)

res = mod.fit()

tauhat = res.params[0]

se_tauhat = res.HC1_se[0]

tauhats = tauhats + [tauhat]

sehats = sehats + [se_tauhat]

lb = lb + [tauhat-1.96*se_tauhat]

ub = ub + [tauhat+1.96*se_tauhat]

return (n_values,tauhats,sehats,lb,ub)

def fn_plot_with_ci(n_values: list, tauhats: list, tau: float, lb: list, ub: list, caption: str):

"""

Plot the treatment effect estimates and their 95% confidence intervals.

Parameters:

n_values (list): List of number of observations.

tauhats (list): List of treatment effect estimates.

tau (float): True treatment effect.

lb (list): List of lower bounds of the confidence intervals.

ub (list): List of upper bounds of the confidence intervals.

caption (str): Title for the plot.

"""

fig = plt.figure(figsize = (10,6))

plt.plot(n_values,tauhats,label = '$hat{\tau}$')

plt.xlabel('N')

plt.ylabel('$hat{\tau}$')

plt.axhline(y=tau, color='r', linestyle='-',linewidth=1,

label='True $\tau$={}'.format(tau))

plt.title('{}'.format(caption))

plt.fill_between(n_values, lb, ub,

alpha=0.5, edgecolor='#FF9848', facecolor='#FF9848',label = '95% CI')

plt.legend()

## Experiments with no covariates in the DGP

Below we simulate data that follow a DGP in which the outcome is only influenced by the treatment and a random error.

`y_i = tau*T_i+e_i`

If the outcome is only influenced by the treatment, the estimates of the treatment effect parameter are precise even for relatively small sample sizes and quickly converge to the true parameter value as the sample size increases. In the code below, the value of the parameter `tau`

is set to 2.

`tau = 2`

corr = .5

p = 10

p0 = 0 # number of covariates used in the DGP

Nrange = range(10,1000,2) # loop over N values

(nvalues,tauhats,sehats,lb,ub) = fn_run_experiments(tau,Nrange,p,p0,corr)caption = """Estimates of the treatment effect parameter

for a randomized experiment without covariates"""

fn_plot_with_ci(nvalues,tauhats,tau,lb,ub,caption)

## For a selected sample size, verify that this is the same as running a regression with an intercept.

You can verify that you obtain the same results by running an OLS regression of the outcome on an intercept and the treatment.

`N = 100`

Yexp,T = fn_generate_data(tau,N,10,0,corr)

Yt = Yexp[np.where(T==1)[0],:]

Yc = Yexp[np.where(T==0)[0],:]

tauhat,se_tauhat = fn_tauhat_means(Yt,Yc)

# n_values = n_values + [N]

# tauhats = tauhats + [tauhat]

lb = lb + [tauhat-1.96*se_tauhat]

ub = ub + [tauhat+1.96*se_tauhat]print(f"Parameter estimate and stadard error obtained by calculating the difference in means:{tauhat:.5f},{se_tauhat:.5f}")

const = np.ones([N,1])

model = sm.OLS(Yexp,np.concatenate([T,const],axis = 1))

res = model.fit()

print(f"Parameter estimate and stadard error obtained by running an OLS regression with an intercept:{res.params[0]:.5f},{ res.HC1_se[0]:.5f}")

`Parameter estimate and stadard error obtained by calculating the difference in means:1.91756,0.21187`

Parameter estimate and stadard error obtained by running an OLS regression with an intercept:1.91756,0.21187

## Run R Monte Carlo iterations and compute bias, RMSE, and size

Now you will run Monte Carlo simulations in which you increase the sample size by looping over a list of values for the`N`

parameter. You compute the test’s RMSE, bias, and empirical size for each iteration.

This Python script conducts an experimental simulation to study how the sample size (`N`

) affects the bias, RMSE, and size of A/B test performance when no covariates are considered. Let’s break it down step by step:

1. `estDict = {}`

initializes an empty dictionary to store the experimental results.

2. `R=2000`

sets the number of repetitions for the experiment to 2000.

3. `for N in [10,50,100,500,1000]`

loops over different sample sizes.

4. Inside this loop, `tauhats=[], sehats=[]`

are initialized as empty lists to store the estimated treatment effects `tauhat`

and their corresponding standard errors `se_tauhat`

for each experiment.

5. `for r in tqdm(range(R)):`

loops over R experiments, with a progress bar provided by `tqdm`

.

6. `Yexp,T = fn_generate_data(tau,N,10,0,corr)`

generates synthetic data for each experiment with a predefined treatment effect `tau`

, number of observations `N`

, 10 covariates, no covariates with nonzero coefficients, and a predefined correlation.

7. `Yt=Yexp[np.where(T==1)[0],:]`

and `Yc=Yexp[np.where(T==0)[0],;]`

separates the synthetic data into treated and control groups.

8. `tauhat,se_tauhat=fn_tauhat_means(Yt,Yc)`

calculates the treatment effect estimate and its standard error.

9. `tauhats=tauhats+[tauhat]`

and `sehats=sehats+[se_tauhat]`

appends the treatment effect estimate and its standard error to the corresponding lists.

10. `estDict[N]={‘tauhat':np.array(tauhats).reshape([len(tauahts),1]),’sehat':np.array(sehats).reshape([len(sehats),1])}`

stores the estimates in the dictionary with the sample size as the key.

11. `tau0 = tau*np.ones([R,1])`

creates an array of size R with all elements equal to the true treatment effect.

12. For each sample size in `estDict`

the script calculates and prints the bias, RMSE, and size of the treatment effect estimate using the `fn_bias_rmse_size()`

function.

As expected, bias and RMSE go down as the sample size increases, and the size approaches the true size, 0.05.

`estDict = {}`

R = 2000

for N in [10,50,100,500,1000]:

tauhats = []

sehats = []

for r in tqdm(range(R)):

Yexp,T = fn_generate_data(tau,N,10,0,corr)

Yt = Yexp[np.where(T==1)[0],:]

Yc = Yexp[np.where(T==0)[0],:]

tauhat,se_tauhat = fn_tauhat_means(Yt,Yc)

tauhats = tauhats + [tauhat]

sehats = sehats + [se_tauhat]

estDict[N] = {

'tauhat':np.array(tauhats).reshape([len(tauhats),1]),

'sehat':np.array(sehats).reshape([len(sehats),1])

}tau0 = tau*np.ones([R,1])

for N, results in estDict.items():

(bias,rmse,size) = fn_bias_rmse_size(tau0,results['tauhat'],

results['sehat'])

print(f'N={N}: bias={bias}, RMSE={rmse}, size={size}')

`100%|██████████| 2000/2000 [00:00<00:00, 3182.81it/s]`

100%|██████████| 2000/2000 [00:00<00:00, 2729.99it/s]

100%|██████████| 2000/2000 [00:00<00:00, 2238.62it/s]

100%|██████████| 2000/2000 [00:04<00:00, 479.67it/s]

100%|██████████| 2000/2000 [02:16<00:00, 14.67it/s]

N=10: bias=0.038139125088721144, RMSE=0.6593256331782233, size=0.084

N=50: bias=0.002694446014687934, RMSE=0.29664599979723183, size=0.0635

N=100: bias=-0.0006785229668018156, RMSE=0.20246779253127453, size=0.0615

N=500: bias=-0.0009696751953095926, RMSE=0.08985542730497854, size=0.062

N=1000: bias=-0.0011137216061364087, RMSE=0.06156258265280801, size=0.047

## Experiments with covariates in the DGP

Next, you will add covariates to the DGP. Now the outcome of interest depends not only on the treatment but also on some other variables `X`

. The code below simulates data with 50 covariates included in the DGP. Using the same sample sizes and treatment effect parameter as in the previous simulation without covariates, you can see that this time the estimates are much more noisy, but they still converge to the correct solution.

`y_i = tau*T_i + beta*x_i + e_i`

The plot below compares the estimates from the two DGPs— you can see how much more noisy the estimates are when covariates are introduced in the DGP.

`tau = 2`

corr = .5

p = 100

p0 = 50 # number of covariates used in the DGP

Nrange = range(10,1000,2) # loop over N values

(nvalues_x,tauhats_x,sehats_x,lb_x,ub_x) = fn_run_experiments(tau,Nrange,p,p0,corr)caption = """Estimates of the treatment effect parameter

for a randomized experiment with X's in the DGP but no X's included in the estimator"""

fn_plot_with_ci(nvalues_x,tauhats_x,tau,lb_x,ub_x,caption)

# rerun experiment with no covariates

p0 = 0 # number of covariates used in the DGP

Nrange = range(10,1000,2) # loop over N values

(nvalues_x0,tauhats_x0,sehats_x0,lb_x0,ub_x0) = fn_run_experiments(tau,Nrange,p,p0,corr)

fig = plt.figure(figsize = (10,6))

plt.title("""Estimates of the treatment effect parameter

for a randomized experiment with X's in the DGP but no X's included in the estimator, zoom in for large N""")

plt.plot(nvalues_x[400:],tauhats_x[400:],label = '$hat{\tau}^{(x)}$')

plt.plot(nvalues_x[400:],tauhats_x0[400:],label = '$hat{\tau}$',color = 'green')

plt.legend()

fig = plt.figure(figsize = (10,6))

plt.title("""

Treatment effect estimates from DGP with and without covariates, zoom in for large N

""")

plt.plot(nvalues_x[400:],tauhats_x[400:],label = '$hat{\tau}^{(x)}$')

plt.plot(nvalues_x[400:],tauhats_x0[400:],label = '$hat{\tau}$',color = 'green')

plt.legend()

`100%|██████████| 495/495 [00:41<00:00, 12.06it/s]`

100%|██████████| 495/495 [00:42<00:00, 11.70it/s]

Does repeating the experiment with a much larger sample size alleviate the problem? Not necessarily. Despite the increase in sample size, the estimates still appear quite noisy.

`tau = 2`

corr = .5

p = 100

p0 = 50 # number of covariates used in the DGP

Nrange = range(1000,50000,10000) # loop over N values

(nvalues_x2,tauhats_x2,sehats_x2,lb_x2,ub_x2) = fn_run_experiments(tau,Nrange,p,p0,corr)fn_plot_with_ci(nvalues_x2,tauhats_x2,tau,lb_x2,ub_x2,caption)

## DGP with X — adding covariates to the regression

In this part, you will use the same DGP as before:

`y_i = tau*T_i + beta*x_i + e_i`

Now, you will include these covariates `X `

in the regression model. You will find that this significantly improves the precision of the estimates. However, keep in mind that this is a bit of a “cheat” — in this case, you have included the correct covariates from the beginning.

In a real-world scenario, you may not know which covariates are the “right” ones to include, and it may be necessary to experiment with different models and covariates.

`tau = 2`

corr = .5

p = 100

p0 = 50 # number of covariates used in the DGP

Nrange = range(100,1000,2) # loop over N values

# we need to start with more observations than p

flagX = 1

(nvalues2,tauhats2,sehats2,lb2,ub2) = fn_run_experiments(tau,Nrange,p,p0,corr,flagX)caption = """Estimates of the treatment effect parameter

for a randomized experiment with X's in the DGP,

estimates obtained using regression with the right Xs"""

fn_plot_with_ci(nvalues2,tauhats2,tau,lb2,ub2,caption)

**What happens if you use some X’s that influence the outcome and some that don’t?**

This section examines the inclusion of some relevant and some irrelevant covariates in a regression model. This mimics a real-world scenario where it might not be clear which covariates influence the outcome.

Even though some non-influential variables are included, it can be observed that the overall estimate tends to improve compared to the case when no covariates were included. However, including irrelevant variables may introduce some noise and uncertainty into the estimates, and they may not be as precise as when only the relevant covariates were included.

In conclusion, understanding the influence of covariates in your data is essential to improve the precision and reliability of A/B testing results. This tutorial explores statistical properties of estimators like RMSE, bias, and size and demonstrates how they can be estimated and understood through Monte Carlo simulations. It also highlights the impact of including covariates in the DGP and regression models, emphasizing the importance of careful model selection and hypothesis testing in practice.

`# Use same DGP as before`

tau = 2

corr = .5

p = 100

p0 = 50 # number of covariates used in the DGP

a = 0.9

b = 0.1

Nrange = range(100,1000,2) # loop over N values

# we need to start with more observations than p

flagX = 2

(nvalues3,tauhats3,sehats3,lb3,ub3) = fn_run_experiments(tau,Nrange,p,p0,corr,flagX,a,b)caption = f"""Estimates of the treatment effect parameter

for a randomized experiment with X's in the DGP,

estimates obtained using regression with the {100*a:.1f}% of the correct Xs and {100*b:.1f}% of the irrelevant Xs"""

fn_plot_with_ci(nvalues3,tauhats3,tau,lb3,ub3,caption)

Unless otherwise noted, all images are by the author.