Unbox the Cox: A Hidden Dark Secret of Cox Regression | by Igor Šegota | Jun, 2023

Why do perfect predictors result in a p-value of 0.93?

Photo by Dima Pechurin on Unsplash

If you have been following my previous blog posts, you might recall that logistic regression encounters a problem when trying to fit perfectly separated data, leading to an infinite odds ratio. In Cox regression, where hazard replaces odds, you might wonder if a similar issue arises with perfect predictors. It does occur, but unlike logistic regression, it is much less apparent how this occurs here and even what constitutes “perfect predictors”. As will become more clear later, perfect predictors are defined as predictors x whose ranks exactly match the ranks of event times (their Spearman correlation is one).

Previously, on “Unbox the Cox”:

we explained maximum likelihood estimation and introduced a made-up dataset with five subjects, where a single predictor, x, represented the dosage of a life-extending drug. To make x a perfect predictor of event times, here we swapped the event times for subjects C and D:

import numpy as np
import pandas as pd
import plotnine as p9

from cox.plots import (

perfect_df = pd.DataFrame({
'subject': ['A', 'B', 'C', 'D', 'E'],
'time': [1, 3, 4, 5, 6],
'event': [1, 1, 1, 1, 0],
'x': [-1.7, -0.4, 0.0, 0.9, 1.2],

plot_subject_event_times(perfect_df, color_map='x')

Image by the author.

To understand why these “perfect predictors” can be problematic, let’s pick up right where we left off and check the negative log-likelihood cost plotted against β:

negloglik_sweep_betas_perfect_df = neg_log_likelihood_all_subjects_sweep_betas(
betas=np.arange(-5, 5, 0.1)
plot_cost_vs_beta(negloglik_sweep_betas_perfect_df, width=0.1)
Image by the author.

You can see right away that there is no minimum value of β any longer: if we use very large negative values of β, we end up with log-likelihood fits that are near-perfect for all events.

Now, let’s dive into the math behind this and take a look at the likelihood of event A. We’ll dig into how the numerator and denominator change as we tweak β:


When β is high or a large positive number, the last term in the denominator (with the largest x of 1.2), representing the hazard of subject E, dominates the entire denominator and becomes exceedingly large. So, the likelihood becomes small and approaches zero:


This results in a big negative log-likelihood. The same thing happens for each individual likelihood because the last hazard of subject E, will always exceed any hazard in the numerator. As a result, the negative log-likelihood increases for subjects A to D. In this case, when we have high β, it brings down all the likelihoods, resulting in poor fits for all events.

Now, when β is low or a big negative number, the first term in the denominator, representing the hazard of subject A, dominates since it has the lowest x value. As the same hazard of subject A also appears in the numerator, the likelihood L(A) can be arbitrarily close to 1 by making β increasingly negative, thereby creating an almost perfect fit:


The same deal goes for all the other individual likelihoods: negative βs now boost the likelihoods of all events at the same time. Basically, having a negative β doesn’t come with any downsides. At the same time, certain individual hazards increase (subjects A and B with negative x), some stay the same (subject C with x = 0), and others decrease (subject D with positive x). But remember, what really matters here are ratios of hazards. We can verify this hand-waving math by plotting individual hazards:

def plot_likelihoods(df, ylim=[-20, 20]):
betas = np.arange(ylim[0], ylim[1], 0.5)
subjects = df.query("event == 1")['subject'].tolist()
likelihoods_per_subject = []
for subject in subjects:
likelihoods = [
np.exp(log_likelihood(df, subject, beta))
for beta in betas
'beta': betas,
'likelihood': likelihoods,
'subject': [subject] * len(betas),
lik_df = pd.concat(likelihoods_per_subject)
return (
p9.ggplot(lik_df, p9.aes('beta', 'likelihood', color='subject'))
+ p9.geom_line(size=2)
+ p9.theme_classic()


Image by the author.

The way likelihoods are put together, as a ratio of a hazard to the sum of hazards of all subjects still at risk, means that negative β values make a perfect fit for the likelihood of each subject whose event time rank is greater than or equal to the predictor rank! As a side note, if x had a perfect negative Spearman correlation with event times, things would be flipped around: arbitrarily positive βs would give us arbitrarily good fits.

Source link

Leave a Comment