#### How do hazards and maximum likelihood estimates predict event rankings?

#### Introduction

The goal of Cox regression is to model the relationship between predictor variables and the time it takes for an event to happen, typically an event that can occur only once. Let’s dive into a made-up dataset with 5 subjects, labeled A to E. During the study, each subject either experienced an event (event = 1) or not (event = 0). On top of that, each subject was assigned a single predictor, let’s call it *x*, before the study. As a practical example, if we are tracking deaths, then *x* could be the dosage of a drug we are testing to see if it helps people live longer by delaying the time until death.

```python
import pandas as pd
import numpy as np

sample_df = pd.DataFrame({
    'subject': ['A', 'B', 'C', 'D', 'E'],
    'time': [1, 3, 5, 4, 6],
    'event': [1, 1, 1, 1, 0],
    'x': [-1.7, -0.4, 0.0, 0.9, 1.2],
})
sample_df
```

In this dataset, subject E did not experience the event during the study, so we set event = 0, and the time assigned is simply the last moment we knew about them. This kind of data is called “censored” because we do not know whether the event happened after the study ended. To make it easier to understand, a “lollipop” 🍭 plot comes in handy for visualizing this type of data:

```python
from cox.plots import plot_subject_event_times

plot_subject_event_times(sample_df)
```

(Plotting functions are available in my GitHub repo.)

Just a heads up: I want to make the main ideas of Cox regression easier to grasp, so we will focus on data where only one event can happen at a given time (no ties).

### Hazards

Hazard represents the instantaneous rate (probability per unit time) at which an event can occur at a specific time, assuming the event has not occurred before that point. Since it is a rate, it can have arbitrary units, and unlike a probability, hazard values can range from 0 to infinity: [0, ∞).

In Cox regression, the hazard plays a role similar to the odds in logistic regression. In logistic regression, odds = p/(1-p) takes probabilities from the range [0, 1] and transforms them to the range [0, ∞). Taking the logarithm then converts the odds to log-odds, which can take values from negative infinity to positive infinity: (-∞, ∞). This log-odds transformation of probabilities is done to match the possible output values to the linear combination of predictors *β₁x₁ + β₂x₂ + …*, which can also range over (-∞, ∞).
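As a quick numeric illustration of that chain of transformations (the probability values here are chosen arbitrarily):

```python
import numpy as np

p = np.array([0.1, 0.5, 0.9])   # probabilities in [0, 1]
odds = p / (1 - p)              # mapped to [0, inf)
log_odds = np.log(odds)         # mapped to (-inf, inf)
print(odds)                     # p = 0.5 gives odds of exactly 1
print(log_odds)                 # symmetric around 0
```

Note how probabilities symmetric around 0.5 become log-odds symmetric around 0.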

Here, we start with a hazard *h(t, x)* that already has values ranging from 0 to infinity [0, ∞). By applying a logarithm, we transform that range to (-∞, ∞), allowing it to be fitted with a linear combination of predictors:

log *h(t, x)* = *β₀(t)* + *β₁(t)x₁* + *β₂(t)x₂* + …

In Cox regression, there’s another assumption that helps make things easier. It assumes that all the time dependence of the log-hazard is packed into the intercept term:

log *h(t, x)* = *β₀(t)* + *β₁x₁* + *β₂x₂* + …

The motivation behind this assumption is to simplify the fitting process significantly (we will show how in a moment). In the literature, the intercept term *β₀(t)* is usually moved to the left side of this equation and expressed as a *baseline hazard* *log[h₀(t)]*:

log *h(t, x)* − log *h₀(t)* = *β₁x₁* + *β₂x₂* + … = *βx*

From this equation, we can express the hazard *h(t, x)*:

*h(t, x)* = *h₀(t)*·exp(*βx*)

Now, here is the cool part: since the data from each subject only comes to hazard through predictors *x*, every subject’s hazard has the same time dependence. The only difference lies in the *exp(βx)* part, which makes hazards from different subjects proportional to each other. That’s why this model is also called a “proportional hazard” model.
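This proportionality is easy to check numerically. Here is a small sketch with a made-up baseline hazard (the 0.5·t shape and the β value are arbitrary assumptions, purely for illustration):

```python
import numpy as np

t = np.linspace(0.1, 5, 50)      # time grid
h0 = 0.5 * t                     # assumed baseline hazard (arbitrary shape)
beta = -1.0                      # assumed coefficient
x_a, x_b = -1.7, 0.9             # predictors for two subjects

h_a = h0 * np.exp(beta * x_a)    # hazard for the subject with x = -1.7
h_b = h0 * np.exp(beta * x_b)    # hazard for the subject with x = 0.9

ratio = h_a / h_b                # constant over time: exp(beta * (x_a - x_b))
print(ratio[0], ratio[-1])       # same value at every t
```

Whatever shape we pick for *h₀(t)*, the ratio of the two hazards stays flat in time.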

We have been drawing a lot of parallels to logistic regression. If you read my previous post on logistic regression, you may be wondering whether Cox regression also suffers from predictors being “too good”. Stay tuned for the next post, which will cover that!

### Likelihoods

Cox models are fit using a method called Maximum Likelihood Estimation (MLE). Likelihoods are quite similar to probabilities: they share the same equation, almost like two sides of the same coin. Probabilities are functions of the data *x*, with fixed model parameters *β*, while likelihoods are functions of the *β*s, with fixed *x*. It’s like looking at the probability density of a Normal distribution, but instead of focusing on *x*, we focus on *μ* and *σ*.

The MLE fitting process begins with a rank-order of event occurrence. In our made-up data this order is: A, B, D, C, with E censored. This is the only instance where time comes into play in Cox regression. **The actual numeric values of event times do not matter at all, as long as the subjects experience the events in the same order.**
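That claim can be checked directly: any monotone transformation of the times leaves every risk set, and therefore the partial likelihood, unchanged. A minimal sketch:

```python
import numpy as np

event = np.array([1, 1, 1, 1, 0])      # subjects A..E; E is censored
original = np.array([1, 3, 5, 4, 6])   # event/censoring times

def risk_sets(times):
    # For each observed event, the set of subject indices still at risk
    return [frozenset(np.flatnonzero(times >= t)) for t in times[event == 1]]

stretched = original ** 3              # a monotone transform: same ordering
print(risk_sets(original) == risk_sets(stretched))  # True
```

Since the partial likelihood is built entirely from these risk sets, the fitted *β* is identical for both time encodings.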

We then move on to each subject one by one and estimate the probability or likelihood of that subject experiencing an event compared to all the other subjects who are *still at risk* of having an event. For instance, take subject A, who experienced an event at t = 1. The likelihood of this happening is determined by the hazard rate at which subject A experiences the event, relative to the combined hazard rates of everyone else who is still at risk at t = 1 (which includes everyone):

*L*(A) = *h*(A) / [*h*(A) + *h*(B) + *h*(C) + *h*(D) + *h*(E)]

where *h*(S) = *h₀(t)*·exp(*βx*(S)) denotes the hazard of subject S with predictor value *x*(S).

As you may have noticed, we did not bother defining the baseline hazard *h₀(t)* because it actually cancels out completely from the likelihood calculation.

Once we substitute the values of *x* (-1.7, -0.4, 0.0, 0.9, 1.2) for each subject, we end up with an equation that only has *β* left:

*L*(A) = exp(−1.7*β*) / [exp(−1.7*β*) + exp(−0.4*β*) + exp(0.0*β*) + exp(0.9*β*) + exp(1.2*β*)]
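Plugging numbers in at a trial value of *β* makes this concrete (the choice β = −1 is arbitrary, just for illustration):

```python
import numpy as np

x = np.array([-1.7, -0.4, 0.0, 0.9, 1.2])   # predictors for subjects A..E
beta = -1.0                                  # arbitrary trial value

# Likelihood of subject A's event relative to everyone at risk at t = 1
L_A = np.exp(beta * x[0]) / np.sum(np.exp(beta * x))
print(round(L_A, 3))  # ~0.631
```

At this trial *β*, subject A carries about 63% of the total hazard at t = 1.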

From this point on, for any time after t = 1, the hazard of subject A is considered zero and is not taken into account when calculating further likelihoods. For instance, at t = 3, subject B experiences an event, so the likelihood for subject B is determined relative to the hazards of subjects B through E only:

*L*(B) = exp(−0.4*β*) / [exp(−0.4*β*) + exp(0.0*β*) + exp(0.9*β*) + exp(1.2*β*)]

We could keep going and calculate the likelihoods for all subjects A through D, but we will cover that in the coding part in the next section. Subject E, being censored and never experiencing an event, does not have its own likelihood. Since the censored data only enters through the likelihoods of uncensored subjects, the resulting combined likelihood is often referred to as the “partial likelihood”.

To summarize this process, we can create an animated lollipop plot:

```python
from cox.plots import animate_subject_event_times_and_mark_at_risk

animate_subject_event_times_and_mark_at_risk(sample_df).save(
    '../images/cox_likelihood_fitting_sample.gif'
)
```

### Finding β

When events occur independently of each other, the joint probability or likelihood of observing all events can be calculated by multiplying the individual likelihoods: *L* = *L*(A) *L*(B) *L*(C) *L*(D). However, multiplying exponential expressions can lead to numerical errors, so we often take the logarithm of this likelihood. By applying the logarithm, we transform the product of likelihoods into a sum of log-likelihoods:

log *L* = log *L*(A) + log *L*(B) + log *L*(C) + log *L*(D)

Since the logarithm is a monotonic function, both the likelihood and the log-likelihood reach their maximum point at the same value of *β*. To facilitate visualization and enable comparison with other cost functions, we can define a cost as the negative log-likelihood and aim to minimize it instead.

### Ready, Set, Code!

We can implement this algorithm in Python, step by step. First, we need to extract the event time and predictor *x* for each uncensored subject. This can be accomplished using the function event_time_and_x_from_subject(). Once we have the event time for a subject, we can subset our data frame to identify the rows corresponding to all subjects who are still at risk. This is achieved using the function subjects_at_risk_data(). Finally, we calculate the log-likelihood for each subject using the function log_likelihood():

```python
def event_time_and_x_from_subject(df, subject):
    subject_with_event_df = df.query(f"subject == '{subject}' & event == 1")
    if subject_with_event_df.empty:  # Censored subjects
        return (np.inf, 0)
    return subject_with_event_df.iloc[0][['time', 'x']]


def subjects_at_risk_data(df, subject):
    time = event_time_and_x_from_subject(df, subject)[0]
    return df.query(f'time >= {time}')


def log_likelihood(df, subject, beta):
    x_subjects_at_risk = subjects_at_risk_data(df, subject)['x']
    x_subject = event_time_and_x_from_subject(df, subject)[1]
    at_risk_hazards = np.exp(beta * x_subjects_at_risk)
    return beta * x_subject - np.log(np.sum(at_risk_hazards))
```

For visualization purposes, we plot the cost or negative log-likelihoods. Therefore, we need to compute these values for each subject at a specific *β* value:

```python
def neg_log_likelihood_for_all_subjects(df, beta):
    subjects = df.query("event == 1")['subject'].tolist()
    neg_log_likelihoods = [
        -log_likelihood(df, subject, beta)
        for subject in subjects
    ]
    return pd.DataFrame({
        'subject': subjects,
        'neg_log_likelihood': neg_log_likelihoods
    })
```

To find the minimum cost, we sweep through the range of *β* values:

```python
def neg_log_likelihood_all_subjects_sweep_betas(df, betas=np.arange(-5, 5, 0.1)):
    loglikelihoods_per_beta = []
    for beta in betas:
        beta_df = neg_log_likelihood_for_all_subjects(df, beta)
        beta_df.insert(0, 'beta', beta)  # Add beta column
        loglikelihoods_per_beta.append(beta_df)
    return pd.concat(loglikelihoods_per_beta)


negloglik_sweep_betas_df = neg_log_likelihood_all_subjects_sweep_betas(sample_df)
negloglik_sweep_betas_df
```

### Making sense of it all

Rather than aggregating the data frame by summing the log-likelihoods grouped by subject, we can keep it in its current form and visualize it as stacked bar charts. In this visualization, the total height of each bar corresponds to the sum of negative log-likelihoods. Each subject is represented by a different color within the bar, indicating their individual likelihood and contribution to the overall likelihood:

```python
from cox.plots import plot_cost_vs_beta

plot_cost_vs_beta(negloglik_sweep_betas_df, width=0.1)
```

Let’s understand how to interpret this plot.

First, note that the negative log-likelihood (cost) is large when the likelihood and hazards are small. Think of the y-axis as being similar to -log(p-value): larger values indicate lower probability.

Second, censored subjects (like subject E) do not have their own individual likelihoods, so they don’t appear on the plot. However, their contribution is incorporated into the likelihoods of subjects A to D.

Now, consider different scenarios based on the value of *β*:

- If *β* is large and negative, the events of subjects A, B, and C (with *x ≤ 0*) are fitted nearly perfectly; their likelihoods are all close to one. However, fitting such large negative *β* values for A, B, and C comes at the cost of subject D. In this range of *β*, the probability of subject D (with *x > 0*) having an event is very low, so the total cost is dominated by the small likelihood of subject D.
- If *β* is large and positive, the hazard of subject D (with *x > 0*) becomes significant compared to the other hazards, and the purple section (subject D) becomes a small part of the total cost. However, since subjects A, B, and C all have *x ≤ 0*, the cost of fitting a large *β* for them is high. Consequently, these subjects dominate the total cost.
- The optimal value of *β*, located around −2 on the plot, strikes a balance between assigning high probabilities to the events of subjects A, B, and C versus subject D. This optimal value can be verified numerically:

```python
(
    negloglik_sweep_betas_df
    .groupby("beta")
    .agg(sum_neg_log_likelihood=('neg_log_likelihood', 'sum'))
    .reset_index()
    .sort_values('sum_neg_log_likelihood')
    .head(1)
)
```
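The sweep only has a resolution of 0.1, so the minimum can also be cross-checked with a continuous optimizer. Here is a self-contained sketch, with the four risk sets written out by hand from the event order A, B, D, C (scipy’s minimize_scalar is just one convenient choice):

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = {'A': -1.7, 'B': -0.4, 'C': 0.0, 'D': 0.9, 'E': 1.2}
events = [('A', ['A', 'B', 'C', 'D', 'E']),  # t = 1: everyone at risk
          ('B', ['B', 'C', 'D', 'E']),       # t = 3: A is out
          ('D', ['C', 'D', 'E']),            # t = 4
          ('C', ['C', 'E'])]                 # t = 5; E stays censored at t = 6

def neg_log_partial_likelihood(beta):
    # Sum of negative log-likelihoods over the four observed events
    return -sum(
        beta * x[subj] - np.log(sum(np.exp(beta * x[s]) for s in at_risk))
        for subj, at_risk in events
    )

result = minimize_scalar(neg_log_partial_likelihood, bounds=(-5, 5), method='bounded')
print(round(result.x, 2), round(result.fun, 2))  # ~ -1.71 and ~ 2.64
```

The minimizer lands on the same *β* as the grid search, just with more decimal places.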

### Using lifelines library

Now that we have a better understanding of how Cox regression works, we can apply it to sample data using Python’s lifelines library to verify the results. Here is a code snippet for our made-up data:

```python
from lifelines import CoxPHFitter

sample_cox_model = CoxPHFitter()
sample_cox_model.fit(sample_df, duration_col='time', event_col='event', formula='x')
sample_cox_model.print_summary()
```

In the output, we can observe the coefficient (coef) value of -1.71, which corresponds to the β coefficient. Next to it, we have exp(coef), representing exp(β), as well as columns indicating standard errors and confidence intervals. The “partial log-likelihood” value is -2.64, which matches our manual result.
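One way to read the exp(coef) column: it is the hazard ratio per one-unit increase in *x*. A quick check using the coefficient reported above:

```python
import numpy as np

coef = -1.71                   # beta reported by lifelines for this data
hazard_ratio = np.exp(coef)    # the exp(coef) column
print(round(hazard_ratio, 2))  # ~0.18: each unit of x multiplies the hazard by ~0.18
```

In the drug-dosage framing from the introduction, a hazard ratio below 1 means higher doses are associated with a lower instantaneous risk of the event.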

Lastly, it is important to mention that Cox regression implementations also offer an extension that allows the model to handle multiple events tied at the same event time. However, this goes beyond the scope of this discussion.

### Conclusion

There was a lot to “unbox” here:

- Cox regression models the association between predictor variables and the rank-order of times to an event occurrence.
- The actual numeric values of event times do not matter at all, as long as the subjects experience the events in the same order.
- Hazards are probabilities per unit time and can have arbitrary units, while likelihoods are related to the probability of the observed events occurring.
- Stacked bar charts can provide insights into maximum likelihood estimation by stacking individual negative log-likelihoods and exploring how they change with the predictors *x*.
- By striking a balance between assigning probabilities to events for various subjects, MLE finds the *β* for which the observed data is *most likely to happen*.

#### To be continued… 👀

### References

- Code with plots: https://github.com/igor-sb/blog/blob/main/posts/cox/plots.py
- Survival Data Analysis slides from Prof. Patrick Breheny: https://myweb.uiowa.edu/pbreheny/7210/f19/index.html
- ChatGPT for text cleanup and funny hallucinations

Unbox the Cox: Intuitive Guide to Cox Regressions was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.