Help with simple survival data simulation and exponential model fitting

I’m doing a really basic hazard model fitting using simulated data. Here is the Stan model:

data {
    int N_subjects;
    vector[N_subjects] time;
    vector[N_subjects] event;
    vector[N_subjects] group;
}
parameters {
    real<lower=0> h_0; // baseline mean survival time
    real B_h1;
}
transformed parameters {
    vector[N_subjects] lambda;

    lambda = (1.0 / h_0) * exp(group * B_h1);
}
model {
    h_0 ~ lognormal(log(180), 0.2);
    B_h1 ~ normal(0, 10);

    target += event .* exponential_lpdf(time | lambda);
    target += (1 - event) .* exponential_lccdf(time | lambda);
}

I’m simulating the data using the following code (this is LLM-generated code, but it seems reasonable enough for this purpose):

import numpy as np
import pandas as pd

N_SUBJECTS = 2000
H0_TRUE = 180.0  # True baseline mean survival time
B_H1_TRUE = -0.6  # The true coefficient we want to recover
CENSOR_RATE = 0.3  # Percentage of subjects to be censored

# --- 2. Simulate Data ---
np.random.seed(42)

# Create the covariate (group)
group = np.random.randint(0, 2, size=N_SUBJECTS)

# Calculate the rate for each subject based on the PH model
# rate = (1 / baseline_mean_time) * hazard_ratio
lambda_ind = (1.0 / H0_TRUE) * np.exp(group * B_H1_TRUE)

# Generate true event times from an exponential distribution
# T = -log(U) / lambda, where U ~ Uniform(0,1)
u = np.random.uniform(0, 1, size=N_SUBJECTS)
true_event_time = -np.log(u) / lambda_ind

# Generate censoring times from a separate exponential distribution
# This creates random right-censoring
censor_time = np.random.exponential(scale=H0_TRUE / CENSOR_RATE, size=N_SUBJECTS)

# Observe the final time and event status
time = np.minimum(true_event_time, censor_time)
event = (true_event_time <= censor_time).astype(int)

# Create a DataFrame
df = pd.DataFrame({"time": time, "event": event, "group": group})
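(One quick sanity check worth running on data simulated this way — this snippet is my own addition, re-running the script above with the same seed: with an exponential censoring time at rate mu = CENSOR_RATE / H0_TRUE, each subject is censored with probability mu / (lambda_i + mu), so CENSOR_RATE only approximates the censored fraction rather than fixing it exactly.)

```python
import numpy as np

# Re-run the simulation above (same seed) and compare the empirical
# censored fraction against its theoretical value.
N_SUBJECTS = 2000
H0_TRUE = 180.0
B_H1_TRUE = -0.6
CENSOR_RATE = 0.3

np.random.seed(42)
group = np.random.randint(0, 2, size=N_SUBJECTS)
lambda_ind = (1.0 / H0_TRUE) * np.exp(group * B_H1_TRUE)
u = np.random.uniform(0, 1, size=N_SUBJECTS)
true_event_time = -np.log(u) / lambda_ind
censor_time = np.random.exponential(scale=H0_TRUE / CENSOR_RATE, size=N_SUBJECTS)
event = (true_event_time <= censor_time).astype(int)

# Theoretical per-subject censoring probability: mu / (lambda_i + mu)
mu = CENSOR_RATE / H0_TRUE
expected = np.mean(mu / (lambda_ind + mu))
print(f"empirical censored fraction: {1 - event.mean():.3f}")
print(f"expected censored fraction:  {expected:.3f}")
```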

For some reason, I’m able to recover the true B_h1 much more closely using sksurv’s CoxPHSurvivalAnalysis, while the Stan fit produces the following estimate:

            Mean      MCSE  ...  ESS_tail    R_hat
h_0   200.653000  0.004460  ...   1536.30  1.00487
B_h1   -0.377075  0.000034  ...   1569.86  1.00420

(For reference, the CoxPHSurvivalAnalysis model produces an estimate of -0.54)

I’m completely stumped here - what could possibly be going wrong? I’m new to Stan and just trying to figure out how to get through the basics, but somehow got caught up with a very simple model.

Thank you for the help!

Hi @ruben-s , welcome to the Stan community!

There is an error in how the target += statements are written: you aren’t actually fitting the model you intended to fit. If you write it out iteratively instead, you should get the results you are looking for:

for (n in 1:N_subjects) {
    if (event[n] == 1)
        target += exponential_lpdf(time[n] | lambda[n]);
    else
        target += exponential_lccdf(time[n] | lambda[n]);
}

When I run this updated model using your data generation script, I get a posterior mean for B_h1 of -0.561 (and a 90% interval of (-0.65, -0.47)).

The core disconnect here lies in how Stan’s vectorization works. When you write exponential_lpdf(time | lambda) where time and lambda are vectors, this does not produce a vector of log probabilities; it produces a single real number, the sum of those log probabilities. So when you write event * exponential_lpdf(time | lambda), this produces a vector that is 0 wherever event is 0 and \sum_n \log p(t_n | \lambda_n) wherever event is 1 — which is not the likelihood you intended.
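You can see the same mismatch in plain NumPy (a toy illustration with made-up numbers, not the Stan API): multiplying the indicator vector by a summed log density is not the same as summing the elementwise products.

```python
import numpy as np

event = np.array([1, 0, 1])
time = np.array([2.0, 5.0, 1.0])
lam = np.array([0.5, 0.2, 0.8])

# Per-subject exponential log density: log(lambda) - lambda * t
logp = np.log(lam) - lam * time

# What Stan's vectorized exponential_lpdf returns: one summed scalar
summed = logp.sum()

# The buggy model effectively adds event * summed,
# not the per-subject contributions:
buggy = (event * summed).sum()
correct = (event * logp).sum()
print(buggy, correct)  # these differ
```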

(As a side note: the Stan code you provided is not valid — event .* exponential_lpdf(time | lambda) should not compile, since the left-hand side is a vector and the right-hand side is a real number. I’m assuming you meant *, but it’s unclear.)

Some other thoughts on the code provided. You declare all your input data as vectors, which can be fine, but since event and group are binary flags, it may be more appropriate to declare them as array[N_subjects] int<lower=0, upper=1>. This performs data validation on the inputs, which can catch some pernicious bugs.

Also, you construct lambda within the transformed parameters block. Since group takes only two values, you are performing unnecessary calculations (and saving the N_subjects-length vector as output on every iteration). Instead, you can drop the transformed parameters block and compute lambda within the model block, like:

for (n in 1:N_subjects) {
    real lambda = (1.0 / h_0) * exp(group[n] * B_h1);
    if (event[n] == 1)
        target += exponential_lpdf(time[n] | lambda);
    else
        target += exponential_lccdf(time[n] | lambda);
}
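(As a side check — this is my own sketch, not part of the Stan fit — the censored log likelihood that the corrected loop accumulates can be written directly in NumPy; for the exponential distribution, the log CCDF in exponential_lccdf is simply -lambda * t.)

```python
import numpy as np

def exp_censored_loglik(time, event, group, h_0, b_h1):
    """Censored exponential log likelihood matching the corrected Stan loop."""
    lam = (1.0 / h_0) * np.exp(group * b_h1)
    logpdf = np.log(lam) - lam * time   # exponential_lpdf
    logccdf = -lam * time               # exponential_lccdf (log survival)
    # Events contribute the density; censored subjects the survival function
    return np.sum(np.where(event == 1, logpdf, logccdf))

# Toy usage with two subjects, one observed and one censored
ll = exp_censored_loglik(
    time=np.array([1.0, 2.0]),
    event=np.array([1, 0]),
    group=np.array([0, 1]),
    h_0=180.0,
    b_h1=-0.6,
)
print(ll)
```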

Wow, thank you so much for the awesome explanation. I really appreciate it! I can’t believe I got so much wrong in such a small amount of code…haha
