Survival analysis with simulated data - the model doesn't recover the parameters that I used to build the data

Hi,

I built a simulated dataset of cat adoptions in an animal care facility. The goal is to determine whether black cats are less likely to be adopted than non-black cats. It is inspired by a dataset from Richard McElreath’s lectures (Statistical Rethinking Winter 2019 Lecture 13 - YouTube).

I simulated waiting times from two exponential distributions: lambda = 0.02 for black cats and lambda = 0.03 for non-black cats. I then randomly added censoring to indicate that some of the cats had not yet been adopted when the study ended.

```
black <- round(rexp(400, 0.02))   # black cats: true rate 0.02
other <- round(rexp(400, 0.03))   # non-black cats: true rate 0.03

d <- data.frame(days_to_event = c(black, other),
                color_id = c(rep(1, 400), rep(2, 400)),
                adopted = rbinom(800, 1, 0.5))   # one label per cat, 800 rows
```

This is the model that I wrote:

```
data{
    int N;
    int adopted[N];            // 1 = adopted, 0 = censored at study end
    vector[N] days_to_event;
    int color_id[N];           // 1 = black, 2 = other
}
parameters{
    vector[2] a;
}
transformed parameters{
    vector[N] lambda;
    vector[N] mu;
    for ( i in 1:N ) {
        mu[i] = exp(a[color_id[i]]);   // mean days to adoption by color
    }
    for ( i in 1:N ) {
        lambda[i] = 1 / mu[i];         // exponential rate
    }
}
model{
    a ~ normal( 0 , 1 );
    // censored cats contribute the log survival probability P(T > t)
    for ( i in 1:N )
        if ( adopted[i] == 0 ) target += exponential_lccdf(days_to_event[i] | lambda[i]);
    // adopted cats contribute the exponential density of the observed waiting time
    for ( i in 1:N )
        if ( adopted[i] == 1 ) days_to_event[i] ~ exponential( lambda[i] );
}
generated quantities{
    real pred[N];
    for ( i in 1:N )
        if ( adopted[i] == 1 ) pred[i] = exponential_rng( lambda[i] );
    // note: for censored cats this stores a log probability, not a simulated time
    for ( i in 1:N )
        if ( adopted[i] == 0 ) pred[i] = exponential_lccdf(days_to_event[i] | lambda[i]);
}
```
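For reference, the two likelihood loops in the `model` block implement the usual right-censored factorization: an adopted cat contributes the exponential density of its waiting time, while a still-unadopted cat contributes its survival probability, which is what `exponential_lccdf` returns on the log scale:

$$
L(\lambda) \;=\; \prod_{i\,:\,\text{adopted}} \lambda_{c_i}\, e^{-\lambda_{c_i} t_i}
\;\times\; \prod_{i\,:\,\text{censored}} e^{-\lambda_{c_i} t_i}
$$

where $t_i$ is `days_to_event[i]` and $c_i$ is `color_id[i]`.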

I ran the model:

```
dat <- list(
    N = nrow(d),
    days_to_event = d$days_to_event,
    color_id = d$color_id,
    adopted = d$adopted
)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
fit <- stan("model.stan", data = dat)
```

and obtained the following results:

```
print(fit, "a")
Inference for Stan model: censored_cats_gq.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
a[1] 4.64       0 0.07  4.5 4.59 4.64 4.68  4.79  3271    1
a[2] 4.25       0 0.07  4.1 4.20 4.25 4.29  4.39  3617    1

Samples were drawn using NUTS(diag_e) at Mon Apr 12 12:12:12 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

The problem is that:

```
lambda black cats     = 1/exp(4.64) = 0.009657698
lambda non-black cats = 1/exp(4.25) = 0.01426423
```

which are quite different from the lambdas that I used to simulate the data (0.02 and 0.03).

What am I doing wrong? Feels like I’m missing something really obvious.
Thanks!


I was generating censored observations randomly, when I should have been censoring the observations beyond a specific threshold:

```
library(dplyr)   # for mutate() / if_else(); black and other re-drawn with n = 1000

d <- data.frame(days_to_event = c(black, other),
                color_id = c(rep(1, 1000), rep(2, 1000)))
d <- d %>% mutate(adopted = if_else(days_to_event >= 200, 0, 1))
```

Results here:

```
Inference for Stan model: model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
a[1] 4.01       0 0.03 3.95 3.99 4.01 4.03  4.08  3027    1
a[2] 3.50       0 0.03 3.43 3.47 3.49 3.52  3.56  3368    1

Samples were drawn using NUTS(diag_e) at Wed Apr 14 15:46:05 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

> 1/exp(4.01)
[1] 0.0181334
> 1/exp(3.50)
[1] 0.03019738
```
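As a sanity check on why the first fit landed near half the true rates: when the adoption labels are drawn independently of the waiting times, every cat still contributes its full waiting time to the likelihood, but only about half of them count as events. The censored exponential MLE is (number of events) / (total time at risk), so it ends up near lambda/2, consistent with the 0.0097 and 0.0143 above. A quick sketch of that arithmetic (in Python just to keep it self-contained; the same calculation works in R):

```python
import random

random.seed(2021)

lam = 0.02   # true adoption rate for black cats
n = 400

times = [random.expovariate(lam) for _ in range(n)]
# random "adopted" labels, independent of the waiting times,
# mimicking the original rbinom(..., 1, 0.5) censoring
adopted = [random.random() < 0.5 for _ in range(n)]

events = sum(adopted)           # cats counted as adopted
total_time = sum(times)         # every cat contributes its full time
lam_hat = events / total_time   # exponential MLE under censoring

print(round(lam_hat, 4))        # close to lam / 2 = 0.01, not 0.02
```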

I tried to delete this post because the problem was not related to the model specification or the Stan code, but it seems only an admin can do that.
