Survival analysis with simulated data - the model doesn't recover the parameters that I used to build the data

Hi,

I built a simulated dataset of cat adoptions in an animal care facility. The goal is to determine whether black cats are less likely to be adopted than non-black cats. It is inspired by a dataset from Richard McElreath’s lectures (Statistical Rethinking Winter 2019 Lecture 13 - YouTube).

I simulated waiting times from two exponential distributions: lambda = 0.02 for black cats and lambda = 0.03 for non-black cats. I then randomly added censoring to indicate that some of the cats had not yet been adopted when the study ended.

```
black <- round(rexp(400, 0.02))   # black cats: true rate 0.02
other <- round(rexp(400, 0.03))   # non-black cats: true rate 0.03

d <- data.frame(days_to_event = c(black, other),
                color_id = c(rep(1, 400), rep(2, 400)),
                adopted = rbinom(800, 1, 0.5))   # one label per cat, 800 rows
```

This is the model that I wrote:

```
data{
    int N;
    int adopted[N];            // 1 = adopted, 0 = censored at study end
    vector[N] days_to_event;
    int color_id[N];           // 1 = black, 2 = other
}
parameters{
    vector[2] a;
}
transformed parameters{
    vector[N] lambda;
    vector[N] mu;
    for ( i in 1:N ) {
        mu[i] = exp(a[color_id[i]]);   // mean days to adoption by color
    }
    for ( i in 1:N ) {
        lambda[i] = 1 / mu[i];         // exponential rate
    }
}
model{
    a ~ normal( 0 , 1 );
    // censored cats contribute the log survival probability P(T > t)
    for ( i in 1:N )
        if ( adopted[i] == 0 ) target += exponential_lccdf(days_to_event[i] | lambda[i]);
    // adopted cats contribute the exponential density of the observed waiting time
    for ( i in 1:N )
        if ( adopted[i] == 1 ) days_to_event[i] ~ exponential( lambda[i] );
}
generated quantities{
    real pred[N];
    for ( i in 1:N )
        if ( adopted[i] == 1 ) pred[i] = exponential_rng( lambda[i] );
    // note: for censored cats this stores a log probability, not a simulated time
    for ( i in 1:N )
        if ( adopted[i] == 0 ) pred[i] = exponential_lccdf(days_to_event[i] | lambda[i]);
}
```
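For reference, the two likelihood loops in the `model` block implement the usual right-censored factorization: an adopted cat contributes the exponential density of its waiting time, while a still-unadopted cat contributes its survival probability, which is what `exponential_lccdf` returns on the log scale:

$$
L(\lambda) \;=\; \prod_{i\,:\,\text{adopted}} \lambda_{c_i}\, e^{-\lambda_{c_i} t_i}
\;\times\; \prod_{i\,:\,\text{censored}} e^{-\lambda_{c_i} t_i}
$$

where $t_i$ is `days_to_event[i]` and $c_i$ is `color_id[i]`.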

I ran the model:

```
dat <- list(
    N = nrow(d),
    days_to_event = d$days_to_event,
    color_id = d$color_id,
    adopted = d$adopted
)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
parallel:::setDefaultClusterOptions(setup_strategy = "sequential")
fit <- stan("model.stan", data = dat)
```

and obtained the following results:

```
print(fit, "a")
Inference for Stan model: censored_cats_gq.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
a[1] 4.64       0 0.07  4.5 4.59 4.64 4.68  4.79  3271    1
a[2] 4.25       0 0.07  4.1 4.20 4.25 4.29  4.39  3617    1

Samples were drawn using NUTS(diag_e) at Mon Apr 12 12:12:12 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
```

The problem is that:

```
lambda black cats     = 1/exp(4.64) = 0.009657698
lambda non-black cats = 1/exp(4.25) = 0.01426423
```

which are quite different from the lambdas that I used to simulate the data (0.02 and 0.03).

What am I doing wrong? Feels like I’m missing something really obvious.
Thanks!


I was generating censored observations randomly, when I should have been censoring the observations beyond a specific threshold:

```
library(dplyr)   # for mutate() / if_else(); black and other re-drawn with n = 1000

d <- data.frame(days_to_event = c(black, other),
                color_id = c(rep(1, 1000), rep(2, 1000)))
d <- d %>% mutate(adopted = if_else(days_to_event >= 200, 0, 1))
```

Results here:

```
Inference for Stan model: model.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

     mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
a[1] 4.01       0 0.03 3.95 3.99 4.01 4.03  4.08  3027    1
a[2] 3.50       0 0.03 3.43 3.47 3.49 3.52  3.56  3368    1

Samples were drawn using NUTS(diag_e) at Wed Apr 14 15:46:05 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

> 1/exp(4.01)
[1] 0.0181334
> 1/exp(3.50)
[1] 0.03019738
```
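As a sanity check on why the first fit landed near half the true rates: when the adoption labels are drawn independently of the waiting times, every cat still contributes its full waiting time to the likelihood, but only about half of them count as events. The censored exponential MLE is (number of events) / (total time at risk), so it ends up near lambda/2, consistent with the 0.0097 and 0.0143 above. A quick sketch of that arithmetic (in Python just to keep it self-contained; the same calculation works in R):

```python
import random

random.seed(2021)

lam = 0.02   # true adoption rate for black cats
n = 400

times = [random.expovariate(lam) for _ in range(n)]
# random "adopted" labels, independent of the waiting times,
# mimicking the original rbinom(..., 1, 0.5) censoring
adopted = [random.random() < 0.5 for _ in range(n)]

events = sum(adopted)           # cats counted as adopted
total_time = sum(times)         # every cat contributes its full time
lam_hat = events / total_time   # exponential MLE under censoring

print(round(lam_hat, 4))        # close to lam / 2 = 0.01, not 0.02
```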

I tried to delete this post because the problem was not related to the model specification or the Stan code, but it seems only an admin can do that.
