Dear Stan-community,
I have converging issues for part of the initial value space of my simple SIR model. I fit a model using data simulated with the same model, to test model performance before working with real infectious disease data.
The stan model is defined below.
functions {
vector seir_0d(real t, vector y, real beta, real I0, data real tau, data real gamma, data real contact, data array[] int DIM) {
// array with all necessary dimensions
int num_comp = DIM[1];
// define compartments
vector[num_comp] dydt;
// define initial values
vector[num_comp] init = to_vector(append_array({1-I0,0.0, I0}, rep_array(0.0, num_comp-3)));
// define time-varying force of infection (for now not time-varying)
real finf_t = beta * contact * (y[3]+init[3]);
// define ODEs
dydt[1] = -finf_t * (y[1]+init[1]);
dydt[2] = finf_t * (y[1]+init[1]) - tau * (y[2]+init[2]);
dydt[3] = tau * (y[2]+init[2]) - gamma * (y[3]+init[3]);
dydt[4] = gamma * (y[3]+init[3]);
return dydt;
}
array[] real get_prevalence(array[] vector y, array[] int DIM, real popsize, real atol, real I0) {
// dimensions
int num_comp = DIM[1];
int num_t = DIM[2];
int num_prev = DIM[3];
// extract, rescale and format prevalence
array[num_t] real prev = to_array_1d( ( to_vector(y[,num_prev]) + I0 + 2*atol ) * popsize );
return prev;
}
}
data {
int num_t;
array[num_t] real ts;
real popsize;
// true values for simulated data
real beta_sim;
real I0_raw_sim;
// priors
array[2] real p_I0; // expected initial seed (mean, sd)
array[2] real p_beta; // expected beta (alpha, beta)
// fixed quantities
real incubation_period;
real infectious_period;
real contact;
// control parameters
real rtol;
real atol;
int max_num_steps;
int inference;
//data to fit
array[num_t] int I_t_sim;
}
transformed data {
// structure
int num_comp = 4; // total compartments
int num_prev = 3; // index of the I compartment
array[3] int DIM = {num_comp,num_t,num_prev};
real t0 = 0.;
// fixed parameters
real tau = 1./incubation_period;
real gamma = 1./infectious_period;
// scale initial seed
real I0_sim = I0_raw_sim/popsize;
}
parameters {
real<lower=0,upper=1> beta; // transmission probability per contact
real<lower=0> I0_raw; // initial seed (in number of individuals)
}
transformed parameters {
real I0 = I0_raw/popsize;
array[num_t] vector[num_comp] y =
ode_rk45_tol(
seir_0d,
rep_vector(0.0,num_comp), // initial values = 0 (handled within the ODE)
t0, // initial time = 0
ts, // evaluation times
rtol, atol, max_num_steps, // tolerances
beta, I0, // parameters
tau, gamma, contact, // data
DIM // metadata
);
array[num_t] real<lower=0> prevalence = get_prevalence(y, DIM, popsize, atol, I0);
}
model {
// priors
beta ~ beta(p_beta[1],p_beta[2]);
I0_raw ~ gamma(p_I0[1]^2/p_I0[2]^2,p_I0[1]/p_I0[2]^2); // reparameterization of gamma from mean/sd to alpha/beta
// likelihood
if(inference==1){
target += poisson_lpmf(I_t_sim | prevalence);
}
}
generated quantities {
// log-likelihood for LOO
vector[num_t] log_lik;
for (t in 1:num_t)
log_lik[t] = poisson_lpmf(I_t_sim[t] | prevalence[t]);
// posterior predictive check
array[num_t] int I_t_simulated = I_t_sim;
array[num_t] int I_t_predicted = poisson_rng(prevalence);
}
The following code in R reproduces the issue. This code sometimes works, when the initial values by change are in the right range, but often ends with diverencence and a large rhat value.
# Reproducible code for stan forum
library(cmdstanr)
# simulated data
sim_data_mod1 <- c(5,1,1,3,1,4,4,3,2,6,8,17,8,16,13,24,47,63,55,62,60,97,82,107,87,
77,78,59,53,39,28,30,17,14,13,10,7,7,4,2,1,4,0,0,3)
# set parameters
fixed = list(incubation_period=5/7,
infectious_period=2/7,
contact=11*7)
prior = list(p_I0 = c(1,1),
p_beta = c(1,10))
simul = list(I0_raw_sim=1,
beta_sim=0.06,
popsize=10000,
num_t=45)
data_list = c(fixed,prior,simul)
data_list$ts = 1:simul$num_t
data_list$inference = 1
data_list = within(data_list,{
rtol = 1e-6;
atol = 1e-6;
max_num_steps = 1000;
I_t_sim = sim_data_mod1})
# load stan model
mod = cmdstan_model(stan_file = paste0("stan/mod1_rk45.stan"))
# fit simulated parameters from simulated data
samples_posterior = mod$sample(
data = data_list,
chains = 4, parallel_chains = 4,
iter_warmup = 200, iter_sampling = 1000)
To identify the range of initial values that causes issues, I ran a set of combinations of initial values. The figure below shows if the rhat is below (pink) or above(blue) 1.05 for the combination of the initial value of I0 (initial number of infected individuals) and beta (transmission rate).
The issue only appears for some combination of the parameters used to simulate the data. The convergences is better when I for instance only simulate the first half of the epidemic. However, I would feel more comfortable applying this model to real data, when I know it can infer any set of parameters.
I do not understand why my model cannot infer the simulated parameters, independent of the initial values. For this simple model, the likelihood should be relatively smooth. Could any of you give me a hint of how to handle this issue?
I would greatly appreciate any help.
Cheers,
Judith
What I have tried so far:
- A stronger prior on small values of beta or a uniform prior. (Even though the initial value does not depend on the prior).
- A traditional way of including the initial state of the ODE model.
- Increasing adapt_delta.
Additional information:
I am running cmdstan version 2.30.1 and stan version 2.21.0. The initial conditions of the ODE model are included in the definition of the ODE itself to reduce the number of parameters.