Rounding issues in ODE-solver

Dear Stan-community,

I am modelling a simple SEIR model in stan.

\frac{dS}{dt} = - \beta S I \\ \frac{dE}{dt} = \beta S I - \tau E \\ \frac{dI}{dt} = \tau E - \gamma I \\ \frac{dR}{dt} = \gamma I

Due to limitations in the accuracy of the solver ( I am using ode_rk45_tol, but had similar issues with other solvers ) I sometimes get slightly negative values for the number of individuals in the I comparmtent. This leads to errors in my program, because I use these values as the rate in a poisson process. I though this could be solved by log transforming the model. Then, the number of infected individuals would be the exponential of the output of the model and always be positive. I was hoping this would make my model more efficient.

Log transformation of the SEIR model:

Step 1, divide by variables:
\frac{1}{S} \frac{dS}{dt} = - \beta S I \frac{1}{S} \\ \frac{1}{E} \frac{dE}{dt} = (\beta S I - \tau E) \frac{1}{E} \\ \frac{1}{I} \frac{dI}{dt} = (\tau E - \gamma I) \frac{1}{I} \\ \frac{1}{R} \frac{dR}{dt} = \gamma I \frac{1}{R} \\

Step 2, simplify:
\frac{d logS}{dt} = - \beta I \\ \frac{d logE}{dt} = \frac{\beta S I}{E} - \tau \\ \frac{d logI}{dt} = \frac{\tau E}{I} - \gamma \\ \frac{d logR}{dt} = \frac{\gamma I}{R}

When I analyse this model and the original SEIR model in r, I find that both are equivalent. I also included a model in the code that is scaled by the population size and models the whole population istead of the fraction of infected individuals. The initial conditions cannot be zero, because I cannot logtransform zero. Thus, I choose very small values, this does not really influence the output of the models.


# Load library
library(deSolve)

# Set parameters and time sequence
p <- c(beta = 0.08, gamma = 2/7, tau = 5/7, contact = 77, popsize = 10000)
times      <- seq(0, 45, by = 0.01)

# Set initial coditions
s_norm     <- c(S = (1-1/10000), E=0, I = 1/10000, R = 0)
s_log     <- c(logS = log(1-1/10000), logE=-50, logI = log(1/10000), logR = -50)
s_pop     <- c(logS = log(10000-1), logE=log(0.00001), logI = log(1), logR = log(0.00001))

# if initial values for E and R are too high, the result is not the same
s_pop_stan <- c(logS = 9.21024,logE=-1,logI=0,logR=-1)

# Traditional way to code SEIR model 
gfull <- function(time, y, params) {
  g <- with(as.list(c(y,params)),
            c( -beta*contact*S*I,
               beta*contact*S*I - tau*E,
               tau*E - gamma*I,
               gamma *I )
  )
  return(list(g))
}

# Log transformed SEIR model 
gfull_log <- function(time, y, params) {
  g <- with(as.list(c(y,params)),
            c(  - beta*contact*(exp(logI)),
                beta*contact*(exp(logI)*(exp(logS)))/(exp(logE)) - tau,
                tau*(exp(logE))/(exp(logI)) - gamma,
                gamma*(exp(logI))/(exp(logR)) )
  )
  return(list(g))
}

# Log transformed and scaled with population size 
gpop <- function(time, y, params) {
  g <- with(as.list(c(y,params)),
            c( - beta*contact*(exp(logI))/popsize,
               (beta*contact*(exp(logI)*(exp(logS)))/(exp(logE)))/popsize - tau,
               tau*(exp(logE))/(exp(logI)) - gamma,
               gamma*(exp(logI))/(exp(logR)) )
  )
  return(list(g))
}


# Run all versions of the model 
result_log <- ode(y = s_log, times = times, func = gfull_log, parms = p)
result_norm <- ode(y = s_norm, times = times, func = gfull, parms = p)
result_pop <- ode(y = s_pop, times = times, func = gpop, parms = p)

# compare results for I compartment 
par(mfrow=c(1,1))
plot(times, exp(result_log[,3]), col="blue", xlab="time in weeks", ylab="Number of infected individuals", type="l", main="R deSolve model")
lines(times, result_norm[,3], col="red",type="l")
lines(times, exp(result_pop[,3])/10000,col="violet")
legend("topright", legend=c("Traditional model", "Logtransformed model", "Logtransformed population scaled"), col=c("red", "blue","violet"),lty=1)

However, when I run the exact same models in Stan, I get nan values for the ODE output if the initial values are smaller than -11 for the logtransformed model and smaller than log(0.01) for the version that is scaled by the population size.

functions {
  
  // ODE for basic SEIR model 
  vector gfull(real t, vector y, data real tau, data real gamma, real beta, 
                      data real contact, data int num_comp) {

    // define compartments
    vector[num_comp] dydt;

    // define ODEs
    dydt[1] = - beta * contact * y[1] * y[3];
    dydt[2] =   beta * contact * y[1] * y[3] - tau * y[2];
    dydt[3] =  tau * y[2] - gamma * y[3];
    dydt[4] =  gamma * y[3];

    return dydt;
  }

  // ODE for logtransformed SEIR
  vector gfull_logtrans(real t, vector y, real I0, data real tau, data real gamma, real beta, 
                        data real contact, data int num_comp) {

    // define compartments
    vector[num_comp] dydt;
  
    // define ODEs
    dydt[1] = - beta*contact*(exp(y[3]));
    dydt[2] =  (beta*contact*exp(y[3])*exp(y[1]))/exp(y[2]) - tau;
    dydt[3] =  tau*exp(y[2])/exp(y[3]) - gamma;
    dydt[4] =  gamma*exp(y[3])/exp(y[4]);
  
    return dydt;
  }

  // ODE for basic SEIR model logtransformed and scaled with populationsize
  vector gfull_pop(real t, vector y, real I0, data real tau, data real gamma, real beta,
                        data real contact, data int num_comp, data int popsize) {

    // define compartments
    vector[num_comp] dydt;
  
    dydt[1] = -beta*contact*(exp(y[3]))/popsize;
    dydt[2] =  (beta*contact*exp(y[3])*exp(y[1]))/(exp(y[2])*popsize) - tau;
    dydt[3] =  tau*(exp(y[2]))/(exp(y[3])) - gamma;
    dydt[4] =  gamma*(exp(y[3]))/(exp(y[4]));
  
    return dydt;
  }

  
}

// load data objects
data {
 // Load parameters from data
 // data objects that are generic independent of the model specifications
  int num_t;
  int popsize;
  int contact;
  array[num_t] real ts;
  
  // true values for simulated data
  real beta_sim;
  real I0_raw_sim;
  
  // fixed quantities
  real incubation_period;
  real infectious_period;
  
  // control parameters
  real rtol;
  real atol;
  int max_num_steps;
 
}

transformed data {

  // generic transformed data block
  int num_comp = 4; // total compartments
  int num_prev = 3; // index of the I compartment
  real t0 = 0.;
  
  // fixed parameters
  real tau = 1./incubation_period;
  real gamma = 1./infectious_period;
  
  // scale initial seed
  real I0_sim = I0_raw_sim/popsize;

  print("initial values traditional model:");
  print(to_vector(append_array( {(1-I0_sim), 0, I0_sim}, rep_array(0, num_comp-3))));

  // get reference simulation
  array[num_t] vector[num_comp] y_sim_traditional = ode_rk45_tol(
      gfull,
      to_vector(append_array( {(1-I0_sim), 0, I0_sim}, rep_array(0, num_comp-3))), // initial values work well 
      t0,                               // initial time = 0
      ts,                               // evaluation times
      rtol, atol, max_num_steps,        // tolerances 
      tau, gamma, beta_sim, contact,              // data
      num_comp                              // metadata
      );
      
  print("initial values log transformed:");
  print(to_vector(append_array( {log(1-I0_sim), -11, log(I0_sim)}, rep_array(-11, num_comp-3))));
      
  array[num_t] vector[num_comp] y_sim_log = ode_rk45_tol(
      gfull_logtrans,
      to_vector(append_array( {log(1-I0_sim), -11, log(I0_sim)}, rep_array(-11, num_comp-3))), // values more negative then -10 lead to Nan result for y
      t0,                               // initial time = 0
      ts,                               // evaluation times
      rtol, atol, max_num_steps,        // tolerances
      I0_sim, tau, gamma, beta_sim, contact,    // data
      num_comp                // metadata
    );
      
  print("initial values log transformed population scaled:");
  print(to_vector(append_array( {log(popsize-I0_sim*popsize), log(0.1), log(I0_sim*popsize)}, rep_array(log(0.1), num_comp-3))));
  
    array[num_t] vector[num_comp] y_sim_poplog = ode_rk45_tol(
      gfull_pop,
      to_vector(append_array( {log(popsize-I0_sim*popsize), log(0.1), log(I0_sim*popsize)}, rep_array(log(0.1), num_comp-3))), // values smaller than log(1) lead to Nan result for y 
      t0,                               // initial time = 0
      ts,                               // evaluation times
      rtol, atol, max_num_steps,        // tolerances
      I0_sim, tau, gamma, beta_sim, contact,    // data
      num_comp, popsize                 // metadata
    );

}

parameters {
                    
}

transformed parameters {
 
}

generated quantities {

  // posterior predictive check
  array[num_t] vector[num_comp] y_traditional = y_sim_traditional;
  array[num_t] vector[num_comp] y_log = y_sim_log;
  array[num_t] vector[num_comp] y_poplog = y_sim_poplog;
  
}

With these initial conditions, the output is not exactly the same for the three models:

Could anyone explain me why there is an issue with the logtransformed model in Stan but not in R? Is there something about how rounding is done in Stan that causes this issue? When I print the ODE output for every step, I see that the E and R compartment reach very high values in only a few time steps and result in inf values and nan in a next step. Is this due to the division by exp(E) and exp(R)?

I would be very keen on getting a better understanding of Stan and how the ODE solver in Stan is different from the one in R. Any pointers would be greatly appreciated.

Best regards,
Judith

This is the code I used to run and analyse the stan models:

# Load stan file that contains these functions with same parameters as used above. 
# define parameters for the spline 
fixed_pars = list(incubation_period = (5/7),
                  infectious_period=(2/7),
                  contact=7*11)

simul_pars = list(I0_raw_sim=1,
                  beta_sim=0.08,
                  ts=1:45,
                  num_t=45, 
                  popsize=10000)

data_list = c(fixed_pars,simul_pars)

data_list = within(data_list,{
  rtol = 1e-5;
  atol = 1e-5;
  max_num_steps = 1000})

# compile
mod = cmdstan_model(stan_file = "/various_tests/MRE_logtransform.stan")

samples_prior = mod$sample(
  data = data_list,
  seed = 123,
  chains = 1, parallel_chains = 1,
  iter_warmup = 1, iter_sampling = 1, 
  fixed_param=TRUE)

y_traditional <- samples_prior$summary(c("y_traditional"))
y_log <- samples_prior$summary(c("y_log"))
y_poplog <- samples_prior$summary(c("y_poplog"))
y_anthony <- samples_prior$summary(c("y_anthony"))

y_traditional_forPlot <- y_traditional[91:135,]
y_traditional_forPlot$t <- 1:45

y_log_forPlot <- y_log[91:135,]
y_log_forPlot$t <- 1:45

y_poplog_forPlot <- y_poplog[91:135,]
y_poplog_forPlot$t <- 1:45

y_anthony_forPlot <- y_anthony[91:135,]
y_anthony_forPlot$t <- 1:45

# compare results of the three models 
ggplot() + geom_line(aes(x=t, y=mean, colour="Traditional model"), data = y_traditional_forPlot) +
  geom_line(aes(x=t, y= exp(mean), colour="Logtransformed model" ), data = y_log_forPlot) +
  geom_line(aes(x=t, y= exp(mean)/10000, colour="Logtransformed population scaled"), data = y_poplog_forPlot) +
  geom_line(aes(x=t, y= exp(mean), colour="Anthony special"), data = y_anthony_forPlot) +
  xlab("time (weeks)")+ylab("Number of infected individuals") + labs(title="Results stan model") +
  scale_colour_manual(values=c("Traditional model"="red", "Logtransformed model"="blue", 
                               "Logtransformed population scaled"="violet", "Anthony special"="green")) +
  theme(legend.position = "bottom")

1 Like

UPDATE:

I just had a chat with a computer scientist that said it is likely that the ODE solver in Stan has 32-bit precision and the one in R 64-bit, causing the observed discrepancy.

So, I guess this is something that is not easily changed and I will have to find a work-around instead.

I’m quite certain that this is not the case. AFAIK, some python ML frameworks use 32 bit numbers, but AFAIK for Stan everything is double, i.e. 64-bit.

I think the reason for the discrepancy must be, that the ODE adaptation methods in Stan and R differ, and one of them (probably Stan’s) doesn’t adapt the step size “correctly” for the log-transformed ODE. You should try passing a very low atol/reltol to Stan and see what happens.

That being said, I’m surprised that it should lead to such a “big” difference. Also, FYI, I also tried log-transforming ODEs once (for the same reason as you), and I think it was generally slightly less efficient and sometimes things went a bit horribly wrong. But if that ensures positivity of the states, that might be worth the trade-off in the context of HMC/NUTS. I didn’t run any experiments where I actually sampled from anything, so I’d be interested to hear how this works out for you.

1 Like

Thanks so much for your reply @Niko!

Decreasing the the atol and reltol from 1e-5 to 1e-8 unfortunately did not change the output.

With the “wrong”, relatively high initial values for E and R, the logtranformed ODE system is sligthly faster than the traditional model (88 minutes against 91), but the difference is slow. I am a bit in doubt if the log transmission is worth the inaccuracy introduced by the initial value issue.

I might opt out for artificially bouning the output of the ODE to zero instead. I will keep you updated if I find a more elegant solution.

Huh. How strange!

Rather than exponentiating each y, would you have more stability by working on the log-scale?

For example:

  // ODE for logtransformed SEIR
  vector gfull_logtrans(real t, vector y, real I0, data real tau, data real gamma, real beta, 
                        data real contact, data int num_comp) {

    // define compartments
    vector[num_comp] dydt;

    real log_beta = log(beta);
    real log_contact = log(contact);
    real log_tau = log(tau);
    real log_gamma = log(gamma);

    dydt[1] = -exp(log_beta + log_contact + y[3]);
    dydt[2] =  exp((log_beta + log_contact + y[3] + y[1]) - y[2]) - tau;
    dydt[3] =  exp(log_tau + y[2] - y[3]) - gamma;
    dydt[4] =  exp(log_gamma + y[3] - y[4]);
  
  
    return dydt;
  }
2 Likes

Thank you @andrjohns for your great suggestion.

I was very hopeful about your suggestion, because it removes the slightly awkward division by a very small number. However, the issue persisted and I cannot start with initial values lower than -11 for your rewritten version of the model neither.

The standard “trick” I used in the past is to run the ODE integrator with an absolute error of 1E-6 and add to the output of the integrator 1E-4 (to all outputs). This assumes that 1E-4 is basically like “0” in your problem. With this approach the solution will always stay positive and things work practically just fine… it’s a hack, but it gets things done usually… obviously you need to adjust the above constants to your problem…

1 Like

Your model’s initial condition are not really matched. For example for the first model, you have

to_vector(append_array( {(1-I0_sim), 0, I0_sim}, rep_array(0, num_comp-3))),

as init condition. For the second model I used

log(to_vector(append_array( {(1-I0_sim), 1e-20, I0_sim}, rep_array(1e-20, num_comp-3))))

Despite that you mentioned only e^-11 work model can clearly run without issue with

data_list = within(data_list,{
  rtol = 1e-12;
  atol = 1e-12;
  max_num_steps = 100000})

and I’m getting matched results:

> y_traditional_forPlot$mean - exp(y_log_forPlot$mean)
 [1]  1.754682e-10  7.586444e-11  6.712295e-10  1.998728e-09  4.190562e-10
 [6]  1.187747e-09 -9.890192e-09  7.376365e-09 -4.386663e-09  6.027541e-08
[11]  9.395195e-08  8.826598e-08 -3.192914e-09  3.769290e-08 -2.005211e-08
[16]  2.060244e-08 -3.902472e-09 -1.983877e-08 -3.847148e-09 -4.374427e-09
[21] -3.197394e-09  1.566560e-10  5.822743e-10 -2.498680e-10 -2.001047e-10
[26]  1.701293e-10  7.118731e-10 -5.896439e-11 -1.113873e-10  1.864346e-10
[31] -1.541373e-10 -7.941334e-11  4.548871e-11  2.159835e-11  8.275981e-12
[36]  2.056254e-12  1.310957e-13 -8.303139e-13 -8.653749e-13  1.145616e-12
[41]  6.858054e-13  5.120895e-13  3.425366e-13  2.185677e-13  1.372995e-13

I suspect the 3rd model has similar inconsistency.

Thank you @wds15!

Thank you @yizhang for taking the time to run my code.

I wish that this was indeed the issue. However, the issue still persists. The model runs without problem, but only produces the correct values for the “traditional” model that was not log transformed. In your graph only the red line shows, the others produces nan values.

I forgot to mention that you need to switch from rk45 to bdf solver (ode_bdf_tol). Sorry about that.

1 Like

@yizhang you are right! Thanks so much, this is very helpful!