Subject: Fitting and Predicting Influenza Data with SEEIIR Model in Stan

Hi everyone,

I’m currently working on fitting a SEEIIR model to influenza data and making predictions for future time points. I want to ensure that my implementation of the generated quantities block for prediction aligns with the fit-and-predict methodology recommended by the Stan manual. Below is my code.


functions {
  real[] seeiir(real t, real[] y, real[] theta,
                real[] x_r, int[] x_i) {
             
      real mu = x_r[1];
      real epsilon = x_r[2];
      real lambda = x_r[3];
      real alpha_max = x_r[4]; 
      int N = x_i[1];

      real alpha_min = theta[1];
      real t_max = theta[2];
      real beta_0 = theta[3];
      real e0 = theta[4];
      real i0 = theta[5];
      real r0 = theta[6];
      
      real S = y[1];
      real E1 = y[2];
      real E2 = y[3];
      real I1 = y[4];
      real I2 = y[5];
      real R = y[6];
      
      // seasonally forced transmission rate
      real beta_t = beta_0 * 0.5
          * ((1 - alpha_min / alpha_max) * sin((2 * pi() / 365) * (t - t_max) + pi() / 2)
             + 1 + alpha_min / alpha_max);

      real dS_dt = -beta_t * (I1 + I2) * S / N + lambda * R;
      real dE1_dt = beta_t * (I1 + I2) * S / N - 2 * epsilon * E1;
      real dE2_dt = 2 * epsilon * (E1 - E2);
      real dI1_dt = 2 * (epsilon * E2 - mu * I1);
      real dI2_dt = 2 * mu * (I1 - I2);
      real dR_dt = 2 * mu * I2 - lambda * R;
      
      return {dS_dt, dE1_dt, dE2_dt, dI1_dt, dI2_dt, dR_dt};
  }
}
//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
data {
  int<lower=1> n_days;
  int<lower=1> n_train_days;
  int<lower=1> n_test_days;
  real t0;
  real ts_train[n_train_days];
  real ts_test[n_test_days];
  real mu;
  real epsilon;
  real lambda;
  real alpha_max;
  int N;
  int cases[n_train_days/7];
} // checked
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
transformed data {
  real x_r[4] = {mu, epsilon, lambda, alpha_max};
  int x_i[1] = {N};
} // checked: these are the constant (data) inputs
///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
parameters {
  // upper bounds match the uniform priors in the model block, so the
  // constrained support agrees with the prior
  real<lower=0, upper=1> alpha_min;
  real<lower=1, upper=365> t_max;
  real<lower=0, upper=1> beta_0;
  real<lower=0, upper=N> e0;         // shared initial value for E1 and E2
  real<lower=0, upper=N> i0;         // shared initial value for I1 and I2
  real<lower=0, upper=N> r0;
  real<lower=0> phi_inv;
  real<lower=0, upper=1> p_reported; // proportion of infected (symptomatic) people reported
} // checked
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
transformed parameters{
  real y_train[n_train_days, 6];
  real incidence_train[n_train_days];
  real aggregated_incidence_train[n_train_days / 7]; // New aggregated incidence array
  real phi = 1. / phi_inv;

  real theta[6] = {alpha_min, t_max, beta_0, e0, i0, r0}; // Define all the theta parameters


  // initial state: S absorbs whatever is left after seeding E1, E2, I1, I2, and R
  y_train = integrate_ode_rk45(seeiir, {N - (2 * i0 + 2 * e0 + r0), e0, e0, i0, i0, r0},
                               t0, ts_train, theta, x_r, x_i);
  
  for (i in 1:n_train_days) {
    incidence_train[i] = 2 * epsilon * y_train[i, 3] * p_reported; // reported incidence: flow from E2 into I1, thinned by p_reported
  }

  
  // Aggregate incidence for every seven days
  for (j in 1:(n_train_days / 7)) {
    int start_index = (j - 1) * 7 + 1;
    int end_index = j * 7;
    aggregated_incidence_train[j] = sum(incidence_train[start_index:end_index]);
  }
}
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
model {
  // priors
  alpha_min ~ uniform(0, 1);
  t_max ~ uniform(1, 365);
  beta_0 ~ uniform(0, 1);
  
  e0 ~ uniform(0, N);
  i0 ~ uniform(0, N);
  r0 ~ uniform(0, N);
  phi_inv ~ exponential(5);
  p_reported ~ uniform(0,1);
  
  // sampling distribution
  cases[1:(n_train_days / 7)] ~ neg_binomial_2(aggregated_incidence_train, phi);
} // checked
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
generated quantities {

  real y_test[n_test_days, 6];
  real incidence_test[n_test_days];
  real aggregated_incidence_test[n_test_days / 7];
  real aggregated_incidence[n_days/7];
  int pred_cases[n_days / 7]; // predicted weekly counts (integer draws)

  // continue the ODE forward from the last fitted training state
  y_test = integrate_ode_rk45(seeiir, y_train[n_train_days], ts_train[n_train_days],
                              ts_test, theta, x_r, x_i);

  for (i in 1:n_test_days) {
    incidence_test[i] = 2 * epsilon * y_test[i, 3] * p_reported;
  }

  for (j in 1:(n_test_days / 7)) {
    int start_index = (j - 1) * 7 + 1;
    int end_index = j * 7;
    aggregated_incidence_test[j] = sum(incidence_test[start_index:end_index]);
  }
  
  // concatenate training and test aggregated incidence into one array
  for (i in 1:size(aggregated_incidence_train)) {
    aggregated_incidence[i] = aggregated_incidence_train[i];
  }
  for (j in 1:size(aggregated_incidence_test)) {
    aggregated_incidence[size(aggregated_incidence_train) + j] = aggregated_incidence_test[j];
  }
  pred_cases = neg_binomial_2_rng(aggregated_incidence, phi);
  
} 


Questions

  1. Does this approach correctly use the generated quantities block to predict future cases based on posterior samples from y_train?
  2. Is there a better way to handle y_test predictions for higher accuracy?

Thank you for your input and suggestions!

[edit: escaped code]

Hi, @Raha_ms_sci:

@charlesm93 is our expert in these models. But I’ll give it a go.

First, just an editorial comment. It helps immensely if you indent programs consistently (no tabs, as who knows how they'll be rendered) and put spaces around operators like * and -.

For question (1), yes, that looks like the right use of integrate_ode_rk45 to predict behavior going forward from the last y_train state. I find the name confusing, though, because y_train isn't given from the outside: it's the expected value of y computed by the ODE in the transformed parameters block.

What you are missing is the uncertainty in y_test. There is usually measurement and modeling error included: you take the ODE to generate expectations, then generate the observed data with noise around those expectations. You want to do that with your predictions, too. I tried to explain why in this section of our User's Guide:

https://mc-stan.org/docs/stan-users-guide/posterior-prediction.html#sampling-from-the-posterior-predictive-distribution

where I distinguish the so-called ‘aleatoric’ uncertainty (from having a noisy process) from the ‘sampling’ uncertainty (from having a finite training data set).

Maybe that’s what you’re doing with the incidence_test? I’m just used to seeing it as further sampling from the sampling distribution (the thing that generates data randomly given values of the parameters).
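Concretely, the pattern from that section looks something like this (a minimal sketch; expected_cases, expected_cases_new, and n_weeks are placeholder names standing in for whatever your deterministic model computes, not identifiers from your program):

model {
  // fit: observation noise around the deterministic expectation
  cases ~ neg_binomial_2(expected_cases, phi);
}
generated quantities {
  // predict: draw new data with the same observation noise
  int cases_rep[n_weeks] = neg_binomial_2_rng(expected_cases_new, phi);
}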

Dear @Bob_Carpenter,

Thank you very much for your reply—I really appreciate it! I still have some questions about my prediction setup, though. I’m new to Stan, so I apologize in advance if my questions seem basic.

To clarify a bit about my code: I'm working with a flu dataset that reports weekly incidence (the number of reported cases). My goal is to fit a double-compartment model (SEEIIR) to this data up to a certain week and then make predictions for the following 4 weeks. I intend to do this repeatedly: first, we wait until around week 10 to have enough data, fit the SEEIIR model to the first 6 data points, and predict the following 4 weeks. Starting in week 11, I add a new data point to the training set each week and again predict the next 4 weeks. Since the data is reported weekly, I need to align my simulation with this reporting interval by aggregating the incidence values for each week. I've done this by calculating aggregated weekly incidence values in my model.

Given the transmission parameters and initial conditions in the SEEIIR model, each compartment, including the flow from the E2 compartment into the I1 compartment, has a unique solution over time. This flow is exactly what we aim to capture; I denote it I_ODE(t).

To connect this model-derived solution to the observed data, I consider I_obs(t), the number of infected people recorded at each time point. This observed count is a noisy estimate of the actual number of infected individuals. To model it, we use a count distribution, the negative binomial, which lets us set I_ODE(t) as the expected value while accounting for over-dispersion via the parameter phi: I_obs(t) ~ NegBin(I_ODE(t), phi).
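Under Stan's neg_binomial_2 parameterization, this choice implies

$$\mathbb{E}[I_{\mathrm{obs}}(t)] = I_{\mathrm{ODE}}(t), \qquad \mathrm{Var}[I_{\mathrm{obs}}(t)] = I_{\mathrm{ODE}}(t) + \frac{I_{\mathrm{ODE}}(t)^2}{\phi},$$

so smaller values of phi mean more over-dispersion, and the Poisson is recovered as phi goes to infinity.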

As shown here, I add noise to the deterministic solution of the ODE when sampling from the posterior distributions.

I know that in Stan, once model fitting (sampling) is complete, predictions are typically made in the generated quantities block, which is also where posterior predictive checks are performed using the posterior draws obtained during inference.

Using these posterior samples, we can then calculate the quantities of interest, like the weekly number of cases, within the generated quantities block. This is done with

pred_cases = neg_binomial_2_rng(aggregated_incidence, phi), which applies a negative binomial distribution to incorporate over-dispersion.

Finally, we can plot pred_cases along with its credible interval to evaluate the quality of our inference. This approach makes sense to me.

My problem is somewhat different because I have unseen data (test set). I need to first generate y_ODE for the unseen data, which I have accomplished using the following line:

y_test = integrate_ode_rk45(seeiir, y_train[n_train_days], ts_train[n_train_days], ts_test, theta, x_r, x_i)

This call uses the last point of y_train as the initial condition. Now a question arises for me:

  • Since y_test is a deterministic solution from our simulation, we need to add noise to it. How can I do this? Can I pass it directly to the neg_binomial_2 function, similar to what I did in the model block? (I think not, because that is the model's likelihood.) Also, in the end, when I want to plot the aggregated incidence for both the training and test sets along with their credible intervals, I am passing the whole incidence array to the neg_binomial_2_rng distribution.

I hope my explanation is clear, and I apologize for the lengthy message. I would appreciate your insights on whether this is a correct approach when dealing with unseen data.

I find the lengthy messages easier to answer. My usual first response is to ask for more details!

This is exactly what we were doing with UK Covid data during the pandemic, though we only had the audacity to predict 2 weeks ahead because Covid had a lot of external driving features. We didn't use SIR-type models, as they weren't responsive enough to changing conditions and the baseline versions couldn't model heterogeneity of contact or spatial smoothing without breaking down into a bajillion compartments. They're presumably a better fit for flu data if it's more spatially and temporally homogeneous, but I would strongly urge you to run posterior predictive checks and some time-series cross-validation against simpler models.

Refitting each week as new data arrives is the right thing to do, but Stan doesn't give you a good way to do it other than completely refitting the model. You can use the estimates from the previous week's fit as a warm start (for step size, mass matrix, and the initial draw).

The negative binomial is good for capturing unexplained variation, but it can be hard to fit. I've sometimes found it easier to add a random effect explicitly (the negative binomial is just a Poisson with a gamma-distributed random effect), even though it blows up the dimensionality of the problem.
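To make that equivalence concrete, here's a sketch of the explicit random-effect version of your weekly likelihood (u and n_weeks are hypothetical names; n_weeks would be n_train_days / 7 in your program):

parameters {
  vector<lower=0>[n_weeks] u;  // one multiplicative random effect per week
}
model {
  u ~ gamma(phi, phi);         // E[u[j]] = 1, Var[u[j]] = 1 / phi
  for (j in 1:n_weeks)
    cases[j] ~ poisson(aggregated_incidence_train[j] * u[j]);
}

Marginalizing out u recovers exactly neg_binomial_2(aggregated_incidence_train[j], phi); the explicit version just adds one parameter per observation.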

The general answer is in this section of the User’s Guide (which has come up three times in my last two days of posting):

https://mc-stan.org/docs/stan-users-guide/posterior-prediction.html

The answer is that you add the modeled noise (in your case, the negative binomial) in the generated quantities block. The ODE predicts the expected value, then the negative binomial gives you the noise. It's just like a GLM but without the linearity: we are just using an ODE for the regression.
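Applied to your program, that's just the RNG version of your likelihood evaluated at the test-set expectations, in generated quantities (a sketch reusing arrays you already compute there; pred_cases_test is a hypothetical name):

generated quantities {
  // the ODE supplies the expected weekly counts; the RNG adds the observation noise
  int pred_cases_test[n_test_days / 7]
      = neg_binomial_2_rng(aggregated_incidence_test, phi);
}

This matches what your existing pred_cases = neg_binomial_2_rng(aggregated_incidence, phi) line already does for the concatenated training-plus-test array.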

In case you haven’t seen it, this is a nice tutorial on SIR-type models in Stan:

Dear @Bob_Carpenter,

Thanks so much for answering my questions and addressing my concerns. I really appreciate your help!