# Speeding up a stochastic epidemic model with ODEs

Hi everyone,

I am working on a method for fitting stochastic epidemic models to partially observed infection counts (incidence data) and, being relatively new to Stan, was hoping that some of the more experienced users might be able to suggest ways to optimize the implementation for speed. A bit of background about the problem and then the proposed method: Stochastic epidemic models are classic epidemiological tools that are used to model the time-evolution of an epidemic as individuals transition through different disease states. Often, the data will consist of some fraction of new infections accumulated in each inter-observation interval. This type of data, referred to as incidence data, is commonly encountered in disease surveillance settings. In large populations, it is common practice to represent the latent epidemic process using a deterministic system of ordinary differential equations. However, these deterministic representations ignore the stochastic aspects of the epidemic and can lead to incorrect inferences.

We are working on adapting a method called the linear noise approximation for approximate inference for stochastic epidemic models fit to incidence data. At its essence, this method approximates the transition density for the Markov jump process representation of a stochastic epidemic model with a Gaussian state space model where the moments of the transition densities are obtained by solving systems of non-linear, non-autonomous ordinary differential equations. I have included my code below for using this to fit an SIR model (a canonical stochastic epidemic model for a single, homogeneously mixing, closed population) to negative-binomial incidence counts. This model takes on the order of hours to run, but ultimately we are targeting more complex model dynamics to reflect spatial structure and population strata. In such models the since of the latent state space, represented in the code below by N_raw, grows polynomially in the number of model compartments. Thus, run times easily extend to something on the order of weeks.

In addition to the long run times, there are also a few numerical problems that are indicated in comments in the Stan code. One is that the numerically stable versions of some log differences of exponentials cause the code to crash. Another is that changing the integration method from bdf to rk45 causes issues. In more complex models, initializing the system has also proven to be quite problematic.

Any suggestions for how to optimize the code would be greatly appreciated!

Thank you!
Jon

lna_SIR_example.R (10.3 KB)

Just so everyone else doesn’t have to download and open, I’m pasting in the model with spacing set to 2:

functions {
// System of ODEs to be integrated to compute the LNA moments - state is an array with the current state
real[] lna_fcn(real t, real[] state, real[] lna_params, real[] rdummy, int[] idummy) {

// compute hazards and jacobian
vector haz;
vector phi;
matrix[2,2] jac;
real newstate;

// extract the current drift and diffusion
vector Z      = to_vector(state[1:2]);
vector neg_Z  = -Z;
vector neg_2Z = -2*Z;
vector exp_Z  = exp(Z);
matrix[2,2] V0   = to_matrix(state[3:6], 2, 2);

// compute compartment volumes
real S = lna_params - exp_Z; // S0 - N_{SI}
real I = lna_params + exp_Z - exp_Z;

// compute some quantities that will be reused
vector ito_diff = exp(neg_Z) - 0.5 * exp(neg_2Z); // exp(log(exp(-Z) - 0.5*exp(-2*Z)))
vector d_ito_diff = exp(neg_2Z) - exp(neg_Z); // exp(log(exp(-2*Z) - exp(-Z)))

// numerically stable versions of the above, but again broken. not sure why.
// real I = lna_params + exp(log_diff_exp(Z, Z)); // I0 + N_{SI} - N_{IR}

// vector ito_diff;
// vector d_ito_diff;
// ito_diff = exp(log_diff_exp(neg_Z, log(0.5) + neg_2Z));
// ito_diff = exp(log_diff_exp(neg_Z, log(0.5) + neg_2Z));
//
// d_ito_diff = exp(log_diff_exp(neg_2Z, neg_Z));
// d_ito_diff = exp(log_diff_exp(neg_2Z, neg_Z));

// compute the hazards
phi = lna_params*S*I;
phi = lna_params*I;
haz = ito_diff .* phi;

// compute the jacobian
jac[1,1] = lna_params * (d_ito_diff*S*I - ito_diff*exp_Z*I + ito_diff*S*exp_Z);
jac[1,2] = -lna_params * ito_diff * S * exp_Z;
jac[2,1] = lna_params * ito_diff * exp_Z;
jac[2,2] = lna_params * (d_ito_diff*I - ito_diff*exp_Z);

// concatenate and return
newstate[1:2] = to_array_1d(haz);
newstate[3:6] = to_array_1d(V0*jac' + jac*V0);
newstate = newstate + exp(neg_2Z)*haz;
newstate = newstate + exp(neg_2Z)*haz;

return newstate;
}

// integrate over one time interval
real[] lna_step(real t_l, real[] t_r, real[] state, real[] lna_params, real[] rdummy, int[] idummy) {

// changing the integration method to rk45 results in an error.
return integrate_ode_bdf(lna_fcn, state, t_l, t_r, lna_params, rdummy, idummy, 1e-6, 1e-6, 100000);
}

// function to map the standard normal draws to the log-LNA increments and return incidence (increments in N_SI)
vector get_lna(vector N_raw, real[] theta, vector X0, matrix stoich, vector net_effect, real[] times, real[] rdummy, int[] idummy) {
// number of times at which incidence is recorded
int n_times = size(times) - 1;

// containers current state
vector log_LNA;                      // log-LNA increment: (log(N_SI), log(N_IR))
vector nat_LNA;                      // LNA increment on the natural scale
vector c_incid = rep_vector(0.0, 2); // cumulative incidence
vector SIR_init = X0 - net_effect;   // initial value for SIR compartment volumes
vector SIR_cur = SIR_init;           // SIR compartment volumes
vector[n_times] incidence;              // increments for incidence (natural scale)
real lna_params;                     // (beta, mu, S_t, I_t, R_t)

// containers for LNA moments
real zero_6 = rep_array(0.0, 6); // for zero-ing out the state vector
real state  = zero_6;            // LNA moments - (mu, Sigma) - systems of ODEs

vector mu;         // LNA mean
matrix[2,2] Sig_chol; // Cholesky decomposition of the covariance

// initialize lna parameters
lna_params[1:2] = theta; // parameters

// map the standard normal draws to the log-LNA increments
for(k in 1:n_times) {

// set the comaprtment volumes
lna_params[3:5] = to_array_1d(SIR_cur);

// LNA transition density
state = lna_step(times[k],
times[(k+1):(k+1)],
state,
lna_params,
rdummy,
idummy);

// extract moments
mu        = to_vector(state[1:2]);
Sig_chol  = cholesky_decompose(to_matrix(state[3:6], 2, 2)); // compute the cholesky decomp.

// map N_raw values to the sampled LNA value
log_LNA = mu + Sig_chol * N_raw[(2*k - 1):(2*k)];

// save new state
nat_LNA      = exp(log_LNA); // LNA increment on the natural scale
incidence[k] = nat_LNA;   // save incidence

// update the cumulative incidence and set the new initial state
state   = zero_6;                       // reset the LNA moments
c_incid = c_incid + nat_LNA;            // update cumulative incidence
SIR_cur = SIR_init + stoich * c_incid;  // update the current compartment counts
lna_params[3:5] = to_array_1d(SIR_cur); // set the new initial values
}

return incidence;
}
}
data {
real popsize;
int n_times;          // number of census interval endpoints
real times[n_times];  // census interval endpoint times
int cases[n_times-1]; // observed incidence
matrix[3,2] stoich;   // stoichiometry matrix
vector net_effect; // net change in comparment volumes for one of each event
vector<lower=0> X0; // initial compartment volumes
}
transformed data{
real log_popsize = log(popsize);
real rdummy; // no real data arguments to LNA ODEs
int idummy;  // no integer data arguments to LNA ODEs
}
parameters {
/*
Raw model parameters:
log_R0: log(beta*N/mu)
log_mu: log recovery rate
rho:    sampling probability
phi:    negative binomial overdispersion
N_raw:  multivariate standard normal draws that drive the stochasticity
*/
real log_R0;
real log_mu;
real logit_rho;
real log_phi;

vector[2*(n_times-1)] N_raw; // initial state is fixed for now
}
transformed parameters {
real<lower=0> theta; // theta = (beta, mu)
real<lower=0,upper=1> rho;
real<lower=0> phi;
vector<lower=0>[n_times-1] lna_incidence; // expected LNA incidence

// transform parameters
theta = {exp(log_R0 + log_mu - log_popsize), exp(log_mu)};
rho   = inv_logit(logit_rho);
phi   = exp(log_phi);
lna_incidence = get_lna(N_raw, theta, X0, stoich, net_effect, times, rdummy, idummy); // map normal draws to LNA incidence path
}
model {
// parameter sampling statements
log_R0    ~ normal(0.45, 1);
log_mu    ~ normal(1.2, 1);
logit_rho ~ normal(0.1,1);
log_phi   ~ normal(2.5, 0.5);

// implies the LNA paths are multivariate normal with moments given by ODE solns
N_raw ~ normal(0,1);

// emission probabilities - parameterized by mean and overdispersion
cases ~ neg_binomial_2(rho * lna_incidence, phi);
}


For a start, it’ll be easier to read if you use compound declare-define:

 real<lower=0> theta = {exp(log_R0 + log_mu - log_popsize),
exp(log_mu)};

real<lower=0, upper=1> rho = inv_logit(logit_rho);

real<lower=0> phi = exp(log_phi);

vector<lower=0>[n_times - 1] lna_incidence
= get_lna(N_raw, theta, X0, stoich, net_effect,
times, rdummy, idummy);
}


As I was mentioning in an off-list message before @fintzij posted this publicly it’s best to keep parameters on their unconstrained scale as long as possible. For example, we have a log-parameterized neg_binomial_2, so this can skip exponentiating log_phi and just replace

cases ~ neg_binomial_2(rho * lna_incidence, phi);


with

cases ~ neg_binomial_2_log(rho * lna_incidence, log_phi);


I was urging @fintzij to post here, hoping for more replies by someone who knew these models. I can’t quite tease apart the structure from the code.

If there’s a latent state-space model somewhere, then you can often make things behave better by putting that on a non-centered paramterization (as in a continuous time series).

When you say “crash”, what happens? If Stan’s crashing the R session, there’s some kind of bug. The most an exception in the numerical code should do (like passing a negative number to a log) is to cause the curretn path to be rejected or cause initialization to fail. If there’s literally something crashing, that’s a bug.

If initialization is an issue, you can cut down the random initialization range. Arithmetic instability due to causes such as underflow, overflow, and rounding, tends to be exacerbated out in the tails.

One last comment for now.

You can also check a couple things. First, are multiple chains converging to the same adaptation parameters (mass matrix and step size or metric and integration time; I use both because @betanalpha is lobbying to change the interface language from the former to the latter)? If not, warming up longer will probably help. If the overall time is mostly in warmup, it can also help to provide some better initial values for one or more parameters.

Second, if you look at pairs of parameters, are any of them highly correlated or resulting in banana-like shapes? That can indicate only weak additive or multiplicative identifiability, and both have the effect of requiring a very low step size in order to make progress and reparameterization can help immensely.

Finally, if there are rejections due to divergences (can never remember how to extract them in R, but there’s a vignette), then steps need to be taken to improve arihmetic stability, either directly in terms of manipulating expressions or indirectly by reparameterizing.

Regarding the last response:

I ran four chains which seem to be converging to the same adaptation parameters, with 0.1, 0.5, and 0.9 quantiles for the step sizes and acceptance rates differing by less than 0.01. The average and maximum tree depths, number of leapfrog steps, and acceptance rates were also essentially the same across the chains. Run times for both warmup and sampling were stable, differing by no more than 25 minutes in warmup and 6 minutes during sampling, and ranging from 10.1 to 10.6 hours in total. The potential scale reduction factor was 1 for all parameters and latent variables. However, there were a fair number of divergences, around 100 per chain out of MCMC runs of 2,000 iterations.

There are also strong correlations among the parameters. Definitely some banana-like shapes, particularly between the negative binomial mean parameter, rho, and the basic reproductive number, which is associated with the dynamics of the latent process. I’ll look into ways to reparameterize here, but I’m not sure if there exists a natural reparameterization (as parameterizing in terms of log(R0) and log(mu) is a natural reparameterization of the per-contact infectivity rate and recovery rate in these sorts of models).

Regarding the first two replies:

I’m generally running into issues evaluating the log probability at the initial value. While this doesn’t occur when I use the “non-stable” versions of sums of exponentials, such as those used in computing the ito_diff and d_ito_diff variables, replacing those computations with the arithmetically stable versions that are commented out leads to the error. This also occurs when I fix the initial values at the true values that were used in generating the data. With that said, it would be odd if there were numerical issues here due to overflow or underflow since the latent process on its natural scale should take values between 0 and 10,000. This is confirmed when I modify the code to simulate latent trajectories in a generated quantities block.

I tried using the neg_binomial_2_log parameterization for the emission probabilities and returning the log of lna_incidence, but this leads to the same error in evaluating the log probability at the initial values. On this point, I believe that neg_binomial_2_log parameterizes the emission probabilities in terms of the log mean (here log(rho * lna_incidence)) and overdispersion, rather than the mean and log overdispersion, no? In this model, we need to exponentiate the lna_incidence at the end of each interval anyway in order to compute the initial conditions for the next time interval. However, we could just return the log of the incidence rather than the incidence and use (log(rho)+log_lna_incidence = log(rho*exp(log_lna_incidence))) as the mean in the distribution parameterized by the log mean.

As far as the structure of the model, the latent process is already implemented using a non-centered parameterization, represented here by N_raw. We obtain the centered draws by solving the systems of ODEs for the mean and covariance of the LNA transition density over each inter-observation interval, and then shifting and scaling in the usual way. That we are able to run simple models, such as the one implemented above, in a reasonable time frame (10-12 hours), is likely due to the use of this parameterization. However, run times very quickly get to be on the order of weeks once we start introducing additional model compartments, e.g., to reflect population strata. You had previously suggested (offline) that using a non-stiff solver might shorten the time per iteration. Do you know whether run times for the backward differentiation formula solver increase non-linearly with the size of the system? If so, perhaps the greatest speedup might result from figuring out why switching to the rk45 solver leads to errors in evaluating the initial log probability. I’m not quite sure how to go about debugging this though. Is there anything you could suggest on this point?

Thank you!

I have to investigate what’s wrong with log_diff_exp—it’s causing issues in other situations.

Sorry—you’re right about neg_binomial_2_log—that’s the expectation, not the overdispersion on the log scale.

Yes, the system of coupled equations in the ODE solver grows as state-size times the parameter size. It may also get harder to solve.

As far as RK45, it’s much less stable than the stiff solver in the sense that if it runs into a stiff region of parameter space, it’ll run a bunch of iterations then fail when it hits the upper limit. In some cases you need to use the BDF solver. The other alternative I suggested you might want to try is the matrix exponential—I’m not sure how it’s accuracy is going to work — @charlesm93 should know as he wrote it.

The banana shapes are particularly bad because the curvature changes orientation, so it winds up forcing the step size down and increasing the number of steps. Same with other correlations.

Hi,

The matrix exponential will not work for nonlinear equations. rk45 fails for stiff equations. You can play with the integrator’s tuning parameters to find the right balance between speed and accuracy.

There are some more advanced techniques but I’m not familiar enough with the model to say whether we can apply them. One idea would be to do within chain parallelization. @wds15 built a prototype he uses for pharmacometrics modeling. In such models, you can take advantage of the hierarchical structure of the model, due for instance to interpatient variability and solve the ODEs for each patient in parallel.

Hi!

A few points:

• Following your code I got that the initial state is always a vector of 0’s. However, the way its coded, Stan regards the inital vector as parameters (anything non-int declared inside functions is always a parameter from Stans perspective). This increases the ODE system size to 72 equations which can be reduced to only 36 (this is the ODE system soize with the sensitivity states). You would have to replace the lna_step call with
		       times[(k+1):(k+1)],
rep_array(0.0, 6),
lna_params,
rdummy,
idummy);

• I recommend you to decrease the max_steps argument to 1E3 or so. This will cause hopefully only during the warmup phase a few ODE integration errors, but nothing more.

• Also, do adapt the absolute error tolerance to reasonable values. That is, you may possibly increase this to larger values, but never go above 1E-3. This depends on your problem details.

• The relative error tolerance could even be set to smaller values down to 1E-7 or 1E-9 - I have seen cases where going to better precision lead to an improved sampling; despite the heavier ODE computation cost. Have an eye on the stepsize of the sampler to check if what you do works out (you should opt for large stepsizes).

• Should you be able to reformualte such that you do not have issues with divergences, then you can try to lower the target acceptance rate to 0.7 or. This can increase the stepsize and lead to faster sampling in problems I came across.

• It would help a lot to have a math description in order to say more

• If you can switch to the RK45 you may get a lot more speed if the system is non-stiff

• The prototype which I have coded can help you in multiple ways

1. it will generate the Jacobian of your ODE RHS system for you which gave me a 2x speedup in smaller systems. It would be interesting to see what we get in this case.
1. If you could write down the ODE solves in a way that there are indepndent calls, then this can be taken advantage of and put onto multiple CPUs. However, the way things stand now this does not look possible.

I hope these points help with the ODE integrator. However, figuring out how your problem can be reparametrized such that you have roughly independent normals would be best.

Sebastian

1 Like

Hi Charles,

I don’t see an obvious way to parallelize the system of ODEs since we are essentially in an n=1 scenario, which is often the case when analyzing epidemic data (fortunately, we do not replicate outbreaks). In more complicated systems we would also generally interested in capturing the interactions between population strata, so we generally will not be able to parallelize in more complex models either. However, it might be possible to analyze different data from different epochs in parallel (e.g., multiple seasons of flu data), so this is certainly worth keeping in mind.

Thanks

Hi Sebastian,

A couple of quick questions and then some mathematical description of the model:

• I created a few variables in the beginning of the lna_fcn that is integrated and in the get_lna function of avoiding duplicate computation (e.g., only computing exp(Z) once and then reusing it). Is it possible that in creating these variables, I am inadvertently bogging down Stan by causing it to generate additional sensitivity equations? Should I focus on eliminating variables, even if it means duplicating computation, if it will reduce the number of variables that go into the gradient calculations?

• Could you explain a bit further which sensitivity equations get generated and how they get used (or don’t get used if there are merely intermediate variables that are generated)?

Some mathematical description (apologies for the math not rendering, I’m not quite sure how to fix that but I’ve attached a pdf with the writeup below):

Stochastic epidemic models (SEMs) describe the transmission dynamics of infectious diseases. These models are often represented as density dependent Markov jump processes (MJPs), which have systems of ordinary differential equations as functional limits (as the population size tends to infinity). The canonical example of a SEM is the susceptible-infected-recovered (SIR) model for a closed and homogeneously mixing population. This model, presented above, is so called because an epidemic is represented in terms of the disease histories of individuals as they stochastically transition between susceptible, infected, and recovered states (model compartments). In this model, per-contact infectivity rate is given by \beta, so with S susceptibles and I infecteds the rate at which a new infection occurs is \beta S I. Similarly, each infected (and hence infectious) individual recovers with rate \mu, and thus the rate at which a recovery occurs is given by \mu I. A loose explanation for what we mean by density dependent in characterizing the MJP is that the dynamics of the process are driven by the fraction of the population in each compartment (i.e., we can equivalently represent the model in terms of concentrations and maintain the dynamics).

Let X=(S,I,R) be the vector of SIR compartment counts, N=(N_{SI},N_{IR} be the counting process for the cumulative numbers of infection and recovery events, and \phi(t,X)=(\beta S I, \mu I) be the vector of infection and recovery rates for the SIR model. It will be necessary in working with incidence data to reparameterize the compartment counts, and hence the vector of rates, \phi, in terms of the counting process with respect to the initial compartment counts X_0=(S_0, I_0, R_0). Let $$A = \left(\begin{array}{ccc} -1 & 1 & 0 \ 0 & -1 & 1 \end{array}\right)$$ and note that we can express the compartment concentrations at time t in terms of the initial compartment counts and the cumulative numbers of infections and recoveries using the relation, X(t)=X(0) + A^\prime N(t). Thus, we can write the rates, \phi, as
$$\phi(t,N) = \left(\begin{array}{c}\beta (S_0 - N_{SI})(I_0 + N_{SI} - N_{IR}) \ \mu(I_0 + N_{SI} - N_{IR})\end{array}\right).$$

In the infinite population limit, the counting process for the numbers of infections and recoveries evolves according to the deterministic path obtained by integrating the following system of ODEs:
$$\frac{\mathrm{d}N(t)}{\mathrm{d}t} = \phi(t, N),\hspace{0.2in} N_0 = (0,0),\ X_0 = (S_0, I_0, R_0).$$

Clearly, the dynamics of the limiting ODE do not reflect the stochastic aspects of the time-evolution of the MJP. However, in the large population limit, it is possible to approximate the transition density of the MJP with the following Ito diffusion:
$$\mathrm{d}N(t) = \phi(t,N)\mathrm{d}t + \sqrt{\Phi(t,N)}\mathrm{d}W(t),\ s.t., N(0)=0,\ X_0=(S_0, I_0, R_0).$$
where \Phi(t,N)=\mathrm{diag}(\phi(t,N)), and W(t) is a 2-dimensional (as we have two elementary transition events, infections and recoveries) standard Brownian motion.

Since the counting process N is positive and monotonically increasing, it will be advantageous to work with the log-transformed SDE for the log-transformed process, Z = (Z_{SI},Z_{IR}) = (\log(N_{SI}),\log(N_{IR})). However, we should proceed with caution as there is an obvious problem with the initial condition for N_0, as \log(0)=-\infty. We can resolve this by setting N_0=(1,1) and “borrowing” events from X_0 in the following way: Note that X(t)=X(0) + A^\prime N(t) = X(0) + A^\prime(N(t)-\mathbf{1} + \mathbf{1}) = (X(0)-A^\prime\mathbf{1}) + A^\prime(N(t)+\mathbf{1}) for all t. Thus, an equivalent set of initial conditions for the SDE is given by N_0=(1,1),\ X_0=(S_0,I_0,R_0)-A^\prime\mathbf{1}\equiv\widetilde{X}_0, as long as \widetilde{X}_0 \geq 0. In our context, this is not an unreasonable condition as it implies, for example, that there are not more infection events than there are initial susceptibles.

By Ito’s lemma the SDE for the log-transformed process, Z, subject to initial conditions Z_0=(0,0),\ X_0=\widetilde{X}_0, is given by
$$\mathrm{d}Z(t) = \left[\mathrm{diag}\left(\exp(-Z(t))\right) - \frac{1}{2}\mathrm{diag}\left(\exp(-2Z(t))\right)\right]\phi(t,\exp(Z(t))) + \mathrm{diag}\left(\exp(-Z(t))\right)\sqrt{\Phi(t,\exp(Z(t)))}\mathrm{d}W(t).$$

The LNA is derived by Taylor expanding the drift and diffusion terms in the above SDE about the deterministic limit of the MJP. The resulting approximation decomposes the Ito SDE into a non-linear deterministic drift and a linear SDE for the fluctuations about the drift. This linear SDE has an analytic solution that is given by a multivariate Gaussian, the moments of which are obtained by solving a coupled, non-autonomous system of ODEs. The LNA transition density over the interval [t_{\ell-1}, t_\ell] thus takes the form, $$Z(t_\ell)|Z(t_{\ell-1}) = \boldsymbol{\eta}(t_\ell) + \mathbf{m}\left (Z(t_{\ell-1}) - \boldsymbol{\eta}(t_{\ell-1})\right ) + \boldsymbol{\epsilon}\ell,\hspace{0.25cm} \boldsymbol{\epsilon}\ell \sim N(0,\boldsymbol{\Sigma}(t_\ell)),$$
where

• \boldsymbol{\eta}(t_\ell) is the solution to \frac{\mathrm{d}\boldsymbol{\eta}(t)}{\mathrm{d}t} = \left [\mathrm{diag}\left (\mathrm{e}^{-\boldsymbol{\eta}(t)}\right ) - \frac{1}{2}\mathrm{diag}\left (\mathrm{e}^{-2\boldsymbol{\eta}(t)}\right )\right]\boldsymbol{\Phi}\left (\mathrm{e}^{\boldsymbol{\eta}(t)}\right ) \equiv\xi(\boldsymbol{\eta}(t)).
• \mathbf{m}(Z(t_{\ell-1}) - \boldsymbol{\eta}(t_{\ell-1})) is the solution to \frac{\mathrm{d} \mathbf{m}(t)}{\mathrm{d}t} = \mathbf{F}(t)\mathbf{m}(t), where \mathbf{F}(t) = \left (\frac{\partial\xi(Z(t))}{\partial\mathbf{Z}(t)}\right )\Big|_{\boldsymbol{\eta}(t)}.
• \boldsymbol{\Sigma}(t) is the solution to \frac{\mathrm{d}\boldsymbol{\Sigma}(t)}{\mathrm{d}t} = \mathbf{F}(t)\boldsymbol{\Sigma}(t) + \boldsymbol{\Sigma}(t)\mathbf{F}(t)^\prime + \mathrm{diag}\left (\mathrm{e}^{-\boldsymbol{\eta}(t)}\right )\boldsymbol{\Phi}\left (\boldsymbol{\eta}(t)\right )\mathrm{diag}\left (\mathrm{e}^{-\boldsymbol{\eta}(t)}\right ).
• The above ODEs are subject to the initial conditions: \mathbf{X}(t_{\ell-1}) = \mathbf{x}_0 - A^\prime\mathbf{1} + A^\prime\exp(Z(t_{\ell-1})),\ \boldsymbol{\eta}(t_{\ell-1}) = \mathbf{0},\ \mathbf{m}(t_{\ell-1}) = \mathbf{0}\ \left (\implies \mathbf{m}(t_\ell)=\mathbf{0}\right ), and \boldsymbol{\Sigma}(t_{\ell-1}) = \mathbf{0}.

In practice, we will use a non-centered parameterization for the LNA representation of the latent epidemic process in which we map a vector of independent Gaussian draws, \mathbf{N_{raw}}\overset{i.i.d.}{\sim} N(0,1), deterministically to a latent LNA path by noting that Z(t) \overset{\mathcal{L}}{=}\eta(t_\ell) + \Sigma(t_\ell)N_{raw}(t_\ell). Furthermore, we reset the initial conditions for the compartment volumes to reflect the cumulative numbers of infections and recoveries up to the beginning of each observation interval, and reset the counting process for the infections and recoveries. The LNA moments for the log transformed process for the next interval are then computed, and we obtain positive increments in the numbers of infections and recoveries by exponentiating the log-LNA values that were obtained via the deterministic mapping.

Note that the latent LNA process is Markov. Thus, the observed data likelihood, conditional on a latent LNA path, decomposes into a product of emission probabilities, which in the SIR example in this post is a product of negative binomial probabilities.

LNA_stan_explanation.pdf (195.1 KB)

Hi!

I am in a hurry so a few quick ones for now:

• It’s a good thing to save temporary expressions which are used multiple times. This is what I got from Bob as you cache not only the expression itself, but the gradient as well.

• Stan needs gradients so that NUTS works. That means we not only need the solution of the ODE which is some function f(t), but we also need the gradients of that function f(t) wrt. to all the parameters. Have a look at the arXiv paper on stan’s autodiff where it is explained how that sensitivity system is formed. In a nutshell, if you have N states in your ODE (you have N=6) and you have P parameters, then Stan will have to solve N * (P + 1) ODEs in total. P includes all parameters which means that if your initials are considered as parameters, then this adds into it in addition to the parameters which go into the ODE integrator as “theta” variable.

Thanks for the description, I hope I can make sense of it (later).

Sebastian