Different chains, different time

Andrea_Bizzotto · February 3, 2022, 12:19pm

Hi to everybody.
I would like to ask you an opinion, or maybe if know the answer it is better.
I am fitting a 5 age-groups model of 7 differential equations for each group, that is 35 odes system.
I have also 25 parameters to be estimated.
I am using integrate_ode_rk45 to solve the system, with: rel_tol = 1e-4, abs_tol = 1e-4, max_num_steps = 5000 .
I am running 4 chains: 500 iterations (250 warm-up, 250 sampling).
Chain 1 took 0.275 seconds for gradient evaluation, Chain 2 took 0.349 seconds, Chain 3 took 0.471 seconds and Chain 4 took 0.419 seconds.

I have observed that after 40 minutes we are in this situation:
Chain 1: 110 / 500 [ 22%] (Warmup)
Chain 2: 106 / 500 [ 21%] (Warmup)
Chain 3: 40 / 500 [ 8%] (Warmup)
Chain 4: 71 / 500 [ 14%] (Warmup)

I would like to understand why there is this large difference between Chain 3, Chain 4, Chain 2, Chain 1.
I want to stess that it is not true it depends only from the gradien evaluation, it happened this scenario with a latecomer chain, which took less time for the gradient evaluation respect to the others.

In my opinion it means that the Hamiltonian function is located in a bad region, however I don’t have idea to help it.

Moreover I set the initial values with a function, but these initial values are given with complete randomness. Could be this the problem?

Each suggestions or opinion could be very useful and interesting.

Thank you!

andrjohns · February 3, 2022, 1:30pm

You’re right that this seems to be the sampler getting ‘stuck’ in a tough region. This isn’t an issue with the initial values, as any initial values that don’t experience the issue would just be missing that region of the parameter space.

In this instance, your best bet is to either try using more informative priors (i.e., smaller variances) and/or reparameterising your model. There’s more information on Model Reparameterisation in this section of the User’s Guide: 24.7 Reparameterization | Stan User’s Guide

Andrea_Bizzotto · February 9, 2022, 1:46pm

Yeah, thank you @andrjohns !
I understand the sense of reparametrization, however my model is a SIR-type model to fit Italian Covid-19 data. All the parameters have a biological meaning. Moreover there is no way to reparametrize the various parameters.
Moreover using more informative priors help the code but affect the purpose of the model: understanding the time dynamic of the epidemic. For instance I can bound the values of a parameters; however I will find a distribution concentrated in one of the two bounds.
Moreover since the large dimensions of the ODE system, Rstan is able to find non biological meaning values of some parameters. The fit is good however I don’t have any biological informations.
Another problem encountered is due to the scaling difference between the various time-series fitted:

day incidence POSITIVE
day incidence HOSPITALIZED
day incidence INTENSIVE CARE UNIT
day incidence DEATHS
For instance is the max of positive has order of 610^2, instead the max of intensive care 610.
This means that the likelihood due to the intensive care weighs much less than the likelihood of the positives. So the fit of the positive is perfect at the expense of the fit of the intensive care.
One solution could be weights the likelihoods, what do you think ?

I write some stuff of my experience in order to ease who would like to try a calibration of a similar model.

LucC · February 9, 2022, 2:55pm

From one infectious disease modeller to another: there may still be ways to reparameterise your SIR more efficiently, despite the goals and restrictions that you describe. Also, even if an effective reparameterisation would obfuscate the biological parameters that you are interested in, you can always derive those biological parameters post-hoc and interpret them (e.g., to check that the priors on the reparameterised version of the model make sense).

If you want to consider and discuss this here, it would be useful if you could explain a little more about exactly which (groups of) parameters are now defined in the parameters block and what your biological parameters of interest are.

[EDIT]: reading suggestion for examples on reparameterisations in very simple S(E)IR models: Sensitivity of Model-Based Epidemiological Parameter Estimation to Model Assumptions

maxbiostat · February 9, 2022, 5:20pm

This.

Andrea_Bizzotto · February 9, 2022, 5:59pm

Thank you for the reply @maxbiostat.
It is my first time of calibration of a model, so it is not trivial to understand exactly what I have to do.
I read the chapter adviced by @andrjohns : 24.7 Reparameterization | Stan User’s Guide
But I can’t understand how to reparametrize my model based on ordinary differential equations.
Maybe I didn’t understand what reparametrization means.
For instance the first thing that comes to my mind is:
I have omega_i, one for each of the five groups.
Instead to estimate each omega_i, I estimate omega_1 and I define omega_j = omega_1 * f_j, and I estimate also f_j ( j =! i ).
Another reparametrization could be using the inverse of the parameters that has a value close to zero.
Instead of using d_i as dispersion parameter I can estimate k_i and define d_i = 1/k_ i.
Probably I didn’t got the sense of the need of reparametrization.

What should happen with a right reparametrization, in terms of time running and in terms of result of the fit?

LucC · February 9, 2022, 7:12pm

Thanks Andrea,

First some remarks about your schematic representation:

Individuals of stratum i who are leaving compartment I^S (bright green) have three options to go: D(eath), H(ospital), and R(ecovered). The rates \gamma^X at which individuals in I^S leave for compartment X are however multiplied by fractions (1-\omega)\phi (leaving to D), \omega (leaving to H), and (1-\omega)(1-\phi) (leaving to R). All these fraction neatly sum up to one. However, from your table I also learn that these rates \gamma^X are really three separate values of which one is to be estimated. I believe that this is incorrectly implemented (in the schematic anyway).

The approach with the fractions only makes sense if you specify a single value of \gamma at which individuals leave I^S (regardless of destination). After all, the destination is determined by the fractions (1-\omega)\phi, \omega, and (1-\omega)(1-\phi)). Alternatively, you should leave out the fractions altogether and specify the three \gamma^X parameters (and estimating one or all of them). In this case, the way people leaving I^S are split across D, H, and R is determined by (i.e., is proportional to) the relative differences between the three rates. You are now mixing up the two approaches. I cannot directly see how this messes up your estimations, but you are effectively defining three free parameters (\phi, \omega, and \gamma^D), where there is only 1 free parameter (\gamma^D).

I now also see that the fixed rate parameter \gamma_R is shared with another transition (from I^A to R_i), which forces parameters \phi and \omega to absorb and account any discrepancies between the total exit rates for compartments I^A and I_S, which is not what they are meant for. \phi and \omega are probabilities conditional on leaving I_S.

Exactly the same problem occurs with all the flows exiting compartment H, where you are multiplying path-specific rates \delta^X with a probability to actually enter that path (functions of \alpha and \psi, which are again free parameters).

I think this model is misspecified, or at least, doesn’t do what you think it’s supposed to do. In the context of this model, the priors that you set on these probabilty and rate parameters are not actually representing anything that you think you know about these parameters, but instead represent parameter values in some warped version of the parameter space.

Work either with one exit rate per compartment + probabilities to divide exiting people across destinations, or one rate per destination and calculate the probabilities post-hoc. The number of free parameters would be the same in each case.

[EDITS: typo an formatting corrections only]

LucC · February 9, 2022, 7:30pm

With regards of your formulation of the force of infection (FOI) \lambda_i acting on stratum i:

Just looking at the math, this \rho_j parameter can only represent the average reduction of infectivity of symptomatic cases in stratum j as the result of some sort of intervention or strategy (e.g., isolation). However, your table calls \rho_i a detection rate:

I check your model schematic and there is no \rho_i in it, so I think this is really an incorrect interpretation, or at least, wrong labelling of the parameter. This is not a rate!!!

Furthermore, the FOI equation seems off, and for the following reason:

I assumed that \beta_i was some overall transmission rate and that it was actually \beta_{ij} (i.e. a matrix of transmission rates between the age groups or some decomposition thereof), but it’s not. It’s a vector of age-specific “susceptibility”. First off, if this is susceptibility, where is the transmission rate parameter in this model. Second, if you are really interested in susceptibility, it is extremely unlikely that you will or can learn it from population-level data on infection numbers by age. This type of knowledge is more usually based on very detailed analyses of external data (e.g., transmission chain data).

As such, these \beta_i parameters are effectively the transmission rates in the model (assuming equal susceptibility across strata), and this implies the assumption that an individual in a particular stratum (age group) i is equally exposed to contacts from all other strata j \neq i, and that the extent of this exposure only depends on the stratum membership of the person in stratum i. This does not make sense. It imposes an (unintended) asymmetry in how individuals in stratum i infect those in stratum j and the other way around.

You need to

choose susceptibility based on external data (if at all, or assume equal susceptibility)
define a contact matrix \beta in which entry \beta_{ij} represents the transmission rate from stratum i to stratum j. You can impose all sorts of structure on this matrix, e.g., \beta_{ij} = \beta_{ji} to impose symmetrical transmission potential between strata. Or find an external source of data on contact matrices (e.g., based on the well-known POLYMOD study), define a separate parameter \beta to represent the “transmission rate per contact” to effectively scale transmission in the model.

Alternatively, start with a simpler model where you assume that the age groups mix homogeneously, such that you have only a single \beta parameter.

Andrea_Bizzotto · February 9, 2022, 8:48pm

Hi @LucC thank you for the detailed reply, I really appreciated it. I try to explain the model, it is difficult to explain everything in the chat; I hope that the dynamic will be understandable.
Susceptibles to Exposed with the FOI.
Exposed to Presymptomatics with the inverse of latent period.
Presymptomatics to Infectios compartment (Infected symptomatic (IS) and Infected Asymptomatics (IA)) with the rate eta where 1/\eta is (incubation - latent) period. Clearly \xi_i is the probability to develop symptoms (which depends by age).
For now all the parameters (except FOI) are fixed.
Now we start with the transmissions that seem more complicated.
An infected asymptomic can only heal (probability to be recovered = 1) and it takes (infectious - incubation) time = 1 / \gamma_R, which is fixed: it does not affect the behaviour of the other parameter. One can say that due to the flux between IS and R in which there is \gamma^R(1-\omega_i)(1-\phi_i), the fixed parameter could affect \omega_i, however the removed/recovered (R_i) are not fitted, so there is no “power of estimation”.
Now I try to explain what I would intend.
An infected symptomatic can leave its compartment in 3 directions (Deaths, Hospitalized, Removed).
It goes to Hospital with the probability \omega_i, the time that it takes is 1/\gamma_i^H, \gamma_i^H is fixed. If this individual is not fated to be hospitalized it goes to another compartment with probability (1-\omega_i) and it means this factor to the others two rates.
So if one individual that doesn’t go to hospital it can only die or to be removed. It dies with probability \phi_i which is fixed. To die ot takes 1/\gamma^D_i and to be recovered 1/\gamma^R.
Probably \gamma_i^D is affected from \omega_i and \psi_i and \alpha_i.

In the FOI there is the factor (1-\rho_i), where \rho_i\cdot I^S_i is the proportion of the Detected Infected. The formula that I have written is wrong it misses the contact rate and \beta_i=\beta \forall i.

LucC · February 9, 2022, 9:00pm

Hi Andrea, I understand all of what you explain as your intentions. However, the model specification is invalid. You should not combine probabilities (\phi, \omega, etc.) of leaving a compartment in a particular direction with exit rates that are specific to a particular transition out of said compartment. There are too many free parameters and the priors you think you are imposing on the probabilities and the rates actually don’t mean what you think they mean. Like I said:

So in the case of individuals compartment I^S, only define the three \gamma_X parameters for each transition to R, H, or R. OR, define the probabilities \phi and \omega to exit to particular compartments and multiply these with one overall exit rate (which is conceptually equal to \gamma^D + \gamma^H + \gamma^R). The latter approach is actually what you are (correctly) using for people leaving compartment P!!!

betanalpha · February 14, 2022, 3:49am

There are too issues being discussed here and I just wanted to try to clarify them.

Variation in time to compute Markov chains indicates that the individual Markov chains behaved differently.

This could be due to the different Markov chains all converge to the posterior distribution but are isolated to different parts of the model configuration space, for example with some parts exhibiting nastier geometry than others that requires more gradient evaluation per Markov transition.

Alternatively it can be due to the Markov chains before converging. If some Markov chains are initialized in particularly problematic regions then the adaptation of the step size and inverse metric elements for those samplers might be forced to extremely conservative values that result in very slow exploration. These issues are not uncommon when ordinary differential equations are poorly-posed near at the initializations and the numerical solvers might return imprecise gradient evaluations.

One way to determine what might be going on is to investigate the adaptation configuration in each Markov chain. The demonstrations in Identity Crisis have some examples for RStan. If the adaptation configurations are similar then it might problems within the typical set of the posterior distribution, but if they’re different then it might be problems during warmup.

Either way, one particular common cause of problematic behavior can be incorrect model specifications that result in identifiability/degeneracy issues such as those that are discussed towards the end of the thread. Fixing those specification issues would resolve many of the problems, but exactly how would depend on the nature of the problems.

Andrea_Bizzotto · March 2, 2022, 5:21pm

Thank you for your answer @betanalpha, I appreciate it.
I used this time to study the theory behind the method, which is not mandatory to use Rstan, but it becomes if have to work with it more in deep. Your videos on youtube have been usuful !
Now, I understand what you say and I am in agree with you.
I reparametrized the model, in such way to avoid much correlation between the parameters.
Moreover we weight the likelihood of some low data, to say to Rstan:“those point are important, also if they are low”, it works, we have a good fit for all the time series !
It takes about 1 day for the execution, and the results are good. I think that we can’t do better then this in time-execution terms.

Topic		Replies	Views
Stuck at warmup Modeling	11	3585	December 3, 2017
ODE with random effects running slow Modeling	0	289	November 9, 2022
Solving ODE in Stan: different computational times declared by R and Stan Modeling	6	686	December 11, 2019
Chains sampling completly different area of parameter space and never converge Modeling	8	2475	October 8, 2018
Is having 4 chains with length of 5000 the same as having 8 chains with length of 2500? Modeling	8	1574	January 28, 2022

Different chains, different time

Related topics