The Monster Model | Hierarchical ODE models

So I was actually looking around for something else (scalability/massive parallelism) when I stumbled across this thread: Scalable Bayesian multilevel modeling - #31 by andrewgelman

As I understood the model/paper it describes a combination of a parmacokinetic/toxicokinetic model, a population model and a measurement model, so nothing especially fancy.

I was wondering whether this (or something similar) has made it into a case study yet? Or did difficulties occur with this particular model (which I guess would improve a case study)? In sum, I’m looking for “the” ideal workflow and common pitfalls for hierarchical ODE models.

Searching the internet for “stan monster model” was only semi-successful, as was searching the forums.


@charlesm93 has worked on this model quite a bit as I recall. I would hope that the adjoint solver is helpful here to make progress more quickly, but the model is complex was my impression.

1 Like

This is indeed one of my pending projects with @andrewgelman . It has a lot of moving parts: a stiff ODE that is highly sensitive to parameter values, sparse data, a low number of patients (which makes it it difficult to estimate inter-individual variability), hierarchies across multiple parameters, etc. The workflow is complicated by the fact each model fit takes several hours, even after firing up the cluster and parallelizing within chains.

I fitted simplifications of the model and got reasonable results. Fixing one or two parameters, or the population variance greatly simplifies the problem. The full model returns divergent transitions and I haven’t found the right parameterization. I tried monolithic centering and non-centering, and things in between. I now suspect the problematic geometry may not be only due to the interaction between population and patient parameters. It’s indeed possible to have a funnel between population parameters, in which case no parameterization can help you. Divergences aside, I don’t trust the estimated parameters, because there are somewhat inconsistent with the priors which were very carefully constructed. Truncating priors, as was done in the original paper, limits our ability to reparameterize the model. When using truncation, I find a lot of the probability mass concentrates at the boundary (note that on the unconstrained scale, you actually have infinite volume near the boundary). All that said, you shouldn’t need to truncate the priors, because the truncation occurs at the tail of the prior distributions.

It’s also worth pointing the original paper didn’t use HMC, so they didn’t have access to diagnostics such as divergent transitions. You can also imagine how an accept-reject sampler will interact differently with a truncated distribution…

To sum up, I agree this would make a very good case study and writing it has been my ambition for some time. But it’s a challenging model to fit, with behaviors I don’t fully understand (yet). There exist other examples of hierarchical ODE models, built and used with Stan, which can be considered, if the goal is to write a case study. But yes, I think we have a lot to learn from the Monster model.


Thank you @charlesm93 very much for the elaborate response, this is much more than I was hoping for :)

I was wondering whether we could actually expect the model to “just work”, there was this sentence in the paper that made me a bit suspicious:

after 15,000 iterations, most [\sqrt{\hat{R}}-values] were less than 1.1. The highest value is 1.36, […]

and as you say, they did not use HMC, but “a variant of the Gibbs sampler”, which I think should be much less efficient than HMC, so I don’t quite know what the 15k or 5x50k iterations correspond to, in addition to the missing built-in diagnostics.

The other thing that made me suspicious was that the first mention of someone (you) working on it was mid/end 2019, but I could not find a case study.

I do have several goals. My first and foremost is to learn how to fit models like the Monster model (efficiently). Then, I would like to try out some ideas I had that might help speed things up and that might help debug more quickly at the same time. A model where a single, even preliminary run takes several hours must pose a serious challenge to “the” iterative Bayesian workflow. Of course, all these things get simplified a lot if there already exists some sort of reference solution from which to learn and against which to compare.

I would be very interested in what you have tried so far, and what you have determined works well or not so well.

Thank you again!


Ah, also, @wds15: I thought the per person ODEs were not that horrible? Though, on second thought, I think they do have 16 parameters each?

1 Like

So what is actually even going on per person? Meaning, what is the ODE that (supposedly) models the process?

I understood this as follows, but not everything adds up for me:

  • From exposure to measurement Cair (concentration in ambient air) switches (is piecewise constant). Calveolar ( concentration in alveolar air) is just Cair?
  • Between Calveolar and Cart (concentration in arterial blood) we have instantaneous equilibrium, i.e. dCart/dT=Cart-Cair/Pba=0?
  • Cvenous (concentration in venous blood) is just the weighted combination of F (blood flow fractions per compartment) with C (concentration per compartment), i.e. Cvenous=dot_product(F,C)? This gets measured?
  • Within the compartments we just have the linear ODE dC/dt=(Cart - C/P)*F/V, except for the liver-compartment where we have the additional term Vmax*C/[V*(K+C)]

With this you can almost set everything up, except:

  • What’s going on with the units? Where are they, what are they? t gets measured in what, hours? The figures make it look so, but I don’t think it’s stated explicitly.
  • What’s the function of VPR?
  • The sum of the given volumes is restricted to .873(?), but we don’t have a volume for the fat? Supposedly this gets measured?
  • The concentration in exhaled air is also supposed to be measured. But isn’t this just the venous blood concentration?

Ignoring VPR (and the summation constraints) I can easily fit this model (for simulated data) with (I think) the provided priors and truncated parameters within the order of minute(s), so I strongly suspect I’m missing something crucial:

What’s going on there?


@Funko_Unko Give me your github credential and I’ll share a (private) repository with an attempt at the model. I had to dig through the old code and pick Andrew’s brain to get a grasp on what was going on.


Cool, that would be just @funko-unko, but I’ll also send you a PM.

This probably went as swimmingly as me trying to reconstruct the model from the paper. Makes you appreciate when people “just” provide the Stan file and what Stan must have done for reproducibility.


Okay, so with the approval of @charlesm93 I’ll post the ODE describing the pharmacokinetic model. I’ll tack on a short derivation from first principles for good measure and I’ll tag people that probably could tell me at once if it is absurd: @wds15, @yizhang, @andrewgelman

It’s actually not that wild.

First: In addition to the parameters in Table 1 of the original paper, there are four additional parameters that are measured and assumed to be known “exactly”:

  • The lean body mass [kg],
  • The mass fraction of fat [N/A],
  • The pulmonary volume flow [l/min],
  • The concentration of PERC in the air during exposure [PPM] with 1PPM=409/144 mug/l.

From the lean body mass [kg] and the mass fraction of fat [N/A] we can compute the total volume of fat [l] from the density of fat (.92 g/ml). The body of each person gets modeled as four distinct compartments which are connected with each other via the circulatory system, mixing the blood “coming out” of each compartment, then passing through the lungs and mixing with in/exhaled air, and then again feeding the blood into each compartment. As described in the original paper, section 2.1, the concentration C_s [mg/l] in each compartment follows the ODE


except for the liver compartment where we have the additional mass sink


The latter term is just the Michaelis–Menten ansatz, while the first term derives from the conservation of mass for each compartment. Namely, for the mass in each compartment M_s=C_sV_s [mg] from mass conservation (rate of change in mass = mass flow in - mass flow out) we have


where F_s [l/min] is the blood volume flow per compartment, C_{art} [mg/l] is the PERC concentration in the artertial blood (coming from the lungs) and C_{out,s} [mg/l] is the PERC concentration in the blood exiting the compartment. Assuming equilibrium at the compartment exit we get


which together with the mass balance yields the original ODE.

There are now still three parts missing. We still need

  • an expression for C_{art} [mg/l] (the PERC concentration in the blood exiting the lungs),
  • an expression for C_{venous} [mg/l] (the PERC concentration in the mixed blood leaving the compartments and entering the lungs) and
  • an expression for C_{exhaled} [mg/l] (the PERC concentration in the exhaled air).

The latter two quantities get measured (with lognormal noise).

The simplest part is the expression for C_{venous}. This is just the concentration of the mixed blood flows exiting the compartments, i.e.

C_{venous} = \sum_s C_{out,s}F_s/F_{venous}

with the total blood flow F_{venous} = \sum_s F_s.

The concentration in the exhaled air can be derived from the assumption of equilibrium between “alveolar air […] and arterial blood” (section 2.1). Therefore we just get

C_{exhaled} = C_{art} / P_{blood/air}.

Finally, C_{art} we get again from conservation of mass. The mass flow of PERC entering the lungs has to be equal to the mass flow exiting the lungs. Crucially, there are two ways that PERC can enter and exit the lungs, via the inhaled air+the incoming venous blood or via the exhaled air+the outgoing venous blood. The mass flow [mg/min] is just the concentration [mg/l] times the volume flow[l/min], hence we get


With F_{venous}=F_{art} (volume flow of PERC is negligible compared to volume flow of blood) and plugging in C_{exhaled} = C_{art} / P_{blood/air} we get

C_{art} = \frac{C_{inhaled}F_{alveolar}+C_{venous}F_{venous}}{F_{alveolar}/P_{blood/air}+F_{venous}}.

There are actually still some things missing.

First, F_{alveolar} gets computed from the given pulmonary volume flow via the magical constant .7 as F_{alveolar}=.7F_{pulmonary}.

Furthermore, F_{venous} gets computed using the Ventilation/perfusion ratio (VPR) from F_{alveolar}
as F_{venous}=F_{alveolar}/(VPR).

Stay tuned or not while I update this post.

There are two more magical things we have to know before we can hopefully reproduce the results from the paper.

  • Apparently, the V_{max} [mg/min] from the equations is not the maximal metabolic rate from table 1. Instead, VMI from table 1 has units [mg/min / kg^(-.7)], such that
    V_{max} = \mathrm{VMI}*\mathrm{bodyWeight}^{0.7}.
  • The relative volumes of the four compartments excepting the fat sum to a magically precise .837 (relative to the lean body volume).

The first point makes me wonder whether I’ve got the units right for VMI ([mg/min]?) and KMI([mg/l] to match the C_s?).

As for reproducing the results from the paper, we’ll not do this (yet). For now, consider that

and that

Before we try to fit the full model it might be helpful to look at the ODE a little bit more closely. The whole ODE reads
dC_s/dt=(C_{art}-C_s/P_s)F_s/V_s-\delta_{s,4} V_{max}C_s/(V_s(K_m+C_s))
where s runs from 1 to 4 and \delta_{i,j}=\mathrm{int}(i==j).

There are two things which are not immediately obvious from this formulation. Neglecting the Michaelis-Menten term

  • this is a linear ODE and
  • there should be a perfect anticorrelation between the volumes V_s and the partition coefficients P_s.

Claim 1:
The only term in the ODE without the Michaelis-Menten term that could potentially be nonlinear is C_{art}. However, it’s not:
The only term that depends on our time-dependent state C_s is C_{venous}, which in turn is just
C_{venous}=\sum_s C_{out,s}F_s/F_{venous}=\sum_k C_{k}F_k/(F_{venous}P_k).
All in all we get
dC_s/dt = (\frac{C_{inhaled}F_{alveolar}+\sum_k C_{k}F_k/P_k}{F_{alveolar}/P_{blood/air}+F_{venous}}-C_s/P_s)F_s/V_s
which is linear in the state C_s. We may write the ODE as
dC_s/dt = \sum_k A_{sk}C_k+f_s
with constant source term (which is zero after exposure)
f_s = \frac{C_{inhaled}F_{alveolar}}{F_{alveolar}/P_{blood/air}+F_{venous}}F_s/V_s
and matrix
Claim 2:
The second claim becomes clear once we reformulate the ODE in terms of C_{out,s}=C_s/P_s. We get
dC_{out,s}/dt=dC_s/dt/P_s = \sum_k B_{sk}C_{out,k}+g_s
with new source term
g_s = f_s/P_s = \frac{F_s}{V_s P_s}\frac{C_{inhaled}F_{alveolar}}{F_{alveolar}/P_{blood/air}+F_{venous}}
and new matrix
B_{sk}=A_{sk}\frac{P_k}{P_s}=\frac{F_s}{V_s P_s}(\frac{F_k}{F_{alveolar}/P_{blood/air}+F_{venous}}-\delta_{s,k}).

Stay tuned for the next part where we reveal

  • the exact solution of the linear ODE
  • an attempt at fitting the linear ODE with the original parametrization
  • a hopefully successful attempt at fitting a reduced linear ODE
  • a tool to quickly explore the full, nonlinear ODE and
  • a hopefully successful and quick fit of a reduced nonlinear ODE.

I’d be soooo happy if someone could get the Monster model running in Stan. It’s a great example but it would be so much better if it were runnable. Also it’s buggin the hell out of me that we could it just fine using Metropolis 25 years ago but now we’re getting stuck when using a much better algorithm on computers that are a zillion times faster. So I’d really really really like to get this one straightened out. Anyone who can do this, I promise you a free homemade loaf of bread, an advance look at some blog posts 6 months in the future, and pretty much whatever else you want from me.


Hm, just a short, not-yet-bread-worthy update. I’ll have to work on something else for now, but will revisit this at some later point. Everything that follows has been tested with just a few simulated data sets with the true parameters drawn from the population posterior as specified in the original paper, table 1, so take it with a grain of salt:

  • The linear ODE has a closed form solution, see e.g. Wikipedia
  • The linear ODE with the original parametrization actually fits (almost) fine with and without truncation. For large measurement errors, things are significantly easier than if the measurement is precise. However, for “precise” (log-std < .1) measurement, there is some unexpected “true” multimodality in the population-std parameters. “True” in the sense that we observe different modes in the marginal distributions, chains get stuck in those modes, but Rhat(lp__) is fine (<1.01). There may be a bug in my code.
  • I’m actually not sure whether the reduced linear ODE fits better or worse than the original ODE. It fails or works for the same problems. But I think I remember that it worked better than the original ODE, maybe it was just lucky before?
  • Both linear ODEs fit in the order of seconds or minutes, depending on the magnitude of the measurement noise. The difference in runtime is due to much larger n_leapfrog__ for precise measurements, as then the posterior is “more constrained”.
  • For any of the ODEs, the marginal distributions of many of the parameters do not change much.
  • To cheaply fit the full nonlinear ODE we can in principle make use of a Strang splitting. In practice, there are some floating point overflow issues, so instead of a splitting of the flow into two exact flows, I currently use the exact flow of the linear part + an approximation of the flow of the Michaelis-Menten part obtained via a single step of the implicit Euler method, which can be written down explicitly
  • We can monitor the convergence of our splitting method by recomputing lp__ (e.g. using `log_prob_grad` method for CmdStan mimicking its `generate_quantities` method) for the whole fit with more sub-steps. With this we can estimate neff and decide whether we need more sub-steps. For the whole of the prior we need quite a few sub-steps to be sufficiently precise, but the prior is very wide. For the whole of the posterior we need significantly fewer steps. How many steps we will need depends on the parameter values (larger VMI and KMI require more sub-steps because a linear approximation gets worse) and on the magnitude of the measurement noise (larger measurement noises allow coarser approximations).

There are a few more ideas I’ll try when I come back to this problem.

Ah, I forgot this:

  • To cheaply explore the qualitative behavior of the posterior of the nonlinear model, we may just use a fixed number of sub-steps for the data-generation and fitting. Then fitting the approximation of the full nonlinear model which hopefully behaves qualitatively the same as the true nonlinear model takes again in the order of seconds to minutes.

Oh, I also forgot to attach the obligatory excessively large picture:

Blue/Red: Prior/Posterior
The population-std of Pba misbehaves.
There are two divergences that can be eliminated, but I broke the part of my code that does this.

Almost everything should be as in the paper, except for

  • the prior on the population-stds, which is lognormal with std 1/2. The picture looks slightly worse with the original priors.
  • All timepoints are equidistant, including the exposure start and stop time (this allows us to reuse the matrix exponential) and at all timepoints blood and air concentration get measured
  • The data are synthetic, obtained via bdf with abstol=reltol=1e-12.
  • There’s no truncation.

(The prior on the measurement noise appears different than in the paper because we sample from the prior, but this doesn’t work with the original prior. For fitting we use the original prior.)

We used 8 sub-steps, which appears to have been enough. It looks like the true parameters (black vertical lines) are recovered quite nicely.

Fitting took ~ 3 minutes. Rhat(lp__) < 1.01, E-BFMI ~= 1.

1 Like

Turns out there was a bug in my code. Namely, I accidentally constrained the geometric standard deviation population_eS to be positive, instead of greater than 1. With the personwise parameters obtained from a unit normal via

person_params[person] = (
  population_eM .* pow(population_eS, unit_log_person_params[person])

this indeed led to the observed non-identifiability, as

pow(population_eS, unit_log_person_params[person]) == pow(1/population_eS, -unit_log_person_params[person])

Fixing this, the population-gstd priors and slightly changing the parametrization such that

vector<lower=1>[no_latent_params] population_eS = exp(
    exp(log_log_population_eS_mu + unit_log_population_eS ./ population_eS_nu)

makes the model run quite nicely:

Observe that the average number of leap frog steps dropped “dramatically” from ~316 per transition to only 61. All Rhat values are fine (<1.01), there are 8 divergences but this (I believe) is only because my timestep adaptation for the sampling is overly optimistic, and the divergences should be able to be eliminated.

I fixed some other thing (forgot the scaling of VMI) and now we also observe the shift towards the true lower values for VMI. There is one more thing that confuses me, namely the units of KMI. Figure 4d in the paper makes it look like it has units mg, but to match the equation from the paper it should have the same units as a concentration (mg/l)?

Maybe @andrewgelman could clarify what’s going on with KMI? I think I’m missing something there, without which I can’t build the original model.