Brms multilevel nonlinear model - model fails to converge when converted from individual level to multilevel


#1

I am defining a hierarchical nonlinear model with a custom function using brms. It seems to do a fine job of fitting the model on an individual level, but fails dismally as soon as I take it to a multilevel fit across individuals. I am wondering whether it could be related to my specifying the brms syntax incorrectly, or what I should be trying to address the issue.

(Hopefully) relevant details:

  • The model itself is a pharmacokinetic compartmental model, which is defined by convolving two functions analytically, so the likelihood function is 24 lines long. I defined it as a function to clear up the code, but I have been concerned that something is going wrong with STAN’s engine here.
  • The data consists of about 20 measurements per individual defining a pharmacokinetic curve which increases and decreases
  • There are model weights which are defined relative to the other measurements within each individual (group). So I would like to weight within groupings, and have sigma vary by group.
  • I could fit all of the individual measurements using an nls, and they all look fine.
  • I could also fit everyone at once using a nlme, and the estimates look excellent, so it doesn’t feel like there’s anything wrong with the group or the individuals.
  • I fit the first individual using brms (i.e. not hierarchical) using the same priors as I placed over the multilevel parameters, and it got the same outcomes as the nlme for that particular individual, so the STAN model seems to be working ok.

I’ll show how I specified the models in nlme, brms (single individual) and brms (multilevel) and hopefully that might give some clues.

nlme syntax for the hierarchical MLE model

fit_nlme = nlme(midcsc ~ two_compartment(k1, k2, k3, k4, vb, 
                                                   MidTime, b_pfc, 
                                                   lambda1_pfc, lambda2_pfc, lambda3_pfc, 
                                                   A1_pfc, A2_pfc, A3_pfc, tstar_pfc, 
                                                   b_tot, 
                                                   lambda1_tot, lambda2_tot, lambda3_tot, 
                                                   A1_tot, A2_tot, A3_tot, tstar_tot), 
                data = modeldat, 
                fixed = k1 + k2 + k3 + k4 + vb ~ 1, 
                random = k1 + k2 + k3 + k4 + vb ~ 1, 
                groups = ~ ID, 
                weights = varIdent(form = ~ invweights |  ID),
                start = coef(fit_nls), 
                control = nlmeControl(maxIter = 2000,  
                                      pnlsMaxIter = 15, 
                                      msMaxIter = 200, 
                                      tolerance = 1e-5, 
                                      opt = "nlminb"), 
                verbose = F)

(Note: I inverted the weights into invweights to feed into the variance function)

brms syntax for the individual model

(In this exercise, I have mostly been trying to grok the brms syntax, so the priors are basically placed over the mean of the nls estimates, with a fairly wide margin - parameters can’t be 0. I also filtered the data to only include the first individual in this case).

twotcm_prior <- c(
  set_prior("normal(0.05, 0.05)", nlpar = "k1", lb=0, ub=0.5),
  set_prior("normal(0.2, 0.2)", nlpar = "k2", lb=0, ub=1),
  set_prior("normal(0.05, 0.05)", nlpar = "k3", lb=0, ub=0.5),
  set_prior("normal(0.02, 0.02)", nlpar = "k4", lb=0, ub=0.5),
  set_prior("normal(0.02, 0.02)", nlpar = "vb", lb=0, ub=0.2))

stantwotcm_fit_formula <- bf( midcsc|weights(weights) ~ twotcm_stan(k1, k2, k3, k4, vb, MidTime, 
                                    b_pfc, 
                                    lambda1_pfc, lambda2_pfc, lambda3_pfc, 
                                    A1_pfc, A2_pfc, A3_pfc, tstar_pfc, 
                                    b_tot, 
                                    lambda1_tot, lambda2_tot, lambda3_tot, 
                                    A1_tot, A2_tot, A3_tot, 
                                    tstar_tot, indicator),
     # Nonlinear variables
     k1 + k2 + k3 + k4 + vb ~ 1,
     # Nonlinear fit
     nl = TRUE)

stantwotcm_fit <- brm(
  stantwotcm_fit_formula,
  family=gaussian(), 
  data = modeldat_testone,
  prior = twotcm_prior,
  stan_funs = two_compartment_stan,
  control = list(adapt_delta=0.8), 
  chains = 3)

This fits pretty quickly, the traceplots look great, the diagnostics look pretty good, and the output distributions are pretty normal. It also hits pretty much the same values as the random effects from the NLME model for this individual.

brms syntax for the hierarchical model:

mltwotcm_prior <- c(
  set_prior("normal(0.05, 0.05)", nlpar = "k1", lb=0, ub=0.5),
  set_prior("normal(0.2, 0.2)", nlpar = "k2", lb=0, ub=1),
  set_prior("normal(0.05, 0.05)", nlpar = "k3", lb=0, ub=0.5),
  set_prior("normal(0.02, 0.02)", nlpar = "k4", lb=0, ub=0.5),
  set_prior("normal(0.02, 0.02)", nlpar = "vb", lb=0, ub=0.2),
  set_prior("lkj(1)", class = "cor"))

mltwotcm_fit_formula <- bf( midcsc|weights(weights) ~ twotcm_stan(k1, k2, k3, k4, vb, MidTime, 
                                    b_pfc, 
                                    lambda1_pfc, lambda2_pfc, lambda3_pfc, 
                                    A1_pfc, A2_pfc, A3_pfc, tstar_pfc, 
                                    b_tot, 
                                    lambda1_tot, lambda2_tot, lambda3_tot, 
                                    A1_tot, A2_tot, A3_tot, 
                                    tstar_tot, indicator),
     sigma ~ 0 + ID,
     # Nonlinear variables
     k1 + k2 + k3 + k4 + vb ~ 1 + (1|k|ID),
     vb ~ 1 + (1|ID),
     # Nonlinear fit
     nl = TRUE)

mltwotcm_fit <- brm(
  mltwotcm_fit_formula,
  family=gaussian(), 
  data = modeldat,
  prior = mltwotcm_prior,
  stan_funs = two_compartment_stan,
  control = list(adapt_delta=0.9), 
  chains = 1)

This runs for ages, and has Rhat values of nearly 4, and about 2 effective samples for most parameters. The traceplots look like line plots. The posteriors look worse than I realised it was possible for a posterior to even look.

The sigma ~ 0 + ID line in particular is a sticking point for me: I got it from https://magesblog.com/post/2018-01-30-pkpd-reserving-models/ , but I really don’t quite understand what I’m specifying there. I think I tried it with 1 + ID and I don’t think it worked any better.

Summary
I’m very much hoping that this all comes down to a silly specification issue in the syntax, and that’s what I’m assuming that it is. But I’m just getting started on this project, and the aim is to dig deep and eventually be specifying things in a better way, so I would also not be displeased if this is going to mean learning the ins and outs of STAN syntax, as that would make for a fun project too!

Thank you so much in advance for any help!

  • Operating System: Windows 10
  • brms Version: 2.3.1

#2

First, I would leave sigma as is and not predict it at least for now. If that still fails, you need to specify reasonable priors on the “random” effects standard deviations (via class = "sd" in set_prior). Very likely their default priors are far too wide, which may cause some problems.

Also, you seem to specify formulas for vb twice. I believe the second just overwrites the first, but it is still confusing when reading the model specification.


#3

As another recommendation: Try to make sure that your non-linear parameters are roughly of the same order of magnitute. If one is, say, between 0.001 and 0.01 and another is between 100 and 1000, Stan may have trouble sampling from the model.


#4

Thanks so much for the response!

I tried removing sigma and ran into the same issues. Also noticed that the sigma in the single individual case was concentrated around 0.02, so I tried a much more restrictive prior over sigma to no avail. I guess it must be the SD priors then.

I’ll fiddle with them sometime this week then and get back to you!

Thanks again! I really appreciate the help!


#5

Hi Paul,

Thanks so much for the help - it made everything substantially easier! After lots of fiddling, I have returned mostly victorious, but with a few smaller questions.

I’m hoping that my path of frustration, headscratching and angst might be helpful for others in a potentially similar situation, so I’ll outline it in as much detail as I can.

Things tried that failed

  • Setting priors over the SDs was really helpful, but didn’t completely save it

  • The model was not working on the whole sample, but was initially working on a subsample

  • I had identical code representing the same model working in one Rmd notebook, and not working in the other. I thought I was going mad, but it turned out to be that the randomisation seed made the difference between the model converging (Rhats = 1) and catastrophic failure (some Rhats >100)

  • Even so, one chain might converge, but the others could fail

  • Thinner or thicker priors over the intercepts had little effect. student_t and exponential priors over the SD made little difference either.

  • Messing with the correlation matrix didn’t have much effect

  • Re-adding sigma fitting consistently broke things

  • Removing weights made things a little bit more stable (I’ll get back to this)

  • Increasing the tree_depth definitely made things slightly more stable, but this vanished as soon as I tried a new seed again

  • Multiple combinations of all of the above

What finally worked

inits=“0”: this was something I was avoiding for a long while, since my main parameters could not be equal to 0: I even had boundary conditions specifying that they could only go as low as 0.001. So I guess inits=“0” respects this somehow, but I wouldn’t have known so.

Thank you for the documentation where you specify that this might be helpful for badly behaving chains. Though your documentation provides only the case of essentially constant samples in the random case, where in this case, it stabilised the chains to start low.

My guess at an explanation is that, for this model, parameter values which are too high are where posterior probability goes to die: there’s just none of it there. The model itself is comprised of ratios, so if the denominator is randomly low, and the numerator is randomly high, then it can be way out, and the gradients don’t lead it back to anywhere sensible. If you’d like, I could submit a PR to the documentation to specify that this can also be helpful when the chains’ convergence appears to be highly dependent on the random seed.

In case it’s of help to anyone in a similar situation, my final prior specification looked like this:

mltwotcm_prior <- c(
  set_prior("student_t(3, 0.05, 0.025)", nlpar = "k1", lb=0.001, ub=0.5),
  set_prior("student_t(3, 0.2, 0.1)", nlpar = "k2", lb=0.001, ub=1),
  set_prior("student_t(3, 0.05, 0.025)", nlpar = "k3", lb=0.001, ub=0.5),
  set_prior("student_t(3, 0.02, 0.01)", nlpar = "k4", lb=0.001, ub=0.5),
  set_prior("student_t(3, 0.02, 0.01)", nlpar = "vb", lb=0.001, ub=0.2),
  set_prior("lkj(2)", class = "cor"),
  set_prior("student_t(3,0,0.01)", class = "sd", nlpar="k1"),
  set_prior("student_t(3,0,0.01)", class = "sd", nlpar="k2"),
  set_prior("student_t(3,0,0.01)", class = "sd", nlpar="k3"),
  set_prior("student_t(3,0,0.01)", class = "sd", nlpar="k4"),
  set_prior("student_t(3,0,0.01)", class = "sd", nlpar="vb"),
  set_prior("student_t(3,0,0.01)", class = "sd",  dpar = "sigma"))

What I still haven’t figured out

The model and weights

This is a pharmacokinetic model for a particular neuroimaging method called PET, where we convolve a predicted impulse response function with a measured blood curve to estimate the measured signal. This is usually conducted with an NLS for each individual. What I am doing here is fitting it across individuals.

Due to properties of how we record the signal, we can derive weights which we use when fitting it as an NLS, which we expect to be proportional to the error variance that we expect in each data point within each individual. I realised quite late when working with this that weighting the posterior samples is probably the wrong way to go about this (I just saw weights and thought that sounded right, but didn’t think it through). The better way would be to describe sigma as being proportional to the inverse weights (i.e. error variance). However, some individuals will have higher error, and others less: it it only proportional within individuals.

I have succeeded in getting this model to converge (though the chains for sigma don’t look great) using sigma ~ 0 + invweights + (1 | WAYID) and set_prior("student_t(3,0,0.01)", class = "sd", dpar = "sigma"). Though actually on reflection now, I guess I should be setting it over the SD rather than the variance. But I’ll fix this. But does this sound to you like a reasonable way of specifying this? Also, as far as I can see from the stan code, the prior appears to be set over the value before it is exponentiated - am I correct here?

Outcomes

I’ve adapted my code above for changing stan_funs to stanvars for the new release, and I think this is definitely a more elegant syntax, and it should also allow me to do something I was planning to ask about before. I am fitting four pharmacokinetic rate constants (k1, k2, k3, k4), but typically we aren’t really interested in looking at them in isolation: we combine them to get the quantities of interest, such as Vt = (k1/k2)*(1+k3/k4). So being able to add stuff to generated quantities is brilliant!

It took me a little while to figure out the syntax, but I got it eventually. I’ve just managed to get this working for the intercepts thus far, but will figure out how to get the random effects working soon too. Specifying it as vectors, as well as having to use dot multiplication and division syntax was a bit confusing at first. Let me know if you want me to submit a PR with an example of how to make a genquant value for anyone else having a bit of a headscratcher here.

Next steps

I have currently been fitting the model to one region of each individual. The goal is to expand to fitting the model for each region of each individual, i.e. y within region within individual (since we expect certain parameters to be similar across regions within individuals, and others to differ). Is there a way of specifying this with brms, and if so, do you know of any other nonlinear examples online that I could make reference to?

So the idea is to have something along the lines of

     k1 + k2 + k3 + k4 ~ (1 + (1 + 1|k|ID)|REGION)

This seems like a pretty complicated use case though, so I would understand if this will finally require that I start building off the syntax of the former model and working in stan code directly.

Postscript

Thank you again for all your help! And just wanted to explicitly say that the PK/PD examples shared on the blog of Markus Gesmann (where he also acknowledges your help as being instrumental) were super, super helpful!