Stan hangs after finishing a chain; earlier versions with nearly identical code did not do this

Hi all,

I’m re-running some older Stan code after a few months to confirm results for a manuscript review, and I’m now experiencing issues that I did not encounter before. In July 2019, I ran the following scripts on a CentOS cluster at my university with R 3.4.1 and Stan 2.19.1. For some cursory hardware specs, the node had a first-generation 64-core AMD EPYC CPU with 512 GB of RAM. stan_adriana_pools6_50.r is an R script that calls Stan to sample posteriors for AWB and CON ODE models conditional on the adriana_data_s_min.csv data set and then compares their out-of-sample predictive accuracy using the loo package.

stan_adriana_pools6_50.r (26.9 KB)
AWB_adriana_pools6.stan (8.6 KB)
CON_adriana_pools6.stan (6.9 KB)
adriana_data_s_min.csv (760 Bytes)

In July 2019, the scripts ran fine. The larger AWB model ran much more slowly than CON and had some divergent transitions because of model stability and parameterization issues that resulted in suboptimal geometry, but I was able to obtain posteriors for both models and use loo.

With the COVID pandemic resulting in work-from-home arrangements in the U.S., the cluster at my university has some hardware issues that can’t be immediately fixed, so I’ve had to try running the scripts on my MacBook Pro. I am running R 3.6.3 with Stan 2.22.1 on OSX Catalina 10.15.4. CON works fine, but AWB now hangs for an unpredictably long time after each chain completes. In rarer cases, sampling proceeds to the next chain after hanging for 1 - 2 hours, but I’ve usually been killing the process after it has hung for 5 - 6 hours without moving to the next chain. In comparison, when I ran AWB on the cluster, I waited about 10 - 20 minutes of post-chain processing after a chain was complete before things continued.

I originally ran last year with adapt_delta = 0.9995, stepsize = 0.001, and max_treedepth = 15. I’ve been experimenting with both higher and lower HMC settings, but that has not stopped the hanging. The most substantive changes I’ve made to my code are: I added a corrected log_lik vector calculation in the generated quantities block, since the previous way I computed log_lik was wrong, and I changed integrate_ode_rk45 to integrate_ode_bdf for my ODE integration. I don’t think the ODE integration is the problem, as it still hangs when I switch between rk45 and bdf.

Is this issue related to the divergent transitions I experienced with AWB my previous time running the scripts last year? Or could this also result from OSX Catalina issues? I’m a bit at a loss, so any help would be greatly appreciated, thank you.

Hi, it looks like your question slipped through - did you manage to resolve it? At first glance, I would suspect the cost of transferring data between R and Stan might be substantial, so that might contribute. I suggest you try the cmdstanr package and see if the problem persists there. See https://mc-stan.org/cmdstanr/articles/cmdstanr.html for how to fit with CmdStanR and then convert the results to a stanfit object that should work with all your previous analysis code.
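A minimal sketch of what that workflow could look like - I'm assuming your data list is called `stan_data`, so adjust the names to match your script:

```r
library(cmdstanr)

# Compile and sample with CmdStanR instead of RStan
mod <- cmdstan_model("AWB_adriana_pools6.stan")
fit <- mod$sample(
  data = stan_data,       # placeholder name for your existing data list
  chains = 4,
  adapt_delta = 0.99,
  max_treedepth = 12
)

# Convert back to a stanfit object so your existing rstan/loo code keeps working
stanfit <- rstan::read_stan_csv(fit$output_files())
```

Since CmdStanR runs the sampler in a separate process and writes draws to CSV, this would also help isolate whether the hang is in the sampler itself or in RStan's in-memory handling of the results.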

Also:

If you were unable to sample without such extreme settings, this very likely indicates some problem with your model, potentially even damaging your inferences (are your Rhat and n_eff OK?). I guess that mid-review is not the best time to debug the model, but I would guess you could get the model to run much faster if you found the source of the computational issues that force those extreme settings.

Best of luck with your model!


Hi Martin,

Thank you for your follow-up. I was not able to fully resolve it, but downgrading back to Stan 2.18.1 and R 3.4.1 stopped the hanging. People going through to make sure all questions are answered is one of the things I love about this community.

For my ensuing project, I’m going to reformulate the ODE model that is experiencing issues and will go back to newer versions of R (3.6.x) and Stan (2.20.x). I will keep the CmdStanR tip in mind.

The presence of divergences is not ideal and indicates some damaged inferences, but I think the damage may be modest. I have been able to de-extremize my adapt_delta, stepsize, and max_treedepth to 0.99, 0.1, and 12 respectively without generating more divergences. An adapt_delta below 0.9 seems to produce more divergences beyond the baseline that this model’s geometry yields no matter what, so I’m still leaving it at 0.99. From my understanding, the stepsize parameter is only the initial step size and is tuned subject to adapt_delta - is that correct?
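For concreteness, I’m passing these via the control list in rstan (model and data object names are abbreviated placeholders here, not my actual script's names):

```r
library(rstan)

fit <- sampling(
  awb_model,            # compiled stanmodel object (placeholder name)
  data = stan_data,     # data list (placeholder name)
  chains = 4,
  control = list(
    adapt_delta   = 0.99,  # target average acceptance probability
    stepsize      = 0.1,   # initial step size; adapted during warmup
    max_treedepth = 12
  )
)
```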

The k-diagnostic output from loo has been okay. My Rhats are all finishing around 1, and my n_eff / n ratios have generally ranged from 0.7 to 0.9.

It’s just odd that the hanging does not occur in the older versions of R and Stan. Could updates to other dependent packages, such as Rcpp, have changed things?

Unfortunately, this is probably beyond my understanding, but maybe some of the more active devs (say @bgoodri or @stevebronder?) would know more. You could also check whether the C++ code generated by Stan has changed between the versions - you should be able to get the code with

model <- rstan::stan_model(file = "model.stan")
cat(model@model_cpp$model_cppcode)

I think it is likely there are changes you could make to your model so that it runs faster and without divergences at less extreme settings. I’ve skimmed the Stan files and I won’t pretend I know what is going on, but my limited experience with ODE models is that often not all parameters are identified, or there are some weird relationships between them. Since you seem to need hard bounds on your parameters, I would guess you are hitting some of those issues (e.g. I would hazard a guess that as the DOC-to-SOC transfer coefficient goes near zero, the SOC-to-DOC transfer coefficient can attain almost any value without changing the final ODE solution noticeably). Clever reparametrizations can help, but those usually require deep dives into the model math, so it may not be something you want to invest your time in.

For reference, here is a short paper I wrote about a (completely different) ODE model and the problems with the “default” parametrization as used in the literature: https://www.biorxiv.org/content/10.1101/352070v1.full-text

Alternatively, the hard bounds themselves could be causing the fitting issues. Replacing them with only the physically necessary bounds (e.g. <lower=0, upper=1> for the transfer coefficients and <lower=0> for the activation energies) and then adding priors to softly guide the model away from configurations you don’t consider plausible could also help.
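A sketch of what that could look like in Stan - the parameter names and prior scales here are hypothetical, not taken from your actual model:

```stan
parameters {
  // keep only the physically necessary hard bounds
  real<lower=0, upper=1> transfer_doc_to_soc;  // transfer coefficient
  real<lower=0> Ea;                            // activation energy
}
model {
  // soft priors steer the sampler away from implausible regions
  // instead of relying on tight hard bounds
  transfer_doc_to_soc ~ beta(2, 2);
  Ea ~ normal(60, 20);  // hypothetical scale, e.g. kJ/mol
}
```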

Best of luck with your model!
