Adapt_delta for each entry of a vector: Fitting multiple datasets in parallel, divergences together, clean run when separate

Hello!

I’ve been trying to fit the mean and standard deviation of a variable \textrm{var} linearly over time, simultaneously. Here’s my model
\textrm{var}\sim\textrm{Normal}\Big(\mu_\textrm{slope}t+\mu_\textrm{intercept}, \textrm{softplus}(\sigma_\textrm{slope}t+\sigma_\textrm{intercept})\Big)
Where \textrm{softplus}(x) is the function \log(1+e^x) to ensure positivity of the standard deviation.

I have a dataset which has data points for \textrm{var} over time for several groups. Very simply, I want to fit \mu_\textrm{slope}, \mu_\textrm{intercept}, \sigma_\textrm{slope}, and \sigma_\textrm{intercept} for each group separately. (Later, I will add predictive factors and all of that sort of thing. We aren’t there yet, clearly.)

When I run the entire dataset, making the slopes and intercepts vectors, I get thousands of divergent warnings. But on each group individually, it works fine. I even managed to find a set of 4 group where 1+2, 1+3, and 1+4 are all perfectly fine, 2+3+4 is perfectly fine, but 1+2+3+4 is a trainwreck.

My guess is that there’s some adapt_delta nonsense going on—for some group, it needs a larger stepsize, and for other datasets, it needs a smaller stepsize. Individually, it chooses the right stepsize and moves on. But when all together, it can’t find any stepsize because it needs to do different stepsizes for each entry.

  1. Am I interpreting this correctly?
  2. Is there a way to free up Stan to fit different stepsizes for each entry?
  3. Is there another, more elegant solution that would fix this?
  4. Is this a lack of warm-ups problem? (I gave it as much as I thought my computer could take.)

Thank you!

Hi @maxgotts.

  1. Am I interpreting this correctly?
  2. Is there a way to free up Stan to fit different stepsizes for each entry?
  3. Is there another, more elegant solution that would fix this?
  4. Is this a lack of warm-ups problem? (I gave it as much as I thought my computer could take.)

For point 1 having a full working example would help here in trying to diagnose what’s going on (Stan model definition, example data, and details of what interface you are using and any changes to default sampler parameters)!

For point 2: to a certain extent Stan already does this by adapting both the scalar step size (based on a target acceptance statistic) and a diagonal metric (based on online estimates of target distribution variances in each dimension). This is assuming you are using what I think is the default cofiguration in most interfaces to use the diag_e (diagonal Euclidean) metric type. The step size and metric together determine the scale of proposed moves along the different dimensions of the target distribution and adapting the metric is meant to adjust for there potentially being different appropriate scales along different dimensions.

As the adaptation mechanism for the metric and step size differs, running samplers for each group separately and one sampler jointly fitting all groups (assuming for now all parameters are a-priori independent) won’t perform exactly the same however even with a diagonal metric. As well as the difference in the target for adaptation (acceptance statistics versus empirical estimates of variances), the step size is adapted continuously using a dual averaging algorithm while the metric is adapted in a series of memoryless windows using estimates of the target distribution variances accumulated over the windows (more details in the documentation: MCMC Sampling).

The metric adaptation is slower and so will typically require more warm-up iterations to get decent variance estimates, and so with regards to point 4, too few warm-up iterations may be part of the issue.

The adapt_delta tuning parameter sets a target for the average Metropolis acceptance probability along a trajectory and tunes it with dual averaging during warmup. As @matt-grapham points out, the varying step size effect comes from estimating a diagonal mass matrix, which Stan also does during warmup. We identify the diagonal mass matrix by setting it to the posterior (co)variance, then allowing step size to vary.

Also as @matt-graham says, it’s hard to help with suggestions without seeing more of the model.

I’m curious how softplus works compared to exponentiation—I’ve been trying softplus transforms for various distributions and exponentiation works better in almost all cases I tried—just be sure to implement with log1p_exp(x), which is Stan’s name for softplus (my fault—I hadn’t heard the “softplus” name when this got added to Stan).

What you’ll find is that HMC (including NUTS) works much better when the model is well specified. If it cannot find a good set of parameters to describe all the data, the sampler will struggle. Breaking it down into a per-group estimate should then work and you can give them a hierarchical prior to share strength across groups automatically.

Thank you all for your responses, and I’m so sorry for the slow response I tried this again with exponention as @Bob_Carpenter recommended and that definitely improved it. Since then, I’ve tried using softplus (log1p_exp) in a couple of different circumstances, and every time it makes awful posterior geometry… but I love that it makes an always-positive, linear relationship. Maybe one day I’ll make it work.

Separately from all of this madness, I just wanted to say that I am beyond grateful for all of your work on Stan. It is such a powerful, amazing tool to play with data. It opens up so may different doors, it is a creative puzzle, and it is an awe-inspiring machine. And not only did you make it, support it, and continue fixing, tinkering, and adding to it, you’re also amazingly present in the forums, helping people tirelessly, including ignorant people such as myself (who don’t even provide their model code, what was I thinking??). I am inspired by your dedication, patience, thoughtfulness, and intellects. Thank you so, so, so much for this amazing gift you have given us, and are continuing to give us.