I’ve been trying to fit the mean and standard deviation of a variable \textrm{var} linearly over time, simultaneously. Here’s my model \textrm{var}\sim\textrm{Normal}\Big(\mu_\textrm{slope}t+\mu_\textrm{intercept}, \textrm{softplus}(\sigma_\textrm{slope}t+\sigma_\textrm{intercept})\Big)
Where \textrm{softplus}(x) is the function \log(1+e^x) to ensure positivity of the standard deviation.

I have a dataset which has data points for \textrm{var} over time for several groups. Very simply, I want to fit \mu_\textrm{slope}, \mu_\textrm{intercept}, \sigma_\textrm{slope}, and \sigma_\textrm{intercept} for each group separately. (Later, I will add predictive factors and all of that sort of thing. We aren’t there yet, clearly.)

When I run the entire dataset, making the slopes and intercepts vectors, I get thousands of divergent warnings. But on each group individually, it works fine. I even managed to find a set of 4 group where 1+2, 1+3, and 1+4 are all perfectly fine, 2+3+4 is perfectly fine, but 1+2+3+4 is a trainwreck.

My guess is that there’s some adapt_delta nonsense going on—for some group, it needs a larger stepsize, and for other datasets, it needs a smaller stepsize. Individually, it chooses the right stepsize and moves on. But when all together, it can’t find any stepsize because it needs to do different stepsizes for each entry.

Am I interpreting this correctly?

Is there a way to free up Stan to fit different stepsizes for each entry?

Is there another, more elegant solution that would fix this?

Is this a lack of warm-ups problem? (I gave it as much as I thought my computer could take.)

Is there a way to free up Stan to fit different stepsizes for each entry?

Is there another, more elegant solution that would fix this?

Is this a lack of warm-ups problem? (I gave it as much as I thought my computer could take.)

For point 1 having a full working example would help here in trying to diagnose what’s going on (Stan model definition, example data, and details of what interface you are using and any changes to default sampler parameters)!

For point 2: to a certain extent Stan already does this by adapting both the scalar step size (based on a target acceptance statistic) and a diagonal metric (based on online estimates of target distribution variances in each dimension). This is assuming you are using what I think is the default cofiguration in most interfaces to use the diag_e (diagonal Euclidean) metric type. The step size and metric together determine the scale of proposed moves along the different dimensions of the target distribution and adapting the metric is meant to adjust for there potentially being different appropriate scales along different dimensions.

As the adaptation mechanism for the metric and step size differs, running samplers for each group separately and one sampler jointly fitting all groups (assuming for now all parameters are a-priori independent) won’t perform exactly the same however even with a diagonal metric. As well as the difference in the target for adaptation (acceptance statistics versus empirical estimates of variances), the step size is adapted continuously using a dual averaging algorithm while the metric is adapted in a series of memoryless windows using estimates of the target distribution variances accumulated over the windows (more details in the documentation: MCMC Sampling).

The metric adaptation is slower and so will typically require more warm-up iterations to get decent variance estimates, and so with regards to point 4, too few warm-up iterations may be part of the issue.

The adapt_delta tuning parameter sets a target for the average Metropolis acceptance probability along a trajectory and tunes it with dual averaging during warmup. As @matt-grapham points out, the varying step size effect comes from estimating a diagonal mass matrix, which Stan also does during warmup. We identify the diagonal mass matrix by setting it to the posterior (co)variance, then allowing step size to vary.

Also as @matt-graham says, it’s hard to help with suggestions without seeing more of the model.

I’m curious how softplus works compared to exponentiation—I’ve been trying softplus transforms for various distributions and exponentiation works better in almost all cases I tried—just be sure to implement with log1p_exp(x), which is Stan’s name for softplus (my fault—I hadn’t heard the “softplus” name when this got added to Stan).

What you’ll find is that HMC (including NUTS) works much better when the model is well specified. If it cannot find a good set of parameters to describe all the data, the sampler will struggle. Breaking it down into a per-group estimate should then work and you can give them a hierarchical prior to share strength across groups automatically.