Warmup variance estimate zero?

Today I got a traceplot looking like this (only a few parameters are shown):

I observe:
Some parameters don’t move.
Some parameters look jumpy (they don’t move in every iteration), and they move so little that the y-axis labels don’t have enough digits to show the difference.
Which parameter behaves which way varies between model runs.
Sometimes there are parameters that look more okay (they move in every iteration).

What I think is happening:
Since some parameters do move, some iterations are obviously non-divergent. Therefore the non-movement of the other parameters can’t be caused by divergences.
This model forces Stan to decrease the stepsize massively during the first part of adaptation (checked: stepsize < 2^-50). I suspect this reaches a point where some parameters jump so little that they run into a numerical problem, leaving them with a jump size of exactly zero. That could produce a zero variance estimate for such a parameter, which the mass matrix would then be based on (I didn’t check the mass matrix). Consequently these parameters can never recover, and they continue to take zero-length jumps even as the stepsize increases during later parts of adaptation. (The stepsize was still very low at the end of adaptation, though, so even the parameters that do move move very little; a longer warmup would have been required.)
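To make the hypothesized feedback loop concrete, here is a minimal numerical sketch in Python (my own illustration, assuming a hypothetical diagonal metric built directly from the sample variances of the warmup draws; Stan’s actual adaptation is more involved, so this only shows the mechanism I have in mind):

import numpy as np

eps = 2.0 ** -50                 # a tiny adapted step size, as observed
q = np.array([1.0e3, 1.0])       # two parameters, one with large magnitude
p = np.array([0.5, 0.5])         # some momentum
increment = eps * p              # position update with a unit metric

q_new = q + increment
print(q_new - q)                 # first coordinate: exactly 0.0 (absorbed); second: tiny but nonzero

draws = np.full(100, q[0])       # warmup draws for the frozen coordinate are all identical
var = draws.var()                # sample variance is exactly 0.0

# A diagonal metric built directly from this variance would scale every future
# position update of this coordinate by zero, so it could never recover,
# even after the step size grows again.
print(var * eps * p[0])          # 0.0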

Question 1: Does this interpretation seem plausible?
Question 2: If so, Stan might benefit from some mechanism that disallows zero-variance mass matrices, or prevents this from happening in some other way. That might allow it to recover from such a situation, given a long enough warmup. (The point of this post was to bring up this suggestion.)

Edit: In practice a recovery from such a situation might take too long, even if it is possible. This might improve with the campfire warmup routine.

Edit 2: Feel free to strike this down if you think it’s not relevant in practice. It was more of a spontaneous thought than a well-reasoned suggestion.

2 Likes

The plot shows severe problems. It doesn’t matter which parameters move and which don’t.
Without any details about the data and model it’s impossible to say anything about the problem.

Just to make this clear in case it isn’t:
I’m not posting this to get help with the model, but because it seemed like a bad idea to me that the sampler allows the variance estimate to drop to zero, and I wondered whether it would be helpful to have a mechanism that prevents this. Whether this admittedly rather academic issue is relevant in practice is the part of the question I can’t answer myself, which is why I’m asking on this forum.

2 Likes

We can’t offer advice in a vacuum. Any number of things may be going on, including a bug in the program or that you’re monitoring the diagonal elements of a correlation matrix.

As a start, I’d suggest running warmup much longer than 500 iterations, as the chain clearly hasn’t converged.

What you’re seeing in those flat parts of the trace plots is the whole chain getting stuck. That happens when the step size is too large for the curvature and the Hamiltonian diverges during simulation. It might not diverge enough to trigger a warning, but enough that all the proposals are rejected.

If Stan takes the step size down to a ridiculously small value, it means the system is very stiff (one dimension moves much faster than others), and the only hope for the problem is to reparameterize.

Stan would benefit from anything that makes the expected squared jump distance per iteration larger. That reduces autocorrelation time and increases effective sample size. Did you have a proposal for how to do this?

If the condition number of the Hessian varies across the posterior, a static metric isn’t going to help.

I have the feeling opening this thread might have been a bad idea. It seems to cause more confusion than it’s worth. But I’ll carry on for the time being.

I’m not asking for advice, merely relaying an observation with accompanying thoughts.

It looks the same with 2000 warmup iterations, and I may have tested longer warmups as well. I’ve looked at the warmup traces of various diagnostics (stepsize, treedepth, divergences, acceptance rates, possibly more), and it seemed like they had reached a stable state and weren’t improving anymore.

But shouldn’t all traces be flat in that case? As far as I know, the sampler can’t accept some parameters while rejecting others.

I’m not familiar with that term. I’ll look into it. Thanks. (Although that wasn’t the point of this thread.)

No, I don’t have a proposal. I should note that my point was not about expected jump distances, but about putting a lower limit on them so they can’t drop to zero.

1 Like

I think there is generally some resistance in the dev team against people “just suggesting” improvements to the Stan algorithms based on one or just a few cases where they “see something”. This may come off as a bit hostile, and my personal opinion is that we should be better at handling such suggestions; at the very least we should better manage the expectations of people who suggest improvements. We want people to mess with Stan and its algorithms, and we want people to invest their time to try to make Stan better. Nevertheless, the skepticism towards suggestions such as yours is partially warranted by the fact that most corrections to the algorithms don’t work, and AFAIK the dev team generally has many more promising ideas for Stan than capacity to investigate/test/implement them.

If you want to contribute to algorithm development, I think that’s great, but currently this means a fairly high burden of proof for any suggested improvement before it is even considered for discussion. Some hints that could help get traction on the idea (though I am not part of the more “core” dev team, so maybe @Bob_Carpenter or others would have different opinions):

  • Inspect the mass matrix and check that it really has zero variance (via the diagnostic file)
  • Create a fork of Stan and hack it to avoid those zeroes - does it help for this specific model?
  • Find at least one practically relevant model (preferably multiple) where this zero-avoiding hack improves sampling.
  • Try to find what could break - are there reasons why actual models would need zero or very-near-zero variances to sample well (e.g. extremely narrow priors)?

This might be a lot of work and I totally understand if you don’t want to put it in. But without a stronger case for its potential to improve Stan, your suggestion is IMHO currently unlikely to be seriously entertained. Still, it’s good that it’s here on the forums, and maybe someone with a similar problem will stumble upon it and find it useful in the future.

3 Likes

I can fully understand (and anticipated) this. I’m still figuring out the best way to contribute (hoping not to cause too much collateral damage on the way). This probably means that I will try to focus on bigger, more substantiated contributions in the future. Thank you for speaking openly about this; it helps me a lot in figuring out the best approach.

I guess that is the case for most such teams. Still, there are many great simple ideas out there that get overlooked (I’m not talking specifically about Stan here, but more generally about any type of project). Sometimes interesting new ideas and viewpoints come from people with a different background, who tend to be outsiders to the projects the ideas are about, and I feel these often get overlooked. At the same time, I absolutely understand that it is difficult and time consuming for project teams to differentiate between the better and the worse ideas, so teams might be forced to spend (much) less time evaluating them than they’d otherwise deserve.

I really tried to limit collateral damage in the opening post by being very open about the fact that this was just a spontaneous idea, rather than a well-reasoned suggestion, let alone a request. I feel the respondents might have read a bit too much into it; their answers certainly were longer and more elaborate than I expected. But that is how things go, and people like me who like to post such thoughts should be aware of such possible consequences. Still, I wonder if there is a better way or place for such ideas, and if not, whether it would be worthwhile to create one.

I agree that this would be the way to go.

I’d be totally fine with it being ignored. Seriously, I just wanted to throw the thought into the room in case someone finds it interesting (I probably shouldn’t have used the word “question” in the opening post). And when I find time for substantial work on Stan improvements again, I have other, more important projects of my own to finish first anyway (e.g. the thread about circular variable sampling and the one about divergence diagnostics).

3 Likes

Are you talking about my reply to @Raoul-Kima above?

That certainly wasn’t my intent. I’m honestly trying to stay constructive. I was the one who was so adamant that our forums be respectful to new users in the first place. I absolutely don’t want to chase them off the lawn. In fact, I used to have @martinmodrak’s job of policing the forums for several years because I thought it was so important to have a welcoming forum in contrast to what I’d seen in other projects.

In this instance, I was trying to clarify what was being proposed. @Raoul-Kima responded that it was a suggestion to just lower bound stepsize in the algorithm.

I do. The reason step sizes drop is that the algorithm can’t find a step size small enough to avoid divergences when simulating the Hamiltonian. This can’t be fixed by putting a lower bound on the step size; that will just keep rejecting proposals and never move.

I’m trying to help users understand the properties of the algorithms so they can avoid wild goose chases. When I responded, I didn’t understand that the proposal was to add a simple lower bound, because it’s well known that this won’t work.

3 Likes

Thanks for continuing to engage here. I really wasn’t trying to chase you off the lawn!

It’s not so much overlooking ideas as being inundated with them. We get more suggestions from users and in the computational stats literature than we can possibly even read, much less evaluate. Honestly, I can’t even come close to evaluating all the ideas that I have, much less that everyone else has.

But we do try to respond on the forums to everyone who engages us directly.

Some ideas we know can’t work because of both experience and theory, like lower-bounding step size. Others we know don’t work for our use cases because we or others have evaluated them pretty thoroughly, like simple ensemble sampling. Others work, but we’re unsure of their utility for our users, like control variates. Some we just don’t know about, but don’t have time to evaluate, like all the new samplers out there. Many suggestions are just too vague, in the sense that we wouldn’t know how to implement them even if we wanted to. These ideas often cut across CS, applied math, and stats, so they can be hard to communicate; we struggle with that in our own writing and grant proposals.

The real problem is that we can’t know how much time it’s worth spending on evaluating an idea until we know the answer. That’s why we try to get the proposer to evaluate things or wait until positive results are replicated in the literature. Then if they look promising, we can include them in Stan.

If this is still coming off as hostile, could you please let me know where?

3 Likes

You didn’t come across like that at all. I think @martinmodrak was making a general point there, not one about your particular answer.

Not at all!

I get that. I’ll do my best. Not on this thread’s topic, though, since, as I wrote, I have more ideas than I can evaluate myself, too. The idea in this thread was one where I thought there was a chance that one might agree at first glance on the problem and solution I was proposing. If it takes more than a first glance to evaluate, I totally agree that there are better things to do than care about this. After all, the suggested “improvement” would probably at best only help in some very extreme corner cases. This is why I intentionally kept the opening post short and included only very little information.

The thing is just that some fruits are very small but also hanging very low, and I thought there was a chance that this might be one of those. The main lesson I’m taking from this thread is that I shouldn’t post about such “possibly small, low-hanging fruits” anymore, or at least not without a better evaluation. Or maybe I just need to learn how to communicate my intention better. (Is there an “Attention: crazy idea / possibly small, low-hanging fruit. Ignore if it isn’t.” badge that can be attached to threads?)

Maybe I misused the word stepsize. I didn’t mean the stepsize parameter that gets multiplied with the estimated variances of the model parameters to determine the size of the jumps of the individual variables in each leapfrog step; what I was talking about is the size of the leapfrog steps of each individual variable, considered separately.
My interpretation of the traceplot was that for some variables this per-variable step size was zero, while for others it wasn’t. With a zero step size the sampler can’t work, as it stops exploring. Conversely, divergences might be caused by variables other than the ones whose step size drops so low, so the sampler might still continue to work with the lower bound in place.
In addition, my interpretation was that the drop to zero was caused by numerical issues arising from the very low value of the “stepsize parameter” (the one that influences the step sizes of all model parameters), so the idea was that putting a lower bound on the step sizes of the individual parameters would just be a different (better) way to handle these numerical inaccuracies. Again, all assuming that my interpretation of the traceplot was correct.

2 Likes

That’s not our intent!

We love hearing new ideas and can often provide feedback to save you time in the evaluation. Lots of these things we’ve evaluated ourselves.

We use “stepsize” exclusively for the discretization period of the Hamiltonian simulation. Here’s the leapfrog, where q[t], p[t] are position and momentum at time t, pi(q) is the target density (its contribution to the Hamiltonian is the potential energy -log pi(q)), and M is the mass matrix estimated from posterior draws:

p[t + epsilon/2] = p[t] + epsilon/2 * grad log pi(q[t])
q[t + epsilon]   = q[t] + epsilon * inv(M) * p[t + epsilon/2]
p[t + epsilon]   = p[t + epsilon/2] + epsilon/2 * grad log pi(q[t + epsilon])

What we call “stepsize” is the epsilon term. That’s the amount of time one step advances the simulation. We take two half steps in momentum p and one full step in position q. I wrote it this way to make the indexing by continuous time explicit (this is how Neal presented it in his overview article).
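In code, a single leapfrog step looks roughly like this (a minimal sketch assuming a diagonal inverse mass matrix and a user-supplied gradient of log pi, not Stan’s actual implementation):

import numpy as np

def leapfrog_step(q, p, epsilon, grad_log_pi, inv_mass_diag):
    # half step in momentum, full step in position, half step in momentum
    p_half = p + 0.5 * epsilon * grad_log_pi(q)
    q_new = q + epsilon * inv_mass_diag * p_half
    p_new = p_half + 0.5 * epsilon * grad_log_pi(q_new)
    return q_new, p_new

# Example: standard normal target, so log pi(q) = -q^2 / 2 and grad log pi(q) = -q.
q, p = np.array([1.0]), np.array([0.5])
q, p = leapfrog_step(q, p, epsilon=0.1,
                     grad_log_pi=lambda q: -q, inv_mass_diag=np.ones(1))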

Can you phrase your suggestion in terms of the leapfrog algorithm?

I’m still not sure what you mean by “stepsize”, but you can evaluate these claims by looking at the traceplot. Usually what happens is that the whole chain freezes in areas of high posterior curvature: it uses a step size that’s too large, and even the first leapfrog step leads to a state where the Hamiltonian has diverged so badly that the proposal is essentially always rejected, so the chain can’t move.

2 Likes

I’ll try to be short and precise in my answers so we don’t bloat this thread unnecessarily.

My suggestion could be implemented by constraining M in such a way that abs(q[t + epsilon] - q[t]) (where abs acts element-wise) cannot have an expected value of zero in any of its dimensions. My assumption was that the zeroes visible in the traceplot in the opening post can occur when epsilon is very small, because due to numerical inaccuracies small values of q[t + epsilon] - q[t] can become exactly zero.
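A minimal sketch of the kind of constraint I mean (purely hypothetical; the function name and floor value are made up, and Stan’s actual adaptation code surely looks different):

import numpy as np

def floored_diag_metric(warmup_draws, floor=1e-10):
    # Hypothetical: per-parameter variance estimates for a diagonal inverse
    # mass matrix, lower-bounded so that no entry can be exactly zero.
    var = warmup_draws.var(axis=0)
    return np.maximum(var, floor)

# A parameter that never moved during warmup would otherwise get a zero entry
# and hence zero-length position updates forever.
draws = np.column_stack([np.full(100, 3.7), np.random.normal(size=100)])
print(floored_diag_metric(draws))   # first entry floored, second as estimated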

The traceplot (shown in the opening post) shows that only some variables freeze, while the others continue to move. Which of the variables freeze varies between consecutive MCMC runs.

Then I’m not quite sure what I’m doing wrong yet. I’ll keep learning. Surely the size of the fruit must be taken into consideration, because of the overhead involved in posting such a question/suggestion.

The size of the fruit is very hard to predict before you grow it and eat it!

I’d check the actual draws rather than relying on the reduction to a visual. They’re probably moving a little bit. The only way a parameter won’t move is if the momentum is zero. In the simple case of a unit metric, that only happens when the gradient is zero (or underflows to zero).

The only way that we get q[t + epsilon][i] - q[t][i] == 0 (position for parameter i doesn’t change over time epsilon) is when the momentum is zero (usually from underflow). In that case, the geometry’s telling us that the amount to move q[t][i] up to machine precision is zero.

Just building a little bump like this into Euclidean HMC based on the values of q[t] and q[t + epsilon] breaks the detailed balance proof, so you don’t even know you have the right stationary distribution. Euclidean HMC requires a constant mass matrix M (compare Riemannian HMC). Instead, you can alternate random-walk Metropolis (RWM) and HMC. The RWM steps will try to move you in every dimension all the time, or can randomly choose a subset of dimensions. It also highlights the difficulty: it’s almost impossible to move very far using RWM in high dimensions, especially in the places where HMC gets stuck.
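For concreteness, one RWM update of the kind I mean might look like this (a minimal sketch; the proposal scale and target density are placeholders):

import numpy as np

def rwm_step(q, log_pi, scale, rng):
    # Propose a jitter in every dimension, then accept or reject
    # with the standard Metropolis probability.
    proposal = q + scale * rng.normal(size=q.shape)
    if np.log(rng.uniform()) < log_pi(proposal) - log_pi(q):
        return proposal
    return q

# Example: standard normal target in two dimensions.
rng = np.random.default_rng(0)
q = np.zeros(2)
q = rwm_step(q, log_pi=lambda x: -0.5 * x @ x, scale=0.1, rng=rng)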

If you have a more subtle algorithm for which you can prove you get the right stationary distribution, I’d suggest you write that out. That’s usually where we start as it’s a minimum requirement to ensure we’re starting with an algorithm that has a chance of getting the right answer.

I thought that if the stepsize is very small, this (underflow) can also happen for gradients that are not zero.

I wasn’t thinking of applying this during the sampling phase, only during adaptation. Basically just as a way to survive the underflow phase until the metric has settled into more sensible values and the stepsize increases again, thus preventing the underflow issue. (Whether that can actually happen in practice is another open question.)

Anyway, I feel this topic is not worth the effort needed to carry on with the question. Finding out whether it is worth that was kinda the point of opening the thread in the first place. So one could say that the thread’s question was answered, kinda.

Good point. The condition for getting stuck numerically is given by the leapfrog position update

q[t + 1] = q[t] + p[t + 1/2],

where q[t], p[t] are position and momentum at time t (with the step size and inverse metric folded into p for simplicity). In floating point this results in q[t + 1] == q[t] whenever

epsilon * abs(q[t]) > abs(p[t + 1/2]),

where epsilon is machine precision (roughly; the exact threshold depends on rounding).
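A quick numerical check of that condition (a minimal sketch; the exact threshold depends on the rounding, but the effect is the same):

import numpy as np

q = 1.0e6          # current position, large magnitude
p_half = 1.0e-12   # tiny update (step size and metric already folded in)

print(np.finfo(float).eps * abs(q) > abs(p_half))   # True: the condition holds
print(q + p_half == q)                              # True: the position is frozen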

OK. It’s a lot of work to investigate an idea like this more thoroughly!