A new parametrization transform a set of constrained parameters between a lower and upper bound, such that: p_unc = f§, where p_unc is an unconstrained set of parameters. The gradient of the log posterior must be multiplied by the Jacobian of the inverse transform to sample in the unconstrained space. The Jacobian terms could be very close to zero if the constrained parameters are close to their bounds. I am wondering, how does the algorithm handle this situation in the proposal distribution of the HMC algorithm?
It quite possibly can’t handle this situation numerically, even if you have a well-posed posterior distribution. The NUTS algorithm tries to adapt its tuning parameters during the warmup period but this may be largely unsuccessful when you have log-Jacobian terms that are very large in magnitude for some parts of the parameter space and relatively small in other parts. In that case, at the end you will get a warning about divergent transitions and you may well have to find an alternate transformation to draw from this posterior distribution.
Do you think that using a penalty function with the likelihood may solve the problem? It think it may prevent candidates suggested by the proposal distribution to stay stuck near the bound.
I doubt the Markov Chain will get stuck but NUTS may diverge when a proposal gets too close to a bound. Basically, there should be no posterior density at the bounds, which can be accomplished via a prior that places zero density at the bounds and almost zero density near the bounds.
Stan uses a NUTS-like algorithm by default, which is not a form of Metropolis in the sense of having a proposal distribution and Metropolis-Hastings correction.
If you have probably mass near boundaries, you’ll have statistical and computational problems and will need to take the steps @bgoodri suggests to remediate.
We generally do not recommend interval-bounded priors unless you have physical or logical constraints like requiring a correlation to be in [-1, 1] or a probability to be in [0, 1] or a component of an ordered vector to be between the previous and following member of the vector.
I have a related question: I have a model where I have a constraint on a parameter <lower=0, upper=1> because only values in this range make sense (in terms of how this model appears to be used in the field).
However, I would expect many people to have a value of 1 (or very close to 1) - and this is also what I find in the fitted parameters. I assumed that this would be fine with boundaries <lower=0, upper=1>, but from this post it sounds like this is not good? What should I do instead, should I change my boundaries to be e.g. <lower=0, upper=2>?
I should also add that I have also given this parameter a prior normal(1,1), because I thought that the value 1 should be more likely than other values - maybe this is also causing problems?
I also wonder how I can see in my model that I have problems because of the boundaries? The model does converge (500 sample warmup+ 500 iterations take 10h per chain though). I also have divergent samples, I tried to check whether they occur when this parameter is close to the boundary; it’s difficult to say because there is two parameters with these constraints for each of 75 participants, but it does not look like they are more likely to be divergent at the boundary.
Is the constraint to
lower = 0, upper = 1 a necessary restriction, like for a parameter to a bernoulli? Then you have to deal with the probability piling up some way or another.
The second case you get something like this is stationary parameter space for time series. Some values of parameters lead to stationary processes. You can force your parameters into that regime, but it’s often better to let the model and data adjudicate the posterior rather than a hard constraint.
We generally don’t recommend hard boundaries at all unless absolutely necessary because they don’t act like uniforms in maximum likelihood solutions—that is, if you get mass near the boundary, it’ll bias the posterior compared to a wider uniform.
normal(1, 1) isn’t going to do a whole lot between 0 and 1 other than slightly favoring 1–it shouldn’t be a computational problem.
The divergences are marked from where the path starts, not from where it diverges, so it can be hard to tell.
You an do sensitivity tests in the usual way by trying different priors (like wider intervals or no intervals with soft constraints) and see what they do to the posterior.
Things like posterior predictive checks should let you diagnose whether your model can model your data adequately.
I disagree with this. It is essentially the same as saying a covariance matrix should not have to be positive definite. An explosive time series process is just as useless as negative variance.
I think that’s just a matter of determining what’s a hard constraint for the problem.
Why is an explosive time series process useless? Aren’t a lot of time series explosive?
Wouldn’t you only want the hard constraint if you know you’re dealing with a stationary process?
You may want to take this up with Andrew, as he’s the one who’s always saying this. I don’t personally have much experience with real-valued time series.
I have tried, but it doesn’t help any. If the value at time
t is Gaussian conditional on the value at time
t - 1, then for example, appendix 1 of
lists the marginal distribution of the value at time
t is Gaussian with variance proportional to
1 / (1 - phi^2). So, if the AR1 coeficient (
phi) is outside of the [-1,1] interval, then your “generative” model implies a negative marginal variance for the value at time
t = 0. Also, I can’t find a link at the moment, but I remember that in a Gaussian AR1 model, the likelihood function works out to the same value whether the AR1 coefficient is
1 / phi, so you basically have to pick the parameterization that makes sense.
thank you very much for the explanation, this is very helpful.
One reason I thought I needed to have a hard upper boundary is because parameters are likely to be between ~ 0.5 and 1 (most likely at 1), but unlikely outside this interval (and definitely not below 0). But now I’m thinking that I could instead maybe use skew_normal, centred on 1, with a skew towards 0 and only use a lower boundary. This could work? I’m running it now with simulated data to see whether I can recover simulated parameters…
That’s what I was looking for. That sounds like it needs to be a hard constraint then for the model to make sense. I just thought it meant you’d get something like drift.
That makes it sound like
lower = 0 is a hard constraint, but
upper = 1 is a soft constraint in that you don’t expect estimates above 1, but they don’t break the logic of the model. That’s the situation for which we’ve been recommending softer constraints.