Divergences in a non-centered computational model

Thanks Michael. This is very helpful.

I understand that you can’t comment on each specific model, so let me try to rephrase the question in a more general way. I hope it is OK.

I understand that sampling and divergences concern the unconstrained space. However, won’t the transformations applied to the parameters have an effect on the required step sizes at different locations in the unconstrained space as well (especially since the lp is defined by the constrained space, at least in my model)?

More specifically: the most commonly discussed issue in centered-parameterized models has to do with low values of the group scale parameter. The question is whether, particularly when all parameters are passed through a probit transformation, we would always expect the opposite pattern. That is, because such a transformation squashes extreme values, the case of high group variance might be pathological, because it will require smaller step sizes for individual-level parameters that are close to the boundary and larger step sizes for parameters that are far from it. Would you expect this to be a general issue in models that apply such a transformation to parameters?

An even more general question, which I wasn’t able to find the answer to: what is the general interpretation of the acceptance rate for divergent transitions (or the starting points thereof…) being very low, and much lower than the acceptance rate for non-divergent transitions?

Finally, what happens when certain combinations of parameters produce an undefined likelihood, as when the rate parameter of a Bernoulli likelihood becomes 1? These samples are rejected, I assume, but can they cause divergences as well?

Thanks a lot!
Itzik

The lp__ variable you see in the summaries is actually defined on the unconstrained scale – it’s the log of the unconstrained joint probability density, possibly modulo constant terms.

Yes, the constraining transformations do affect the unconstrained target density and hence the gradients which guide the HMC transitions. They are needed to map the unconstrained states back to the constrained states in order to evaluate the constrained target density, and then their log Jacobian determinants are needed to ensure a proper unconstrained density function.
Interestingly, the log determinants usually serve to regularize the unconstrained target density.
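For concreteness, here is a rough sketch of what the built-in real<lower=0, upper=1> constraint amounts to, written out by hand (this is just an illustration, not anything from your model, and the beta(2, 2) prior is only a placeholder): the unconstrained parameter is pushed through a logistic map and the log Jacobian determinant of that map is added to the target.

    parameters {
      real x;                                    // unconstrained parameter
    }
    transformed parameters {
      real<lower=0, upper=1> p = inv_logit(x);   // constraining map
    }
    model {
      // log Jacobian of the logistic map, d p / d x = p * (1 - p)
      target += log(p) + log1m(p);
      // placeholder prior, stated on the constrained scale
      p ~ beta(2, 2);
    }

With the built-in constraint Stan performs both steps automatically; writing them out just makes explicit what ends up shaping the unconstrained target density.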

In any case the unconstrained target density will be influenced by the constraining map, and hence so too will the exploration on the unconstrained space. The reason to look at the unconstrained space is that it “stretches out” everything going on at the boundaries, making it easier to resolve where the numerical Hamiltonian trajectories are breaking down.

No.

Firstly, the typically discussed problem is the interaction, in the prior density, between the individual-level parameters, say \theta_{n} \sim \text{Normal}(\mu, \tau), and the population scale \tau. Low values of \tau in and of themselves are not bad, but the concentration of the \theta_{n} that they cause induces a funnel-shaped density, which is.

If the likelihood is narrow for all of the \theta_{n} then it will dominate the posterior density and these interactions will be suppressed. In a centered parameterization this leads to a nice posterior density, but in a non-centered parameterization the strong likelihood ends up coupling the non-centered \theta_{n} equivalents and \tau, resulting in a nasty posterior density. But keep in mind that in general this has to be considered on a \theta_{n}-by-\theta_{n} basis – if the data concentration varies from individual to individual then you might have to consider parameterizing each separately.
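For reference, a schematic of the two parameterizations of that prior structure (the normal(0, 5) priors on \mu and \tau are placeholders, and the individual-level likelihood is elided):

    // Centered parameterization: theta and tau interact directly in the prior
    data {
      int<lower=1> N;
    }
    parameters {
      real mu;
      real<lower=0> tau;
      vector[N] theta;
    }
    model {
      mu ~ normal(0, 5);
      tau ~ normal(0, 5);        // half-normal via the lower bound
      theta ~ normal(mu, tau);
      // ... likelihood in terms of theta ...
    }

    // Non-centered parameterization: theta_tilde has a fixed prior and the
    // mu-tau dependence moves into the likelihood through theta
    data {
      int<lower=1> N;
    }
    parameters {
      real mu;
      real<lower=0> tau;
      vector[N] theta_tilde;
    }
    transformed parameters {
      vector[N] theta = mu + tau * theta_tilde;
    }
    model {
      mu ~ normal(0, 5);
      tau ~ normal(0, 5);
      theta_tilde ~ normal(0, 1);
      // ... likelihood in terms of theta ...
    }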

Secondly, the probit transformation is part of the likelihood and won’t influence those prior interactions. If you have lots and lots of data all at 0 or 1, so that the likelihood becomes singular at the boundary, then maybe there’s an issue, but that requires ungodly amounts of data.
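For example, if the probit enters as a link in the observation model (which is what I’m assuming your setup looks like), the relevant piece would be something like the following, where in your actual model theta would be the (possibly non-centered) hierarchical parameter from above:

    data {
      int<lower=1> N;
      array[N] int<lower=0, upper=1> y;
    }
    parameters {
      vector[N] theta;   // individual-level parameters on the unconstrained scale
    }
    model {
      // probit link: the squashing towards 0 and 1 happens in the likelihood,
      // not in the hierarchical prior on theta
      for (n in 1:N) {
        y[n] ~ bernoulli(Phi(theta[n]));
      }
    }

The Phi here only reshapes how the data inform each \theta_{n}; the prior interactions between the \theta_{n} and \tau are untouched.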

Oh, and the individual parameter-population scale interaction is just one possible degeneracy. There’s also a degeneracy between \mu and \tau that often bites people because no one tells them to look for it.

That is to be expected for the moment. The current “acceptance statistic” that is reported is the average of the Metropolis acceptance probabilities you would get if you hypothetically tried to propose each state in a numerical trajectory from the initial state. If a sample is labeled “divergent”, that means the trajectory from which it was drawn diverged at one end. The states at that end would have low hypothetical acceptance probabilities, weighting down the average quoted in the “acceptance statistic”.
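Schematically, for a numerical trajectory with states z_{1}, \ldots, z_{L} grown from the initial state z_{0}, the reported statistic is roughly

\frac{1}{L} \sum_{l=1}^{L} \min\left(1, \exp\left( H(z_{0}) - H(z_{l}) \right) \right),

where H is the Hamiltonian (the exact weighting depends on implementation details). A divergence means H(z_{l}) blows up at one end of the trajectory, so those terms are essentially zero and drag the average down.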

Honestly there’s not much to be gained by looking at correlations of things with accept_stat.

Depends on the circumstances. The Bernoulli mass function with p = 1 is well-defined if all of the data are y = 1, but even one y = 0 yields a zero density and a negative infinite log density. The same holds for p = 0 with the roles of y = 0 and y = 1 switched.
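Concretely, \text{Bernoulli}(y \mid p) = p^{y} (1 - p)^{1 - y}, so at p = 1 an observation y = 1 contributes probability 1 while an observation y = 0 contributes probability 0, i.e. a log probability of -\infty.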

That said, the way constraints are implemented you won’t get p = 0 or p = 1 exactly, short of overflow, if you use real<lower=0, upper=1> p. The same goes if you use your own constraining transform, such as the logistic (which is what is implicitly used in the constraint) or the probit. Even if the value overflows, however, the gradient might not, in which case the numerical trajectories will all behave fine without any divergences.
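In other words, under the hood real<lower=0, upper=1> p amounts to p = \text{logit}^{-1}(x) = 1 / (1 + \exp(-x)) for an unconstrained parameter x, which is strictly between 0 and 1 for every finite x; p can hit exactly 0 or 1 only once x is extreme enough for the floating-point arithmetic to round.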
