Can I join this party?
If you’re talking about the paper, can you send me an email to discuss? Thanks.
Are you also evaluating adaptive transformations?
What do you mean by “adaptive transformations”?
Just transformations with parameters which are learned during or prior to sampling from the posterior, as e.g. here. “Automatic” centered/partially centered/non-centered parametrizations are one example.
We haven’t been considering that case. It would certainly be another level of pain in evaluation as now we have the degree of freedom of how fast the adaptation works and how reliably. I’ve been very intrigued by Matt’s more recent work on MEADS that uses ensembles to tune cross-chain. That would provide a workable framework for tuning adaptations like this.
P.S. Thanks for pointing Maria and Matt’s paper out again. I was under the misapprehension they mainly worked on discrete parameters!
Sounds great!
I didn’t do anything fancy. I just used the transform from the marsaglia paper and the one from the wiki page I linked and inverted the transformation described on the latter. I should have some messy notes lying around about that inversion, but it’s pretty simple. I could try to dig them up, tidy and post them it if you think that’s helpful.
I’m not aware of any tries to find one. But I didn’t really look for it. I suggest trying.
A dense inverse metric (the object is a “mass matrix” only when the Hamiltonian system is a physical one) does captures correlations, but only global ones. Nonlinear “curvature” in the posterior density function, where the shape of the posterior density function varies across the model configuration space, cannot be accommodated by any static inverse metric.
Yes the uniform distribution should concentrate at larger radii with increasing dimension but that’s only in the prior model. The interaction with the likelihood function can include more complicated geometries beyond the possible overlap with the origin. For example if the likelihood function is localized in the angular directions then the resulting posterior density function can be “pinched” in the direction of smaller radii. This would cause increasing divergences with dimension.
This is the challenge with evaluating these kinds of constrained type implementations – what matters isn’t just the behavior of the uniform distribution within the constraint but also how that constraint might interact with “reasonable” likelihood functions. Defining “reasonable” and then evaluating is no trivial task.
We’re currently evaluating a bunch of constrained parameterizations and we’re looking for suggestions on densities to use for evaluation. For simplex parameterizations, we’re looking at symmetric and asymmetric Dirichlet with a range of concentration parameter values. We should probably also look at logistic normal, but that brings in its own parameterization (e.g., softmax of multivariate normal).
We need to come up with evaluations for the other constrained types like unit vectors, correlation and covariance matrices, etc.
I think the most useful evaluations will take into account the geometry of the constraints rather than the exact shape of the posterior density function per se. For example
Simplices: concentration at individual corners, concentration on edges, concentration in the bulk, uniform
Unit-vector: narrow and wide angular concentration, concentration near or away from any implementation artifacts
Covariance matrices: concentration near singular boundary of the positive-semi-definite cone, concentration near the positive-semi-definite cone vertex, concentration around the the unit matrix.
Thanks, @betanalpha—that’s in line with what we’ve been considering in looking for density targets that probe the constrained geometry. For instance, the Dirichlet with parameter << 1 concentrates mass at boundaries, 1 is uniform, whereas parameters >> 1 concentrate on the uniform simplex (at least for symmetric cases). What we’re less sure how to do is introduce things like correlation into the tests, which we don’t get with Dirichlet alone. Adding logistic normal with correlation seems like one way to go for simplexes.
We’re also a little unsure about how best to test behavior in the tail, body, and near a head (mode/pole) of a density. We can always look at concavity and the condition number if concave. For tails, we can measure how long it takes HMC to get from the tail to the bulk of the distribution. For the body, we can measure mixing time. For the head, we could measure sampling time back to the body. Or we could measure a Laplace approximation’s goodness of fit somehow, but that seems to indirect.
I’m not sure what you mean by correlations here. In general the Dirichlet has not one parameter but N; tuning each alpha_{n} independently allows for a variety of distributional behaviors on the simplex. For example one \alpha_{n} \gg \alpha_{\setminus n} results in concentration at a corner. Two large elements results in concentration along an edge, etc. By symmetry you could look any one element dominating, then any two, any three, etc.
I don’t think this is the right way to think about the evaluation. The chosen target density functions will define a typical set which intersects with the Jacobian determinant and un-constraining function. Different target density functions will probe different parts of the Jacobian determinant and unconstraining function.
Ultimately what will matter for performance is how this Jacobian determinant and unconstraining function interact with the target density function across this intersection which is far too complicated of a question to have a compact answer. But by constructing “adversarial” target density functions you can probe potentially problematic regions. For example investigating what happens when the unconstrained density function concentrates in regions where the unconstraining function or Jacobian determinant , or their determinants, are large or oscillate between extremely large and extremely small values. If you can come up with an adversarial density function the doesn’t look too pathological but breaks the sampler then you start to build up an understanding of the vulnerabilities of the given transformation.
The Dirichlet is very weak representationally in that there’s no way to model covariance. In contrast, the logistic multivariate normal does. That is, if
Y \sim \textrm{multi-normal}(\mu, \Sigma),
then
\textrm{softmax}(Y) \sim \textrm{logistic-multi-normal}(\mu, \Sigma).
With the logistic multi-normal, you can model correlation among the elements of a simplex.
This is where we started.
This is exactly what we’re trying to figure out how to do. We did the ones for the Dirichlet as you suggested, but it’s still a very simple distribution without much correlation (other than the obvious negative correlation among simplex elements). When thinking about extending to logistic-multi-normal, I worry that parameterizations of that density and its variate transform are also in play.
I was referring specifically to the tuning needs of the unit-vector parameterisation (the one from the other thread linked at the beginning of this thread). Of course other parts of the model (for example the likelihood of the unit vector) might have their own requirements. For a narrow likelihood (which is where tuning could be needed for that parameterisation) this parameterisation should only induce the kinds of shapes which a dense inverse metric allows stan to sample efficiently, as far as I can see.
But the likelihood is orthogonal to the radius. Doesn’t that mean that it can not have an influence on its distribution?
Here’s a pathological example
data {
int<lower=0> N_over_two;
}
parameters {
unit_vector[N_over_two * 2] x;
}
model {
x[1 :N_over_two] ~ normal(0, 0.1);
x[N_over_two : 2 * N_over_two] ~ normal(1, 0.1);
}
As N_over_two
increases the divergences go away. Set N_over_two = 1
and you should nearly always get some divergences. Adding the scaling parameter suggested by @Raoul-Kima fixes this issue as it pushes enough mass away from y = 0.
library(cmdstanr)
mod <- cmdstan_model("test.stan")
mod_out <- mod$sample(
data = list(N_over_two = 1),
parallel_chains = 4
)
For models that are uniform over the unit sphere I would agree, but this discussion has been about the unit_vector
type for which uniform models are just one possible application. My hesitations here are limited to the burden of a proper type in the Stan language and should no way be interpreted as critiques for specific models!
This is the cartoon on which my intuition is based.
As the number of dimension increases the outer arc becomes exponentially wider than the inner arc so that even though the vertex of the likelihood wedge is cut off the posterior can be still be “pinched” enough to be problematic.
When I wrote this …
… I wasn’t writing about sampler adaptation (which is a form of tuning as well), but about tuning the width of the radius distribution (which is a part of that parameterisation).
I was under the impression, that it is only strong “single-dimension pinching” that causes problems, as a strong combined pinching over multiple dimensions (where each dimension by itself only has weak “pinching”) doesn’t come with strong “radial sharpening” of density gradients. 2D-Analogies can’t explain that. I actually predicted that the parameterisation would to better with higher dimension, as due to concentration of measure it would concentrate more narrowly around a specific radius. Some day I’ll run tests on that to see what really happens in practice.
Funnel geometry does get worse in higher-dimensions. Sometimes the diagnostics look cleaner, but only because it’s actually harder for the Markov chains to even dip into the neck to experience the problematic geometry. See for example the experiments in Section 4.1 of Hierarchical Modeling.