I am interested in an inference problem where the posterior distribution is over positive semi-definite matrices whose trace is equal to 1. I have not begun working on this, but before I do, I would be interested to hear opinions from experienced users about how best to implement it, i.e., what to put in the parameters block of the Stan program. I understand there are constrained native types for Cholesky factors, correlation matrices, and covariance matrices.
I once wrote a pair of inverse functions that transform between \mathbb{R}^{d^2-1} and the set of d\times d unit-trace positive semi-definite operators, with the constraint handled by a stick-breaking construction (see here).
Would implementing this in Stan be a useful approach, or is there a simpler way? Here is a blurb copied from the docstring of vec_to_state():
The unit trace condition is met by using a stick-breaking like trick, where
a given value of x doesn’t directly store a cholesky factor matrix element,
but rather, it stores (after being shrunk) the fraction of the remaining
budget of the trace until 1. Note that the trace of a positive operator
is the square Hilbert-Schmidt norm of its cholesky factor, so this is a
2-norm version of stick breaking.
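Roughly, the bookkeeping in Stan-like code would be something like the sketch below (this is just an illustration of the budget idea, not the actual vec_to_state() implementation, the function name is mine, and it omits the Jacobian):

functions {
  // Each y[i] in (0,1) takes a fraction of the remaining trace budget,
  // so the returned squared Cholesky entries sum to exactly 1.
  vector squared_stick_break(vector y) {
    int n = rows(y);
    vector[n + 1] sq;
    real remaining = 1;
    for (i in 1:n) {
      sq[i] = y[i] * remaining;
      remaining -= sq[i];
    }
    sq[n + 1] = remaining;
    return sq;
  }
}

The Cholesky entries themselves would then be square roots of these (with signs handled separately for the off-diagonal entries), and the change-of-variables Jacobian still has to be accounted for.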
That’s exactly what we do with correlation matrices under the hood, by the way. The full definition with Jacobian is in the manual in the chapter on constrained parameter transforms.
Cool. It seems the main difference is that the Cholesky stick-breaking operates per row, while the trace-constraining stick-breaking operates on all entries together.
While reading through, I found a minor typo in the subsection “Correlation Matrix Inverse Transform”: it says that tanh maps into (0,1), when it should be (-1,1).
I have another fundamental question about the same problem. My prior distribution is based on a Ginibre ensemble, parameterized by K and D as follows:
data {
  int<lower=1> D; // Dimension
  int<lower=1> K; // Number of Ginibre subsystems
}
parameters {
  matrix[D, K] X;
}
transformed parameters {
  matrix[D, D] rho;
  rho = X * X';
  rho /= trace(rho);
}
model {
  // Ginibre DxK prior
  for (idx_row in 1:D) {
    X[idx_row] ~ normal(0, 1);
  }
}
This is very convenient to implement: just iid normals on X. However, say D = K; then there are more parameters than necessary: X requires D*D real numbers, but rho is symmetric with trace 1 and thus requires only D*(D+1)/2 - 1 parameters. My guess is that this is a bad thing to do because, depending on the likelihood added to the model, the posterior might become multimodal. Is this correct? Is it better to use a one-to-one transform and deal with the Jacobian, or will I be okay with this more convenient-to-implement parameterization?
Thanks again.
(Note, I am still interested in the complex-valued problem where X has real and imaginary components sampled iid from N(0,1), but I have stated it as above to be cleaner. James’ trick does not work as cleanly in the complex case, and I think I would need to write the transform and its Jacobian from scratch, figure out the prior on rho, etc.)
I think this should be OK. It’s what we do for generating points on a sphere. There are only N-1 degrees of freedom for a sphere embedded in N dimensions, but we use N parameters, give each a normal distribution, then project to the sphere. That provides a uniform distribution on the points on the sphere at the cost of some unit normals. This algorithm comes from Marsaglia 1972.
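A minimal Stan sketch of that construction, with illustrative names:

data {
  int<lower=1> N;
}
parameters {
  vector[N] y_raw;   // N parameters for an (N - 1)-dimensional sphere
}
transformed parameters {
  // normalize to unit length; with iid normals on y_raw this gives
  // a uniform distribution over the sphere (Marsaglia 1972)
  vector[N] y = y_raw / sqrt(dot_self(y_raw));
}
model {
  y_raw ~ normal(0, 1);
}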
By the way, to_vector(X) ~ normal(0, 1) should be more efficient than the loop, since we can share a lot of computation across the rows. The copies have to happen anyway to pull the rows out, because Eigen stores column-major, so for big matrices it’s even more of a win.
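That is, the model block above could be written as:

model {
  // Ginibre DxK prior, vectorized over all entries of X
  to_vector(X) ~ normal(0, 1);
}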
Yes, this can be a problem with things like matrix multiplication because (-x)^2 = x^2.
In the sampling-on-a-sphere case, there’s only a single ray from the origin that maps to a given point on the sphere, so nothing gets multimodal.
Without a way to control the multimodality of the parameters X, monitoring convergence for X in the posterior can be problematic. On the other hand, it might not matter, in that rho won’t wind up being multimodal.
Also, I’m not sure what distribution you get from X * X' given a normal distribution on the X components. What was your intended distribution on rho? The best thing to do would be to check whether that’s what you get with this coding.
I do not want to hijack this thread, but I have been looking for clarification on a point here. Continuing the above example, if we only care about rho, and X is for some reason of no interest, do we need to worry about the convergence diagnostics for X? In other words, if we get Rhat = 1 for rho, can we safely ignore Rhat > 1 for X?
Note: if more details are needed about my specific case, I can start a new thread to avoid derailing this one further.
If the posterior for alpha concentrates away from 0, as in this example, the posterior on alpha_raw will be multimodal. This should be OK for alpha in this simple case. Here’s what the output looks like for 4 chains and default settings:
By inspection, the 95% intervals look good—they should be about +/- 1.96 sd and they’re (0.1, 0.3).
You can see from Rhat and n_eff that theta didn’t mix—two or more chains got stuck in different modes.
The way our adaptation works within a single chain actually mitigates the most problematic aspect of transforming a multimodal prior. If alpha is heavily constrained, it’s likely that alpha_raw won’t try to switch modes in the posterior during warmup, so adaptation will be OK. If a chain did explore both modes, we’d run into a problem with adaptation trying to account for them both, which wouldn’t be nearly as efficient.
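For anyone reading along without the original example, here is a hypothetical sketch of the kind of model being described; the actual example isn’t reproduced in this thread, and the prior and likelihood below are assumptions:

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real alpha_raw;
}
transformed parameters {
  // (-x)^2 = x^2, so if alpha concentrates away from 0,
  // alpha_raw has two symmetric posterior modes
  real<lower=0> alpha = square(alpha_raw);
}
model {
  alpha_raw ~ normal(0, 1);  // assumed prior
  y ~ normal(alpha, 1);      // assumed likelihood
}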
Thanks for the very helpful info and the example; I will give it some more thought. The posterior on rho should be unimodal, and all of my tests look very much like your example in terms of Rhat and n_eff.
Glad this is of broader interest to some; this is exactly what I am wondering too.