I am interested in an inference problem where the posterior distribution is over positive semi-definite matrices whose trace is equal to 1. I have not begun working on this, but before I do, I would be interested to hear opinions from experienced users about how best to implement this, i.e., what to put in the parameters section of the Stan program. I understand there are constrained native types for Cholesky factors, correlation matrices, and covariance matrices.

I once wrote a pair of inverse functions that transform between \mathbb{R}^{d^2-1} and the set of d \times d unit-trace positive semi-definite operators, with the constraint handled by stick breaking (see here).
Would implementing this in Stan be a useful approach, or is there a simpler way? Here is a blurb copied from the docstring of vec_to_state():

The unit trace condition is met by using a stick-breaking-like trick, where
a given value of x doesn't directly store a Cholesky factor matrix element,
but rather, it stores (after being shrunk) the fraction of the remaining
budget of the trace until 1. Note that the trace of a positive operator
is the squared Hilbert-Schmidt norm of its Cholesky factor, so this is a
2-norm version of stick breaking.
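
In case it helps, here is a rough Stan sketch of that transform for the real, symmetric case. It is an illustration only: the function name is made up, off-diagonal signs are ignored (so it only covers nonnegative Cholesky entries), and the Jacobian adjustment is omitted.

functions {
  // Sketch: map an unconstrained vector y of length m - 1, with
  // m = d * (d + 1) / 2, to a lower-triangular L whose squared
  // Frobenius norm is 1, so rho = L * L' is positive semi-definite
  // with trace(rho) = 1.
  matrix vec_to_state_sketch(vector y, int d) {
    int m = (d * (d + 1)) %/% 2;
    matrix[d, d] L = rep_matrix(0, d, d);
    real budget = 1;  // remaining trace budget
    int pos = 1;
    for (i in 1:d) {
      for (j in 1:i) {
        if (pos < m) {
          // inv_logit shrinks y[pos] into (0, 1): the fraction of the
          // remaining budget consumed by this entry's square
          real frac = inv_logit(y[pos]);
          L[i, j] = sqrt(budget * frac);
          budget *= 1 - frac;
          pos += 1;
        } else {
          L[i, j] = sqrt(budget);  // last entry absorbs what's left
        }
      }
    }
    return L;
  }
}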

That's exactly what we do with correlation matrices under the hood, by the way. The full definition with the Jacobian is in the manual, in the chapter on constrained parameter transforms.

Cool. It seems the main difference is that the Cholesky stick breaking operates per row, while the trace-constraining stick breaking operates on all entries together.

While reading through, I found a minor typo in the subsection "Correlation Matrix Inverse Transform": it says that tanh maps into (0,1), rather than (-1,1).

I have another fundamental question about the same problem. My prior distribution is based on a Ginibre ensemble, parameterized by K and D as follows:

data {
  int<lower=1> D; // Dimension
  int<lower=1> K; // Number of Ginibre subsystems
}
parameters {
  matrix[D, K] X;
}
transformed parameters {
  matrix[D, D] rho;
  rho = X * X';
  rho /= trace(rho);
}
model {
  // Ginibre DxK prior
  for (idx_row in 1:D) {
    X[idx_row] ~ normal(0, 1);
  }
}

This is very convenient to implement: just iid normals on X. However, say D = K; then there are more parameters than necessary. X requires D*D real numbers, but rho is symmetric with trace 1 and thus requires only D*(D+1)/2 - 1 parameters. My guess is that this is a bad thing to do because, depending on the likelihood added to the model, the posterior might become multimodal. Is this correct? Is it better to use a one-to-one transform and deal with the Jacobian, or will I be okay with this more convenient-to-implement parameterization?

Thanks again.

(Note: I am still interested in the complex-valued problem, where X has real and imaginary components sampled iid from N(0,1), but I have stated it as above to be cleaner. James' trick does not work as cleanly in the complex case, and I think I would need to write the transform and its Jacobian from scratch, figure out the prior on rho, etc.)
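
That said, the convenient parameterization above does extend to the complex case by splitting X into real and imaginary parts. A rough, untested sketch (the names X_re, X_im, rho_re, rho_im are just illustrative), using rho = X X^dagger / trace(X X^dagger) with X = X_re + i X_im:

data {
  int<lower=1> D;
  int<lower=1> K;
}
parameters {
  matrix[D, K] X_re; // real part of X
  matrix[D, K] X_im; // imaginary part of X
}
transformed parameters {
  // For X = X_re + i X_im, the Hermitian matrix rho = X * X^dagger has
  // real part X_re * X_re' + X_im * X_im' and
  // imaginary part X_im * X_re' - X_re * X_im'.
  matrix[D, D] rho_re = X_re * X_re' + X_im * X_im';
  matrix[D, D] rho_im = X_im * X_re' - X_re * X_im';
  real tr = trace(rho_re); // trace(rho) is real and positive
  rho_re /= tr;
  rho_im /= tr;
}
model {
  // iid standard normals on real and imaginary components
  for (idx_row in 1:D) {
    X_re[idx_row] ~ normal(0, 1);
    X_im[idx_row] ~ normal(0, 1);
  }
}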

I think this should be OK. It's what we do for generating points on a sphere: there are only N-1 degrees of freedom for a sphere embedded in N dimensions, but we use N parameters, give each a normal distribution, then project to the sphere. That yields a uniform distribution on the points of the sphere at the cost of some unit normals. The algorithm comes from Marsaglia (1972).
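
In Stan, that construction is just a few lines. A minimal sketch, where N is an assumed data dimension (Stan's built-in unit_vector type packages essentially the same idea):

data {
  int<lower=1> N; // embedding dimension
}
parameters {
  vector[N] y_raw;
}
transformed parameters {
  // projecting iid normals onto the unit sphere gives a uniform draw
  vector[N] y = y_raw / sqrt(dot_self(y_raw));
}
model {
  y_raw ~ normal(0, 1);
}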

By the way, to_vector(X) ~ normal(0, 1) should be more efficient than the loop, since we can share a lot of computation across the rows. Because Eigen stores matrices column-major, the copies have to happen anyway to pull the rows out, so for big matrices it's even more of a win.
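
Concretely, the model block above would become:

model {
  // one vectorized statement over all D * K entries of X
  to_vector(X) ~ normal(0, 1);
}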

Yes, this can be a problem with things like matrix products, because (-x)^2 = x^2: here, X and -X map to the same rho = X * X' / trace(X * X').

In the sampling-on-a-sphere case, there's only a single ray from the origin through each point on the sphere, so nothing gets multimodal.

Without a way to control the multimodality in X, convergence monitoring on X in the posterior can be problematic. On the other hand, it might not matter, in that rho won't wind up being multimodal.

Also, I'm not sure what distribution you get from X * X' given a normal distribution on the X components. What was your intended distribution on rho? The best thing to do would be to check whether that's what you get with this coding.
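
One simple way to check: the program as posted has no likelihood, so its draws of rho already come from the prior. Appending a generated quantities block like the following sketch (the name rho_spectrum is illustrative) summarizes the induced distribution for comparison against the intended one:

generated quantities {
  // eigenvalues of rho summarize the distribution induced by the
  // normal prior on X and the trace normalization
  vector[D] rho_spectrum = eigenvalues_sym(rho);
}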

I do not want to hijack this thread, but I have been looking for clarification on a point here. Continuing the above example: if we only care about rho, and X is for some reason of no interest, do we need to worry about convergence monitoring on X? In other words, if we get Rhat = 1 for rho, can we safely ignore Rhat > 1 for X?

Note: if more details are needed about my specific case, I can start a new thread to avoid derailing this one further.
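
The concrete example being referred to isn't reproduced in this thread; here is a minimal sketch of the kind of model in question, assuming alpha is defined as the square of an unconstrained alpha_raw (the data y and the normal likelihood are placeholder assumptions):

data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real alpha_raw; // unconstrained; alpha_raw and -alpha_raw give the same alpha
}
transformed parameters {
  real<lower=0> alpha = square(alpha_raw);
}
model {
  y ~ normal(alpha, 1);
  // implicit flat prior on alpha_raw; when the posterior for alpha
  // concentrates away from 0, alpha_raw has modes near +/- sqrt(alpha)
}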

If the posterior for alpha concentrates away from 0, as in this example, the posterior on alpha_raw will be multimodal. This should be OK for alpha in this simple case. Here's what the output looks like for 4 chains and default settings.

By inspection, the 95% intervals look good; they should be about +/- 1.96 sd, and they're (0.1, 0.3).

You can see from Rhat and n_eff that theta didn't mix: two or more chains got stuck in different modes.

The way our adaptation works within a single chain actually mitigates the most problematic aspect of transforming a multimodal prior: if alpha is heavily constrained, alpha_raw is unlikely to try to switch modes in the posterior during warmup, so adaptation will be OK. If a chain did explore both modes, we'd run into a problem with adaptation trying to account for them both, which wouldn't be nearly as efficient.

Thanks for the very helpful info and the example; I will give it some more thought. The posterior on rho should be unimodal, and all of my tests are looking very much like your example in terms of Rhat and n_eff.

Glad this is of broader interest to some; this is exactly what I am wondering too.