Constraining the trace of a covariance matrix

I am interested in an inference problem where the posterior distribution is over positive semi-definite matrices whose trace is equal to 1. I have not begun working on this yet, but before I do, I would like to hear opinions from experienced users about how best to implement it, i.e., what to put in the parameters block of the Stan program. I understand there are constrained native types for Cholesky factors, correlation matrices, and covariance matrices.

I once wrote a pair of inverse functions that transform between \mathbb{R}^{d^2-1} and the set of d\times d unit-trace positive semi-definite operators, with the constraint enforced by a stick-breaking construction (see here).
Would implementing this in Stan be a useful approach, or is there a simpler way? Here is a blurb copied from the docstring of vec_to_state():

The unit trace condition is met by using a stick-breaking like trick, where
a given value of x doesn’t directly store a cholesky factor matrix element,
but rather, it stores (after being shrunk) the fraction of the remaining
budget of the trace until 1. Note that the trace of a positive operator
is the square Hilbert-Schmidt norm of its cholesky factor, so this is a
2-norm version of stick breaking.
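To give a flavour of the idea, here is a rough sketch in Stan. This is not the actual vec_to_state() code: it is restricted to the real-symmetric case (so d(d+1)/2 - 1 inputs rather than d^2 - 1), forces the entries to be nonnegative, and omits the Jacobian adjustment a Stan implementation would need.

functions {
  // Hypothetical sketch only: build a lower-triangular factor L whose entries
  // square-sum to 1, so that L * L' is positive semi-definite with unit trace.
  // x has d * (d + 1) / 2 - 1 entries; signs/phases and the Jacobian are omitted.
  matrix unit_trace_chol(vector x, int d) {
    matrix[d, d] L = rep_matrix(0, d, d);
    real remaining = 1;                      // trace budget left to allocate
    int pos = 1;
    for (i in 1:d) {
      for (j in 1:i) {
        if (i == d && j == d) {
          L[i, j] = sqrt(remaining);         // last entry absorbs what is left
        } else {
          real frac = inv_logit(x[pos]);     // fraction of the remaining budget
          L[i, j] = sqrt(frac * remaining);
          remaining -= frac * remaining;
          pos += 1;
        }
      }
    }
    return L;
  }
}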

Thanks for any help.

Could you just define the square of the scale vector to be a simplex, give it a Dirichlet prior, and proceed as normal?

Something like

data {
  int<lower=1> P;
}
transformed data {
  vector[P] alpha = rep_vector(1.0 / P, P);  // symmetric Dirichlet concentration
}
parameters {
  simplex[P] sqr_scale;     // squared scales; simplex, so they sum to 1
  corr_matrix[P] Omega;
}
transformed parameters {
  // trace(Sigma) = sum(sqr_scale) = 1, since Omega has a unit diagonal
  matrix[P, P] Sigma = quad_form_diag(Omega, sqrt(sqr_scale));
}
model {
  target += dirichlet_lpdf(sqr_scale | alpha);
  target += lkj_corr_lpdf(Omega | 3);
}

Then, in R:

library(rstan)

compiled_model <- stan_model("trace_sums_to_1.stan")

fit <- sampling(compiled_model, data = list(P = 5), chains = 1)

# check the trace of one posterior draw of Sigma
sum(diag(matrix(unlist(as.data.frame(fit, pars = "Sigma")[1000, ]), 5, 5)))

> 1

This puts no weight on scales of 1 or 0, which might not be desirable.

Cheers,
Jim


Thanks, this is exactly what I am looking for! (though I will need to muck around a bit to simulate complex matrices).

That’s exactly what we do with correlation matrices under the hood, by the way. The full definition with Jacobian is in the manual in the chapter on constrained parameter transforms.

Neat!

That’s exactly what we do with correlation matrices under the hood, by the way.

Cool. It seems the main difference is that the Cholesky stick-breaking operates per row, while the trace-constraining stick-breaking operates on all entries together.

While reading through, I found a minor typo in the subsection “Correlation Matrix Inverse Transform”: it says that tanh maps into (0,1) rather than (-1,1).

Nice, Bob keeps an issue open for changes like this: next manual, 2.18(++) · Issue #15 · stan-dev/docs · GitHub, so go ahead and post it there!


You’re right. I’ll fix it for the next release.

Thanks for following up with an issue, @ihincks—that way, it won’t get lost.

I have another fundamental question about the same problem. My prior distribution is based on a Ginibre ensemble, parameterized by K and D as follows:

data {
  int<lower=1> D;    // Dimension
  int<lower=1> K;    // Number of Ginibre subsystems
}
parameters {
  matrix[D, K] X;
}
transformed parameters {
  matrix[D, D] rho;
  rho = X * X';
  rho /= trace(rho);
}
model {
  // Ginibre DxK prior
  for (idx_row in 1:D) {
    X[idx_row] ~ normal(0, 1);
  }
}

This is very convenient to implement, just iid normals on X. However, say D=K; then there are more parameters than necessary: X requires D*D real numbers, but rho is symmetric with trace 1 and thus requires only D*(D+1)/2 - 1 parameters. My guess is that this is a bad thing to do because, depending on the likelihood added to the model, the posterior might become multimodal. Is this correct? Is it better to use a one-to-one transform and deal with the Jacobian, or will I be okay with this more convenient-to-implement parameterization?

Thanks again.

(Note, I am still interested in the complex-valued problem where X has real and imaginary components sampled iid from N(0,1), but I have stated it as above to be cleaner. James's trick does not work as cleanly in the complex case, and I think I would need to write the transform and its Jacobian from scratch, figure out the prior on rho, etc.)

I think this should be OK. It’s what we do for generating points on a sphere. There are only N-1 degrees of freedom for a sphere embedded in N dimensions, but we use N parameters, give each a normal distribution, then project to the sphere. That provides a uniform distribution on the points on the sphere at the cost of some unit normals. This algorithm comes from Marsaglia 1972.
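For reference, here is a minimal sketch of that construction in Stan. It is illustrative only, not the built-in unit_vector implementation, and the names are made up:

data {
  int<lower=2> N;
}
parameters {
  vector[N] y;                            // N unconstrained parameters
}
transformed parameters {
  vector[N] u = y / sqrt(dot_self(y));    // project onto the unit sphere
}
model {
  y ~ normal(0, 1);                       // rotational symmetry makes u uniform on the sphere
}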

By the way, to_vector(X) ~ normal(0, 1) should be more efficient than the loop since we can share a lot of computations across the rows. The copies have to happen anyway to pull the rows out as Eigen stores column major, so for big matrices, it’s even more of a win.
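Concretely, the model block of the program above could be written along these lines (same prior, just vectorized):

model {
  // Ginibre DxK prior, vectorized: same distribution as the row-wise loop
  to_vector(X) ~ normal(0, 1);
}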

Sorry, I missed that the first go-round.

Yes, this can be a problem with things like matrix multiplication because (-x)^2 = x^2.

In the sampling on a sphere case, there’s only a single line from the origin that results in the same point on a sphere, so nothing gets multimodal.

Without a way to control the multimodality of the parameters X, monitoring X in the posterior can be problematic. On the other hand, it might not matter, in that rho won’t wind up being multimodal.

Also, I’m not sure what distribution you get from X * X' given a normal distribution on the X components. What was your intended distribution on rho? The best thing to do would be to check whether that’s what you get with this coding.

I do not want to hijack this thread, but I have been looking for a clarification on this point. Continuing the above example, if we only care about rho, and X is for some reason of no interest, do we need to worry about convergence monitoring for X? In other words, if we get Rhat=1 for rho, can we safely ignore Rhat>1 for X?

Note: if more details are needed about my specific case, I can start a new thread to avoid derailing this one further.

The answer is complicated, as it depends on the properties of particular transforms.

In some cases, no, we don’t need to worry about it. For example, you can get away with this to generate a uniform prior in (0, 1) on alpha:

parameters {
  real<lower = -1, upper = 1> alpha_raw;
}
transformed parameters {
  real<lower = 0, upper = 1> alpha = fabs(alpha_raw);  // folds (-1, 1) onto (0, 1)
}
model {
  alpha ~ normal(0.2, 0.05);
}

If the posterior for alpha concentrates away from 0, as in this example, the posterior on alpha_raw will be multimodal. This should be OK for alpha in this simple case. Here’s what the output looks like for 4 chains and default settings:

           mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
alpha_raw  0.00    0.14 0.21 -0.28 -0.20  0.01  0.20  0.29     2 4.42
alpha      0.20    0.00 0.05  0.10  0.17  0.20  0.23  0.30  1779 1.00
lp__      -1.23    0.01 0.69 -3.18 -1.41 -0.96 -0.78 -0.73  2188 1.00

By inspection, the 95% interval for alpha looks good: it should be about the mean +/- 1.96 sd, i.e., 0.20 +/- 0.10, and it is (0.10, 0.30).

You can see from Rhat and n_eff that alpha_raw didn’t mix: two or more chains got stuck in different modes.

The way our adaptation works within a single chain actually mitigates the most problematic aspect of transforming a multimodal prior: if alpha is heavily constrained, a chain is unlikely to try to switch modes in alpha_raw during warmup, so adaptation will be OK. If a chain did explore both modes, adaptation would try to account for both of them, which would not be nearly as efficient.


Thanks for the very helpful info and the example; I will give it some more thought. The posterior on rho should be unimodal, and all of my tests are looking very much like your example in terms of Rhat and n_eff.

Glad this is of broader interest to some; it is exactly what I am wondering too.

I suspected there must be a better way…thanks!