Consistency of inverse metric across chains

I’m fitting a pretty large model (~40K parameters), and I’ve noticed that the inv_metrics of different chains (within a given run) are completely uncorrelated. When I checked this on simple example models (non-centered eight-schools, etc), I got correlations of around 0.9 between different chains (cor(fit$inv_metric(matrix=F)[[1]], fit$inv_metric(matrix=F)[[2]])).

This also happened on runs where other convergence criteria (ESS, Rhat) were very good and showed no sign of convergence issues. Tree depth of 10 is often saturated for this model, but not always.

Is this expected in such a large model? Does this necessarily indicate the warmup “hasn’t converged”?


I had something similar happen when fitting a Gaussian process model with a few thousand parameters, but in my case it did have an impact on the convergence of the different chains (I believe it was the main factor, though there could have been others).

To that my answer would be “possibly”. I’m not familiar with the outputs of rstan, but I’m assuming that is not a trace of metrics, only a single matrix (the one fixed for the sampling period), and that the correlation is computed among the matrix entries, not for the same matrix entry across several warmup iterations, which might give a better result thanks to the larger sample size. However, it is not obvious to me that there should be some expected, high correlation between chain metrics, or that it would hold when scaling up to large models. I would have to think about what metric correlation between chains means, how best to calculate it, and how it is expected to scale.

As far as I understand, computing the metric (or mass matrix) is a very empirical process: essentially, take the empirical covariance of the parameter vector over (some part of) the warmup iterations. It is therefore sensitive to what happens during warmup, and it is possible that it “doesn’t converge” in the sense that different chains may be in different regions at that point and end up with different mass matrices. However, it is also possible that, though different, they are still “good enough” to explore the space: the chains would just explore the parameter space more or less efficiently (maybe you are seeing the effect on run times, but it’s also hard to say what makes a metric good or bad in terms of how it should impact performance). Step sizes also differ between chains, yet may still work; you could ask the same question about how similar those should be across chains.
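
To illustrate the idea, here is a rough sketch in R with fake draws (purely illustrative; Stan’s actual adaptation uses windowed, regularized estimates rather than this naive version): the diagonal inverse metric is essentially the per-parameter empirical variance over the warmup draws.

set.seed(1)
# Fake "warmup" draws for 40 parameters living on two different scales
true_sd <- rep(c(1, 3), each = 20)
warmup_draws <- sapply(true_sd, function(s) rnorm(500, sd = s))
# Naive diagonal inverse metric: one empirical variance per parameter
inv_metric_hat <- apply(warmup_draws, 2, var)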

For my problem, I considered using other criteria (for instance, average posterior probability of the chain, run time, etc.) to decide which mass matrix was good and then pass that as a fixed metric to all chains. I guess that could have worked, but in practice I was able to break the problem down into several smaller problems and run them separately. My guess is that in your case you are trying to fix what’s not broken, but maybe others will have other ideas or opinions.

Isn’t the ideal metric just the covariance matrix computed from samples from the posterior? That’s what I’ve been using, and most of the time it works. So I believe you would expect the matrices to converge across chains, up to sampling error.


Not in general. For a diagonal mass matrix you could get a lack of correlation if all (edit: many/most) of the true marginal posterior variances are quite similar, so that the variation across parameters within a single chain is dominated by estimation error. One way you could get this is if you have a weakly informed random effect in the model, as below.

As above, not necessarily. The way that this might not be the case is if the expected estimation error in the marginal posterior variances is high compared to the actual variation in the true posterior variances across parameters, i.e. if all of the parameters are on the same scale in the posterior.

Edit: here’s an oversimplified example

parameters{
  vector[100] mu;
}
model{
  mu ~ std_normal();
}
library(cmdstanr)
# 100 iid standard-normal parameters: every true marginal posterior variance is 1,
# so variation in the adapted (diagonal) inverse metric is pure estimation noise
n_mod <- cmdstan_model("...normal100.stan")
n_fit <- n_mod$sample()
# Diagonal inverse-metric elements of chain 1 plotted against chain 2: essentially uncorrelated
plot(diag(n_fit$inv_metric()[[1]]) ~ diag(n_fit$inv_metric()[[2]]))

Well, yes, but requiring the posterior distribution to compute a metric to be able to sample from the posterior defeats the purpose of sampling, right?

To sum up what I was trying to get at, the issue is: given that the mass matrix has to be approximated during warmup together with all the other adaptations, how good is this empirical estimate of the metric, and how well should it correlate across chains (especially when the parameter space is large)?


Oh, right, between-chain correlation is just not the right metric to assess whether the metrics have converged towards the same “true” metric.

@Adam_Haber I guess a better “metric” (to check warmup convergence) would be to check that all metrics (covariance matrices) converge towards the metric computed from the pooled draws across chains.
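
Something like the following rough sketch, assuming a cmdstanr fit object called fit with a diagonal metric and only unconstrained parameters (so the metric entries line up with the output columns; for constrained parameters the metric lives on the unconstrained scale):

draws <- fit$draws(format = "draws_matrix")
draws <- draws[, colnames(draws) != "lp__"]  # drop lp__, keep the model parameters
pooled_var <- apply(draws, 2, var)           # marginal variances from the pooled post-warmup draws
for (m in fit$inv_metric(matrix = FALSE)) {
  print(cor(m, pooled_var))                  # each chain's metric vs the pooled estimate
}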


Terms like “warmup” and “converge” are often abused, so we have to be careful about how we use them. For the subtleties of warmup see Markov Chain Monte Carlo in Practice, and for the many different meanings of convergence see Markov Chain Monte Carlo in Practice.

The majority of Stan’s warmup routine is dedicated to iteratively adapting the Hamiltonian Markov transition to the target distribution. The inverse metric elements are set to empirical estimators of the parameter variances, and the variation of these estimators depends on how quickly the early stages of the sampler are able to explore the target distribution. The slower the exploration the smaller the effective sample size, and the larger the standard error for variance estimates derived from the realized Markov chains.
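
As a rough illustration (treating the effective sample size as if it were an iid sample size, and assuming approximately normal marginals), the standard error of a variance estimate shrinks like sqrt(2 / ESS):

set.seed(2)
var_hat <- function(n_eff) var(rnorm(n_eff))  # variance estimate from n_eff iid draws
sd(replicate(2000, var_hat(50)))              # about sqrt(2 / 50)   ~ 0.20: very noisy
sd(replicate(2000, var_hat(1000)))            # about sqrt(2 / 1000) ~ 0.045: much tighter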

Weak correlations in these variance estimators suggest that they’re dominated by noise, which in turn suggests poor exploration. This is supported by the treedepth warnings, which imply that the dynamic Hamiltonian Monte Carlo sampler has more to explore at each iteration but is being prematurely truncated, which should result in larger autocorrelations and smaller effective sample sizes.

That said, all of this assumes a Markov chain Monte Carlo central limit theorem, which may or may not be valid for the main sampling phase, let alone early in warmup. For example, if the target distribution is heavy-tailed (Cauchy-like) then the variance might not be well-defined, in which case the estimators don’t actually converge to anything.
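
As a toy illustration of that last point (unrelated to any particular model), the running variance of Cauchy draws never settles down:

set.seed(3)
x <- rcauchy(1e5)
sapply(c(1e2, 1e3, 1e4, 1e5), function(n) var(x[1:n]))  # keeps jumping instead of converging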

A final warning: the correlation coefficient being evaluated here should be interpreted with caution. The Pearson correlation coefficient (which cor computes by default) assumes independent observations, and the variance estimators for the different parameters are definitely not independent.
