Partial non-centered parametrizations in Stan

I am fitting a model that is something like (simplified here for the example)

x_{k,i} \sim N(\mu_k, \sigma), \qquad i = 1, \dots, N_k
\mu_k \sim N(\gamma, \delta)

with a vague N(0, 10) prior on \gamma and half-Cauchy priors on \delta and \sigma.

My problem is that I have about 10^5 groups (k = 1, \dots, K), and the number of observations in each group, N_k, ranges from 1-2 to a few hundred.

Stan is working OK (I was somewhat surprised that it handles this many groups, though I am using sufficient statistics by k). But as expected, if I set up a non-centered parametrization, mixing is not so good (ESS 30 out of 1000) for the k's where I have a lot of observations, and conversely, if I use a centered parametrization, the mixing for the groups with few observations suffers (and also that of the \mu_k).
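For reference, a sufficient-statistics version of the centered model might look roughly like this. This is only a sketch, not the original poster's code: `xbar`, `ss`, and `n_obs` are assumed names for the per-group means, residual sums of squares, and counts, and the half-Cauchy scales are made up.

```stan
data {
  int<lower=1> K;               // number of groups
  vector[K] xbar;               // per-group sample means
  vector<lower=0>[K] ss;        // per-group residual sums of squares
  int<lower=1> n_obs[K];        // per-group observation counts
}
parameters {
  real gamma;
  real<lower=0> delta;
  real<lower=0> sigma;
  vector[K] mu;                 // centered parameterization
}
model {
  gamma ~ normal(0, 10);
  delta ~ cauchy(0, 5);         // half-Cauchy via the lower bound
  sigma ~ cauchy(0, 5);
  mu ~ normal(gamma, delta);
  for (k in 1:K) {
    // the sample mean carries all the information about mu_k ...
    xbar[k] ~ normal(mu[k], sigma / sqrt(n_obs[k]));
    // ... and the residual sum of squares informs sigma
    target += -(n_obs[k] - 1) * log(sigma)
              - ss[k] / (2 * square(sigma));
  }
}
```

The point of the `target +=` term is that the full normal likelihood factors into the sampling distribution of the group mean plus a term in the residual sum of squares, so the log density is the same (up to a constant) as looping over all raw observations.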

I have been trying to read up on this and found that there are so-called partially non-centered parametrizations (e.g. Papaspiliopoulos et al. 2003), but my understanding is that they involve a tuning parameter. I understand the idea conceptually, but I wonder how to do this in Stan, or whether there is an alternative that works better with Stan.

My guess is that the closest analogue you could do in Stan would be a mixture of a centered and a non-centered parameterization, using a fixed weight in log_mix.


Thanks, but I am not sure I understand the reason for modeling it as a mixture. E.g., a form I have seen is

\mu_k = w \gamma + \tilde{\mu}_k
\tilde{\mu}_k \sim N((1-w) \gamma, \delta)

where \tilde{\mu}_k is now the parameter. This gives a non-centered parametrization (in the location) when w = 1 and a centered one when w = 0, but \mu_k is then deterministic given the parameter \tilde{\mu}_k.
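For what it's worth, that form can be written directly in Stan with a fixed w passed as data. This is only a sketch (the data are simplified to one observation per group, all names and prior scales are mine), and note that it interpolates only the location, not the scale:

```stan
data {
  int<lower=1> K;
  vector[K] y;                   // one observation per group, for illustration
  real<lower=0> sigma;
  real<lower=0, upper=1> w;      // fixed weight: w = 1 non-centered, w = 0 centered
}
parameters {
  real gamma;
  real<lower=0> delta;
  vector[K] mu_tilde;
}
transformed parameters {
  vector[K] mu;
  mu = w * gamma + mu_tilde;     // deterministic given mu_tilde
}
model {
  gamma ~ normal(0, 10);
  delta ~ cauchy(0, 5);
  mu_tilde ~ normal((1 - w) * gamma, delta);
  y ~ normal(mu, sigma);
}
```

Marginally this recovers \mu_k \sim N(\gamma, \delta) for any w, since w \gamma + (1-w) \gamma = \gamma; only the sampler's view of the geometry changes.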

Also, any good heuristics for choosing w for each group?

It is related to the ratio of the between-group variance to the measurement-error variance, both of which are unknown, so it is hard to say what to fix w at. Several years back, @betanalpha ran simulations in which w was tuned during warmup, and even that did not turn out particularly well.
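To make the dependence concrete (my own sketch of the standard conjugate-normal intuition, not something derived in this thread): conditional on the hyperparameters, the posterior mean of \mu_k shrinks the group mean toward \gamma with weight

\lambda_k = \frac{N_k / \sigma^2}{N_k / \sigma^2 + 1 / \delta^2}

so in the convention above (w = 1 non-centered) a plausible per-group heuristic would be w_k \approx 1 - \lambda_k: nearly centered for data-rich groups, nearly non-centered for data-poor ones. Since \sigma and \delta are unknown, this only suggests a direction, not a usable fixed value.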

You can’t tune w dynamically because there’s no easily differentiable criterion to inform the updates, let alone one that can be computed online. Ultimately you have to run multiple chains with different values of w and choose the one that performs best, which is a pain.

What ended up not performing well were the partial centerings themselves. I used the parameterization

data {
  int<lower=1> N;
  real<lower=0> sigma;
  vector[N] y;
}
transformed data {
  real w;
  w = 0;  // w = 0: fully non-centered; w = 1: fully centered
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[N] theta_tilde;
}
transformed parameters {
  vector[N] theta;
  real tau_tilde;
  tau_tilde = pow(tau, 1 - w);
  theta = (1 - w * tau_tilde) * mu
          + tau_tilde * theta_tilde;
}
model {
  mu ~ normal(0, 10);
  tau ~ cauchy(0, 10);
  theta_tilde ~ normal(w * mu, pow(tau, w));

  y ~ normal(theta, sigma);
}
and compared performance between \sigma \rightarrow 0, which models informative data, and \sigma \rightarrow \infty, which models non-informative data. In the attached plot you can see that the fully centered and fully non-centered parameterizations dominate: save for a very, very narrow crossover region where partial centerings may be slightly better, either the fully centered or the fully non-centered parameterization performs best.

time_per_ess.pdf (18.7 KB)

Perhaps more relevant to your application: the entire hierarchy need not be centered or non-centered all at once. You can center or non-center each individual group in the hierarchy, so you could gather your groups into "informative data" and "non-informative data" sets, implement the former with a centered parameterization and the latter with a non-centered parameterization, and see if that improves anything.
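One way to implement such a split in Stan is to pass index sets for the two kinds of groups. A sketch with invented names (`kc`, `knc`, and a simplified per-group summary likelihood), not a drop-in solution:

```stan
data {
  int<lower=1> K;
  int<lower=0> K_c;
  int<lower=0> K_nc;                 // K_c + K_nc == K
  int<lower=1, upper=K> kc[K_c];     // indices of data-rich ("informative") groups
  int<lower=1, upper=K> knc[K_nc];   // indices of data-poor groups
  vector[K] y;                       // e.g. one summary per group
  vector<lower=0>[K] se;             // per-group standard errors
}
parameters {
  real gamma;
  real<lower=0> delta;
  vector[K_c] mu_c;                  // centered groups: sampled on the natural scale
  vector[K_nc] mu_nc_raw;            // non-centered groups: standard-normal scale
}
transformed parameters {
  vector[K] mu;
  mu[kc] = mu_c;
  mu[knc] = gamma + delta * mu_nc_raw;
}
model {
  gamma ~ normal(0, 10);
  delta ~ cauchy(0, 5);
  mu_c ~ normal(gamma, delta);       // centered prior
  mu_nc_raw ~ normal(0, 1);          // implies mu[knc] ~ normal(gamma, delta)
  y ~ normal(mu, se);
}
```

The index arrays are computed once on the interface side (e.g. in R) from whatever cutoff on N_k you choose, so moving the cutoff requires no change to the Stan program.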


Thanks. This is what I ended up doing, with N_k > 20 treated as "informative" and N_k \le 20 as "non-informative", and it works reasonably well. Maybe there is a better cutoff; I searched on a grid 5:5:100 and this was a reasonable pick. It improves the ESS for the badly mixing variables without affecting the others, and it does not affect runtime either.

Incidentally, my understanding is that the mass matrix is adapted from the covariance of the sample during warmup, and with 10^5 parameters I wonder whether it would make sense to restrict the mass matrix to a diagonal during adaptation (the off-diagonal estimates must be mostly noise). But I couldn’t figure out whether this is possible via the rstan interface, or at all (without programming in C++).

The optimal cutoff will depend on the exact shape of the likelihood relative to your hyperpriors, which won’t be something you can work out analytically, so empirical analysis is your best bet. Just keep looking at the divergences and, conditional on those, the effective sample size per iteration or per gradient evaluation.

All of the Stan interfaces default to diagonal mass matrices already.
