Sampling from inverse gamma

A general question about sampling from challenging distributions.

For some varying intercept models, I sometimes need to use informative priors on the standard deviations of the so-called random effects. The inverse gamma works well when I want the standard deviation to be larger than the data might otherwise suggest, thanks to its enormous right tail. When I am using optimization, it tends to behave fairly well.

However with Stan, I find that I am getting a lot of divergent transitions, even with specifications like inv_gamma(.5,10). In reviewing the reparameterization section of the Stan User’s Guide, I noted the discussions about the benefits of reparameterizing Cauchy and Student_t distributions because of their large tails.

Just generally speaking, would people expect the inverse gamma to present similar sampling challenges to the Cauchy and student-t? If so, is there a reparameterization strategy analogous to the ones discussed in the Stan User Guide that might be worth exploring for the inverse gamma? Perhaps an even better question would be whether there are other distributions that could have a similar informative effect (encouraging larger standard deviations for varying intercepts) but which might be easier for Stan to sample.

I’d appreciate any thoughts. Thanks.

Jonathan

Does your model looks like:

a ~ normal(mu, sigma);
sigma ~ inv_gamma(0.5, 10);

Where ‘a’ and ‘sigma’ are parameters.

Have you non-centered ‘a’? If not, that’ll cause divergences. Non-centering and the whats, whys, hows here: https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html

1 Like

Ben, thanks for this.

Courtesy of brms, the model looks more like this, and I do believe it is already non-centered by default:

  vector[N] mu = temp_Intercept + Xc * b;
  for (n in 1:N) {
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_2_1[J_2[n]] * Z_2_1[n] + r_3_1[J_3[n]] * Z_3_1[n];
  }
  // priors including all constants
  target += normal_lpdf(b | 0,1);
  target += student_t_lpdf(temp_Intercept | 3, 0, 10);
  target += inv_gamma_lpdf(sd_1 | .5,25);
  target += normal_lpdf(z_1[1] | 0, 1);
  target += inv_gamma_lpdf(sd_2 | .5,25);
  target += normal_lpdf(z_2[1] | 0, 1);
  target += exponential_lpdf(sd_3 | 1);
  target += normal_lpdf(z_3[1] | 0, 1);

  // likelihood including all constants
  if (!prior_only) {
    target += bernoulli_logit_lpmf(Y | mu);
  }
}

The challenge is that the prior on the random effects (sd_1 and sd_2 are the focus) needs to be informative, and the sample sizes are large, which means that the prior needs to have enough oomph (to use the official term) to be sufficiently influential. The inverse gamma works well in an optimization setting as specified above, but Stan seems to struggle to sample it even with larger tree sizes (the scaled inverse chi-squared doesn’t seem to necessarily do better). The log gamma works fine in INLA with analogous specifications to the inverse gamma, but that again is an optimization setting.

Perhaps it would be better to just focus on regular gamma distributions sufficiently concentrated and centered far enough from zero and see if Stan samples from those better? It may almost be worth a separate topic but I feel like the community has spent so much time trying to find the perfect weakly informative priors that the process of finding a good informative prior, both in terms of easy sampling and overall specification flexibility, is a topic that people don’t talk about enough, at least in terms of the sigma of a varying intercept.

Thanks again for these thoughts. In the meantime, I’ll see if the gamma can serve this purpose sufficiently.

Jonathan

Oh yeah good call, brms does the non-centering.

Why do you want to keep your standard deviation away from zero? What about just doing a zero centered normal with a big standard deviation?

Informative priors will probably make the sampler work better. If you can justify them, go for it!

Well, my thinking was that since the standard deviations on a modeled group have to be positive I should stick with positive-constrained distributions. But doing a half normal or something sure does sound like it would make sampling easier. I’ll give that a shot. I’ve also seen folks use the lognormal for this but that seems to be a controversial choice. Thanks again.

The argument for including zero (and putting significant mass there) is that zero is the version of the model where the random effects are gone – all the groups are the same. That’d be the easy case for modeling cause then you wouldn’t have these group level parameters – it’d just be one overall parameter.

The opposite of that is letting the standard deviation on the hierarchical parameter go off to infinity so everything is estimated on its own. That’s tough because for each parameter you might not have much data so your estimates will be really noisy.

And so we kinda want to squish things to zero to take advantage of all our data and keep the noise down, but we leave a little wiggle room for things to be different. I think that’s the handwavy picture.

1 Like