Reparameterizing the Gamma

I am trying to sample from the erlang-2 (gamma with shape=2) distribution as a prior for a gaussian process, and seem to be running into issues with divergent transitions in the left tail of the distribution. (At least I think that is what is causing them - it seems that these extreme values are linked to the divergent transitions, and I’ve already switched to non-centered parameterizations to reduce the funnel).

At this point, I am working with simulated data from the generative process, so it isn’t due to model mispecification. I can eliminate them by cranking up adapt_delta to 0.99, but that slows things down somewhat.

The manual has suggestions for alternative parameterizations for the Cauchy and t-distribution but not for gamma.

Two Questions:

Is sampling from the gamma likely to cause divergent transitions in this way? It seems like the most likely cause given that most of the divergent transitions happen in the tail, but I’m not convinced there aren’t issues elsewhere.

If so, what would be the best way to reparameterize the model? There isn’t a closed form solution for the inverse CDF, so doing it the same way as the Cauchy may be difficult, however I could use the approximation from Lin 1988, JRSS. However, I’m not sure how good the approximation is, and it is not differentiable at 0.5. Alternatively, I could parameterize it as the sum of two exponential RV similar to the t example, but it seems like this would cause identifiability issues.

There isn’t much you can do to reparameterize the Gamma distribution. There was a thread a while back on the inverse CDF of the Gamma distribution, where the main difficulty was that there was no easy way to differentiate it with respect to the shape parameter. But if you are using a fixed shape parameter, then that isn’t an issue. You could also try increasing the rate parameter for the Gamma distribution and dividing a Gamma(2,1) by that rate.

I’m surprised by Ben’s response because this sometimes works well for me:

parameters {
  real mu_0;
  real sigma_0;

model {
  mu_0 ~ normal(0,sigma_log);
  sigma_0 ~ normal(0,sigma_log);

  vector[n_obs] mu;
  vector[n_obs] sigma;
  vector[n_obs] alpha;
  vector[n_obs] beta;
  for ( i in 1:n_obs )
    mu[i] = kappa[1] * exp(mu_0);
  for ( i in 1:n_obs )
    sigma[i] = kappa[2] * exp(sigma_0);
  for ( i in 1:n_obs )
    alpha[i] = (mu[i]/sigma[i])^2;
  for ( i in 1:n_obs )
    beta[i] = (sigma[i]^2)/(mu[i]);

  for ( i in 1:n_obs ) {
    target += gamma_lpdf_1S(t_delta[i], alpha[i], beta[i]);
    if (t_truncate[i] != positive_infinity())
      target += -gamma_lcdf_1S(t_truncate[i], alpha[i], beta[i]);

generated quantities {
  real g_alpha;
  real g_beta;
  g_mu = kappa[1] * exp(mu_0);
  g_sigma = kappa[2] * exp(sigma_0);
  g_alpha = (g_mu/g_sigma)^2;
  g_beta = (g_sigma^2)/g_mu;

… with a fixed shape you really shouldn’t be having any trouble so I’m guessing that your problem is somewhere else.

Yeah, I came across that thread. as well. It seemed like the main difficulty was that it required a line-search step, but that wouldn’t go away with a fixed shape parameter. @sakrejda mentioned the closed-form chi-squared approximation, but that is still an approximation for the chi-squared (though it appears a different one from my reference in the original post).

It seems like this would be the same model and wouldn’t address the need for varying stepsizes. Is the idea here just to make things more numerically stable for small values?

@sakrejda I’m a bit confused what is going on in your code. What is the gamma_lpdf_1S function?

I’m trying to get a bit more intuition behind this - to me, why wouldn’t we expect to see the same pathologies in the tails of the cauchy to come up in the tails of the gamma?

See the new Gaussian process section in the 2.16.0 Stan manual which discussed the extreme importance of selecting the lower tail of your length scale prior appropriately. Long story short, you better not have much mass below the minimum covariate distance or you will get unavoidable divergences.

1 Like

Thanks - I was aware of the identifiability problem but not the HMC specific pitfalls. Switching to the inverse gamma prior seems to help things somewhat. Am I reading it correctly that the model on page 250 is using the gamma prior, or did I miss a reciprocal? I’ve also been looking at Penalized Complexity Prior over both the length scale and the variance…

what about increase your shape parameter to concentrate the probability away from zero?

Just gamma

Gamma_lpdf, drop the suffix, don’t forget to use the bar.

This is not directly related to the non-identified ridge in squared exponential (really any Matern) kernel. Rather it is an interaction of the kernel with the observation model that allows the measurement variation to shink to zero so the the GP nearly interpolates between the points. This is such a good fit that the likelihood induces a large posterior mass below the minimum covariate distance, biasing the posterior towards this region. This behavior is also not specific to HMC, rather HMC is the only algorithm capable of identifying this problem.

The gamma has a longish tail on the left side so you have to be careful to ensure that P[rho < min_delta_x] ~ 0.05 or even lower. Inverse gamma has a sharper lower tail but a much heavier upper tail. The generalized inverse gamma has nice sharp tails on both sides but you’d have to write that as a function yourself, as suggested in the manual.

PC priors won’t help here as they explicitly do not avoid the default value, which here would be length scale = 0. For GPs with the squared exponential kernel you need a prior that is both zero avoiding and infinity avoiding.

So is this the basically issue as covariance matrices for latent variables then that is addressed in Chung, German, et al. 2015, but as applied to GP?

Not directly related.


The PC prior of Fuglstad corresponds to a base model with length scale of infinity - so it should deal with the left tail.

According to the manual, the infinity avoiding prior is only necessary necessary when the model also has mean and fixed effects. Is that still the current best practice?