Sampling from the prior - why am I seeing divergent transitions?

Hi
I’m (trying) to work with the following survival model. My goal is to obtain prior distributions using Stan, but I’m getting several divergent transitions (see below). I observed this behavior when working with a hierarchical model, but was able to solve the problem by using non-centered parameterization. I’m puzzled because this model is not (I think) hierarchical.

Why is this happening and how can I solve this issue? I can share data if necessary.

data {
    //Dimensions
  int<lower=1> N;
  int<lower=1> N_site;
  int<lower=1> N_plot;
  int<lower=1> N_cover;
 
  //Survival stuff
  vector[N] y;
  vector[N] y1;
  vector[N] y2;
  int<lower=1,upper=2> cens[N];

  //Explanatory variables
  vector[N] LAI;
  vector[N] OM;
  vector[N] pH;
  vector[N] SM;
  vector[N] ST;
  int<lower=1,upper=5> cover_id[N];
  
  //group variables
  int<lower=1> site_id[N];
  int<lower=1> plot_id[N];

}
parameters {
  //intercept
  real a;
//varying intercepts
  vector[N_site] a_site;
  vector[N_plot] a_plot;
  vector[N_cover] a_cover;

   //slopes
   real b_om;
   real b_ph;
   real b_sm;
   real b_st;
   real<lower=0> k;
}
transformed parameters {
    vector[N] mu=exp(a + a_site[site_id]+a_plot[plot_id]+ a_cover[cover_id]+b_sm*SM+b_st*ST+
    b_om*OM+b_ph*pH);
   vector[N] lambda= mu / tgamma(1 + 1 /k);
}
model {
  //Likelihood
  //for (i in 1:N)
  //if ( cens[i] == 1 ) target += weibull_lccdf(y[i] | k, lambda[i]);
  //else target += log_diff_exp(weibull_lcdf(y2[i] | k, lambda[i]),weibull_lcdf(y1[i] | k, lambda[i]));
 
 //Priors
  a ~ normal(0, 1);
  k ~ gamma(0.5, 0.5);
  a_site ~ normal(0,1);
  a_plot ~ normal(0,1);
  a_cover  ~ normal(0,1);
  b_om ~ normal(0,1);
  b_ph ~ normal(0,1);
  b_sm ~ normal(0,1);
  b_st ~ normal(0,1);
  
}
generated quantities{
  //vector[N] log_lik;
  vector[N] surv;
  for(i in 1:N){
  surv[i] = weibull_rng(k, lambda[i]);
}
}
library(cmdstan)
mod <- cmdstan_model("survival_prior.stan")
fit <- mod$sample(
  data = data_list,
  chains = 4,
  parallel_chains = 4
)

Warnings:

Warning: 385 of 4000 (10.0%) transitions ended with a divergence.
This may indicate insufficient exploration of the posterior distribution.
Possible remedies include: 
  * Increasing adapt_delta closer to 1 (default is 0.8) 
  * Reparameterizing the model (e.g. using a non-centered parameterization)
  * Using informative or weakly informative prior distributions
2 Likes

Changing the prior for “k” seems to solve the issue of the divergent transitions.

k ~ gamma(2, 0.5);

1 Like

Divergent transitions can happen with any posterior, doesn’t have to be for a hierarchical model. Even though the parameters are a priori independent, Stan is sampling the 6 + N_site + N_plot + N_cover dimensional distribution.

2 Likes

In this case (and I think this is broadly true when parameters are completely independent, in the sense that the gradient for a given parameter never depends on any other other parameters), the global problems are no worse than the problems for the margins. In this case, we get divergences even if we restrict the model to a single tricky margin.

parameters {
  real <lower=0> y;
}
model {
  y ~ gamma(0.5, 0.5);
}

Yields

Warning: 8 of 4000 (0.0%) transitions ended with a divergence.

Given that the remainder of the model doesn’t yield divergences, the number of divergences can be larger or smaller depending on the remainder of the multivariate model, but this is less because the rest of the model makes the posterior fundamentally easier or harder to sample from, and more because the rest of the model influences the step-size adaptation.

5 Likes

The infamous gamma(0.5, 0.5) is commonly recommended as a “non-informative” prior model but upon closer inspection isn’t actually pretty nasty.

On the nominal scale the gamma(0.5, 0.5) density function looks something like

Recall that Stan explores an unconstrained space, in this case log(y). The corresponding probability density function for log(y) is (I think I got the Jacobian correction right here…)

Because the peak is so far away from 0, around which Stan’s sampler is initialized by default, it will take a long time for the sampler to get to the peak. If it doesn’t get there fast enough then Stan’s adaptation will be informed by tail behavior and not the behavior within the typical set; once the sampler finally reaches the peak that stale adaptation can be ill-suited to the geometry there and cause divergences. The asymmetric shape of the density function around that peak can also stress the adaptation.

When constructing posterior distributions the realized likelihood function will often regularize much of this behavior. That should be no excuse, however, for such an awkward prior model.

2 Likes

I thought the usual “non-informative” prior was gamma(0.001,0.001) (which is even worse!).

I must say I am rather skeptical of any plot whose x-axis goes from \log y = -1000 to \log y = +1000, even if the only discernible feature is a single peak. Doubly so, if that feature happens to be suspiciously close to \log y \approx -750, i.e. the region where floating-point arithmetic typically over-/underflows.

My own calculation gives

\begin{eqnarray*} \mathrm{d}P & = & \frac{\beta^{\alpha}}{\Gamma\left(\alpha\right)}y^{\alpha-1}e^{-\beta y}\mathrm{d}y & = & \frac{0.5^{0.5}}{\Gamma\left(0.5\right)}y^{0.5-1}e^{-0.5y}\mathrm{d}y\\ & = & \frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{y}}e^{-\frac{1}{2}y}\mathrm{d}y & = & \frac{1}{\sqrt{2\pi}}\sqrt{y}e^{-\frac{1}{2}y}\frac{\mathrm{d}y}{y}\\ & = & \frac{1}{\sqrt{2\pi}}\sqrt{e^{z}}e^{-\frac{1}{2}e^{z}}\mathrm{d}z\\ & = & \frac{1}{\sqrt{2\pi}}e^{\frac{1}{2}z-\frac{1}{2}e^{z}}\mathrm{d}z \end{eqnarray*}

and the peak is a bit closer to zero
prob
I suspect the divergencies caused by the steep cliff on the right. It’s even more pronounced when plotting log-probability
logprob

5 Likes

Conventions vary from field to field. The commonality is a prior density of the form \text{gamma}(y; \epsilon, \epsilon) or even \text{inv-gamma}(y; \epsilon, \epsilon) so that the density concentrates against y = 0 instead of suppressing it. The conventional value of \epsilon changes across time and disciplines.

Yes, while I had the Jacobian correct in my haste to get an answer out I made the cardinal sin of naive multiplication on the linear scale, dgamma(y, epsilon, epsilon) * exp(y) instead of canceling everything on the log scale to avoid floating point issues. Implementing the log density properly gives a result that agrees with yours,

Indeed the problem isn’t the location of the typical set relative to the initialization but rather the asymmetry in the log density function. If warmup spends too much time on the left then the adaptation will end up in too aggressive of a step size configuration which can lead to occasionally divergences when exploring the sharp drop off on the right. The stochastic nature of the adaptation can also lead to some fits exhibiting divergences and some fits not exhibiting divergences.

Increasing adapt_delta forces a less aggressive step size adaptation which should help, as does changing the prior model to soften that drop off in the log density function.

2 Likes