Model gets stuck - but not quite always

I have written quite a simple model in which the likelihood is a function of three parameters \mu, \sigma, \xi that are derived by a 1-1 continuous transformation of three independent parameters q_{1}, q_{2}, q_{3} each of which is given a gamma prior. I have written a custom probability function.

When I have tested this Stan program with 400 iterations and one chain, it occasionally converges very quickly to what I know by other means is definitely the right result, but mostly the sampler quickly locks on to arbitrary values and stays there, reporting that maximum tree depth has been reached.

I have just tried again with four chains. One of them mixed well and gave results that I know are reasonable. The other three did not. I looked at the summary sampler parameters in shinystan. The good chain shows:
accept_stat = 0.9989
stepsize = 0.0159
treedepth = 6.725
n_leapfrog = 206.5
energy = 5.06

By contrast, for example, the figures for one of the other chains were:
accept_stat = 0.11
stepsize = 0.0000
treedepth = 5.1
n_leapfrog = 53209
energy = -366

All three bad chains reported numerous cases of maximum treedepth being reached (I had set it to 20).
The Stan program is:

functions {
  // per-observation log intensity of the point process
  vector lambda(vector y, real mu, real sigma, real xi) {
    vector[rows(y)] result;
    real comp;
    if (fabs(xi) > 1e-15) {
      real inv_xi = inv(xi);
      for (i in 1:rows(y)) {
        comp = 1 + xi * (y[i] - mu) / sigma;
        result[i] = comp > 0 ? -(1 + inv_xi) * log(comp) - log(sigma) : 0.0;
      }
    } else {
      // branch for xi effectively zero
      for (i in 1:rows(y)) {
        comp = (y[i] - mu) / sigma;
        result[i] = comp > 0 ? comp - log(sigma) : 0.0;
      }
    }
    return result;
  }

  // contributes the -Lambda term: minus the expected number of
  // exceedances of threshold u over nyears
  real Blambda(real mu, real sigma, real xi, real nyears, real u) {
    real inv_xi = inv(xi);
    real term1;
    if (fabs(xi) > 1e-15) {
      term1 = 1 + xi * (u - mu) / sigma;
    } else {
      term1 = exp((u - mu) / sigma);
    }
    term1 = term1 > 0 ? -(term1 ^ (-inv_xi)) * nyears : 0;
    return term1;
  }

  // point-process log likelihood: -Lambda plus the summed log intensities
  real poispt_lpdf(vector y, real mu, real sigma, real xi, real nyears, real u) {
    real L = Blambda(mu, sigma, xi, nyears, u);
    real l = sum(lambda(y, mu, sigma, xi));
    return L + l;
  }
}

data {
  int<lower=0> N;  // number of observations
  vector[N] y;     // observed values
  real nyears;     // number of years of observation
  real u;          // threshold
}
parameters {
  real<lower = 0> qtilde1;
  real<lower = 0> qtilde2;
  real<lower = 0> qtilde3;
}
transformed parameters {
  real mu;
  real xi;
  real<lower = 0> sigma;
  real<lower = 0> nu;
  nu = qtilde3 / qtilde2;
  xi = log10(nu);
  sigma = qtilde2 * xi / (nu * (nu - 1));
  mu = qtilde1 - qtilde2 / nu;
}
model {
  // priors
  qtilde1 ~ gamma(38.9, 0.67);
  qtilde2 ~ gamma(7.1, 0.16);
  qtilde3 ~ gamma(47.0, 0.39);
  // includes Jacobian adjustment
  target += poispt_lpdf(y | mu, sigma, xi, nyears, u)
            - log(xi) + log(nu) + log(nu - 1) + log(qtilde3);
}

First, you don't need a Jacobian adjustment here, as you are putting priors directly on the qtilde parameters. You only need the Jacobian when you have a parameter theta and try to write f(theta) ~ ... for some nonlinear transform f. There's more discussion in the manual.

You might also want to work on standardizing the scales of the qtilde parameters. The means are in double digits, which leaves adaptation to do the work of figuring that out. But the gamma distribution isn't a location-scale family, so it isn't so easy.

Also, you can declare-define, so the transformed parameters can be simplified to

real<lower = 0> nu = qtilde3 / qtilde2;
real xi = log10(nu);
...

Those sampler diagnostics indicate a very poorly conditioned model. They mean the sampler had to choose a very small step size to get decent acceptance, which in turn causes a huge number of leapfrog steps (a tree depth of 20 implies up to 2^20, roughly one million, log density and gradient evaluations).

Many thanks for the advice. The purpose of the program is to examine the sensitivity of the results to various priors as alternatives to those used in a published paper. The priors encode specific expert opinion, which I am obviously unable to change, and I must use the same priors to set up my baseline for comparison with alternatives.

Given that I cannot change the priors, is there anything I can do to yield a more stable Stan model? Is there some way of guiding the adaptation?

You can try starting with a smaller step size, setting a higher target acceptance rate, and running more warmup iterations.
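For example, assuming you are driving Stan from R via rstan (you mention shinystan; the file and data names here are hypothetical), those knobs look like:

fit <- stan("poispt.stan", data = stan_data, chains = 4,
            iter = 4000, warmup = 3000,          # longer warmup
            control = list(stepsize = 0.01,      # smaller initial step size
                           adapt_delta = 0.99))  # higher target acceptance rate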

A better way to fix the geometry is to reparameterize. This usually won’t change the model inferences, but I don’t know how to do the unit-scale transforms for gamma distributions—they don’t have nice linearity properties like the location-scale families.

But it looks like there’s some pathology in the model that’s causing extreme curvature in the posterior. I would be skeptical that other software was able to fit this model. The way to test is with simulation-based calibration (formerly known as the Cook-Gelman-Rubin method).

Also, the if (fabs(xi) > 1e-15) branch can cause problems: the result is not continuously differentiable there, which matters if values of xi really close to zero are possible. The comp > 0 comparison is going to wreak similar havoc on continuous differentiability.
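A related point about that ternary (just a sketch of one alternative, not a requirement): when comp <= 0 the function currently contributes log density 0.0, whereas if such values are meant to lie outside the support, returning negative infinity would make Stan reject those proposals explicitly:

result[i] = comp > 0
            ? -(1 + inv_xi) * log(comp) - log(sigma)
            : negative_infinity();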

Is the model as written even identifiable with the transforms?

Thanks again. I will have a go at simulation-based calibration. I am in the frustrating position of knowing that this model (or, at any rate, one with an identical parameterization and transforms) can be fitted, but not knowing how in detail, as the code was written in C++ and I do not have access to it anyway.

The test for small xi is inspired by Aki Vehtari’s notebook on the generalised Pareto distribution, but I will see what happens if I ignore the very small chance that xi will get that small.

On what do you base that knowledge?

The reason I ask is that a lot of people report fitting models when they are not in fact exploring the whole posterior. Unless they did something like simulation-based calibration, I wouldn't trust a fit without validating it against simulated data. That way you know the true parameter values and can check whether they are recovered.

A fair question.

I am seeking to reproduce using Stan the results of a much-cited paper by Coles and Tawn from 1996. That paper does not specify precisely the methods used but implies that they included a custom-written MCMC algorithm. The same data are used as an example in the R package evdbayes, and the methods are, again, not specified in detail (being coded in C++ under the hood). Examination of the resultant chains suggests to me that some kind of Gibbs sampler is being used. More recently, importance sampling on the same data has yielded results that are consistent with the 1996 paper and with evdbayes.

Using Stan and the transforms as I have derived them (they were not stated in the 1996 paper or in evdbayes), my priors for the mu, sigma, and xi parameters appear to be consistent with the earlier work. And in the rare cases in which my Stan code delivers stable results, those results are completely consistent with the earlier work, provided that I use a Jacobian adjustment. If I do not, the results are inconsistent with the earlier work.

I have read the section of the Stan manual on Jacobian adjustments several times. It is perhaps not the clearest part of the manual, but I concluded from it, as you state, that the Jacobian adjustment is not needed. Yet in my first attempts to code this in Stan I did not include it, and I obtained stable results that were quite inconsistent with all the earlier work to which I have referred.

I am most grateful to you for referring me to the recent work on simulation-based calibration. What is handicapping me at the moment in applying it is that I do not yet have a stable means of drawing posterior samples given the simulated data sets that SBC requires.

You should be able to run that in forward mode by rewriting your program to use RNGs in the generated quantities block.
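For instance, here is a minimal sketch of what that might look like, relying on the standard point-process equivalence: the number of exceedances of u is Poisson with rate nyears * (1 + \xi (u - \mu) / \sigma)^{-1/\xi}, matching your Blambda, and the excesses over u are generalized Pareto with shape \xi and scale \sigma + \xi (u - \mu). The _sim names and the data-block bound max_N are mine, and you would still need to guard against prior draws that fall outside the support:

generated quantities {
  // draw the q parameters from their priors
  real q1_sim = gamma_rng(38.9, 0.67);
  real q2_sim = gamma_rng(7.1, 0.16);
  real q3_sim = gamma_rng(47.0, 0.39);
  real nu_sim = q3_sim / q2_sim;
  real xi_sim = log10(nu_sim);
  real sigma_sim = q2_sim * xi_sim / (nu_sim * (nu_sim - 1));
  real mu_sim = q1_sim - q2_sim / nu_sim;
  // simulate the number of exceedances of u over nyears
  int n_sim = min(poisson_rng(nyears
                              * (1 + xi_sim * (u - mu_sim) / sigma_sim)
                                ^ (-inv(xi_sim))),
                  max_N);
  // pad unused slots so the container has fixed size
  vector[max_N] y_sim = rep_vector(negative_infinity(), max_N);
  for (i in 1:n_sim) {
    // inverse-CDF draw from the generalized Pareto excess distribution
    real sigma_u = sigma_sim + xi_sim * (u - mu_sim);
    y_sim[i] = u + sigma_u * (uniform_rng(0, 1) ^ (-xi_sim) - 1) / xi_sim;
  }
}

Fitting your original program to each simulated data set and ranking the true q draws within the resulting posterior samples then gives the rank statistics SBC needs.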

Sorry about that. That’s mainly due to my own fuzziness. If you have suggestions for how it can be improved, I’d love to hear them.

Basically, the use of Jacobians in Stan is not something Stan-specific. It's just about the density that is being defined.
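Concretely, if y = f(x) with f monotone and x has density p_X, then the density of y is

p_Y(y) = p_X(f^{-1}(y)) \left| \frac{d}{dy} f^{-1}(y) \right|,

and the log of that absolute-derivative factor is what target += adds as the Jacobian adjustment.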

I still don't see why the Jacobians are necessary here. Usually, you need a Jacobian adjustment when you want to convert a pdf of X into a pdf of Y under a nonlinear transform. Here, you're not doing that. For example, to get a lognormal distribution for alpha, you need a Jacobian in the following program:

parameters {
  real<lower = 0> alpha;
}
model {
  log(alpha) ~ normal(0, 1);
  target += -log(alpha);  // Jacobian adjustment for the log transform
}
But you do not need a Jacobian when you do it the other way around:

parameters {
  real log_alpha;
}
model {
  real alpha = exp(log_alpha);
  log_alpha ~ normal(0, 1);
  // adding a Jacobian for exp(log_alpha) would produce the wrong result
}