Step size chosen during adaptation is too small when using link function

I’ve been investigating the use of link functions to improve the accuracy of posterior estimation in rare event problems. I have already successfully generated results with Stan for a Poisson regression problem using an inverse gamma link function.

I am now tackling the problem of using an inverse beta link function in a logistic regression problem. Here is my code:

functions {
  real invBetaSpec(real x);
  real normal_lcdf3(real x);
}

data {
  int N; //Num trials
  int M; //Num covariates
  int y[N]; //Observations
  
  // covariates
  matrix[M,N] X;
}

parameters {
  row_vector[M] beta;
  real beta0;
}

transformed parameters {
  row_vector[N] alpha;
  vector[N] not_Logistic_alpha;
  
  alpha = beta*X;
  
  alpha += beta0;
  
  for (n in 1:N){
    not_Logistic_alpha[n] = invBetaSpec(exp(normal_lcdf3(alpha[n])));
  }
}

model{
  beta ~ std_normal();
  beta0 ~ std_normal();
  
  y ~ bernoulli(not_Logistic_alpha);
}

The problem I get is that at least one of the chains usually gets stuck with a small step size during the warmup adaptation period, especially for large N (>100). This results in sampling times of several hours followed by lots of max_treedepth warnings and poor mixing. On lucky occasions, and with small N, each chain sometimes adapts to a suitably large step size and the sampling finishes within a minute or so without any warnings. Reducing adapt_delta helps in some cases but the problem still occurs for large N.

To help speed things up I have already written the C++ functions invBetaSpec and normal_lcdf3 which define the inverse beta and normal log cdf functions for a specific set of parameters. In both cases the gradient is also defined (for invBetaSpec this is via a lookup table as an analytic gradient doesn’t exist). However, none of this seemed to improve the situation.

The equivalent model using bernoulli_logit() works without problems on the same data, even for large N. I thought the issue might be with the geometry of the problem when using bernoulli(), so I wrote code for the equivalent model without the link function:

data {
  int N; //Num trials for each group
  int M; //Num covariates
  int y[N]; //Observations
  
  // covariates
  matrix[M,N] X;
}

parameters {
  row_vector[M] beta;
  real beta0;
}

transformed parameters {
  row_vector[N] alpha;
  vector[N] logistic_alpha;
  
  alpha = beta*X;
  
  alpha += beta0;
  
  for (n in 1:N){
    logistic_alpha[n] = 1/(1 + exp(-alpha[n]));
  }
}

model{
  beta ~ std_normal();
  beta0 ~ std_normal();
  
  y ~ bernoulli(logistic_alpha);
}

This model runs without any problems. So it seems that the link function alone is somehow responsible for the issue in step size selection during adaptation. Any ideas why this might be and how I might fix it?
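
For reference, the bernoulli_logit variant that works is essentially this (a sketch, reusing the same data and parameters blocks as above):

model {
  beta ~ std_normal();
  beta0 ~ std_normal();
  
  // works directly on the log-odds scale, so no explicit link evaluation
  y ~ bernoulli_logit(to_vector(beta*X) + beta0);
}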


@nhuurre usually has very good intuition about these problems. Maybe he can help?

Difficulties in stepsize adaptation may arise if the function is discontinuous or the gradient calculation is not accurate enough. You say the gradient comes from a lookup table so I guess that’s an obvious suspect here.

I think Stan has some functionality for comparing the autodiffed gradient to finite differences. CmdStan has a diagnose method and PyStan says something about a test_grad argument, but none of that seems to be documented.


Good catch. It’s difficult to comment more as the code for the functions is not visible.

Can you post traceplot of the warmup iterations for some parameters?

If the problem is not due to discontinuities, then diagnostics from campfire - New adaptive warmup proposal (looking for feedback)! - might be able to tell more, or at least the adaptation might work better. If you try campfire, post the diagnostic results here and @bbbales2 can interpret them for you.

Sorry, I should have been clearer about the implementation. Originally the invBetaSpec function didn’t have a defined gradient and the issue was still present. I implemented the gradient in the hope that it might speed up the sampling enough that I could solve the issue by simply generating more samples.

I’ll have a look at the campfire diagnostics but as requested here is the traceplot of the warmup:

And here is the traceplot of the equivalent model:


Have you looked at what kind of values you get with the initial values being in the range [-2,2]? Can the initial value for beta0 be such that there are numerical problems in invBetaSpec(exp(normal_lcdf3(alpha[n])))?
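
For example, a quick way to check (just a sketch) is to print the range from inside the transformed parameters block:

// diagnostic sketch: add at the end of transformed parameters to see
// the range of log-CDF values the link function receives each iteration
{
  vector[N] lcdf_vals;
  for (n in 1:N)
    lcdf_vals[n] = normal_lcdf3(alpha[n]);
  print("lcdf range: [", min(lcdf_vals), ", ", max(lcdf_vals), "]");
}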


Okay I’ve had a look at the initial values.

I had spent quite a bit of time writing a new implementation of the normal_lcdf for the purpose of this problem (see https://github.com/stan-dev/math/pull/1411) and was confident that there would no longer be any numerical issues.

The max value of normal_lcdf3(alpha[n]) was -8.727682e-17 and the minimum was -45.79481. Over the whole warmup period the maximum values for each chain were:

-2.579517e-13
-2.124914e-13
-1.916065e-07
-8.727682e-17

And the minimum values were:

-32.99099
-45.56727
-45.79481
-42.87631

So I’m still confident that the issue isn’t with the normal log cdf. Now let’s look at these values after applying exp():

The maxima become:

1.000000e+00
1.000000e+00
9.999998e-01
1.000000e+00

and the minima become:

4.701042e-15
1.623258e-20
1.292903e-20
2.393611e-19

Hmm… so it looks like floating point arithmetic is running out of precision: the log values close to 0 all collapse to (nearly) exactly 1 after exp(), so invBetaSpec can no longer distinguish its inputs. Do you think this might be the culprit?
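
For scale (an illustration using the chain maxima above; doubles adjacent to 1.0 differ by about 2.2e-16):

real x1 = exp(-8.727682e-17);  // rounds to exactly 1.0
real x2 = exp(-2.579517e-13);  // representable, but 1 - x2 retains only
                               // ~3 significant digits of its true value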


UPDATE:

I added another variable and attached an appropriate limit to it.

functions {
  real invBetaSpec(real x);
  real normal_lcdf3(real x);
}

data {
  int N; //Num trials
  int M; //Num covariates
  int y[N]; //Observations
  
  // covariates
  matrix[M,N] X;
}

parameters {
  row_vector[M] beta;
  real beta0;
}

transformed parameters {
  row_vector[N] alpha;
  vector[N] not_Logistic_alpha;
  real<upper=-1e-7> exp_val;
  
  alpha = beta*X;
  
  alpha += beta0;
  
  for (n in 1:N){
    exp_val = normal_lcdf3(alpha[n]);
    not_Logistic_alpha[n] = invBetaSpec(exp(exp_val));
  }
}

model{
  beta ~ std_normal();
  beta0 ~ std_normal();
  
  y ~ bernoulli(not_Logistic_alpha);
}

Just tried running it and I’m still waiting for the first 100 warmup samples, so I’m assuming that wasn’t the cause!

That depends on how close invBetaSpec(1.0) and invBetaSpec(0.9999998) are. If they’re approximately equal then it’s probably fine.

Too bad, the limit doesn’t do anything here: a bound declared on a transformed parameter only validates the value (rejecting the draw if it’s violated), it doesn’t transform it; see e.g. this. You could use fmin(-1e-7, exp_val) instead.
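
That is, something like this sketch inside the loop:

for (n in 1:N) {
  // clamp in code; a declared bound only validates, it doesn't transform
  real exp_val = fmin(-1e-7, normal_lcdf3(alpha[n]));
  not_Logistic_alpha[n] = invBetaSpec(exp(exp_val));
}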

I was suspecting exp, and based on the values there is not enough accuracy close to 1. Can you rewrite normal_lcdf3 so that it would take the log value and possibly use numerically more stable computations inside?

That is quite a big range.

Yeah, what Aki said. Doing anything on the probability scale can cause problems. The bernoulli_logit thing is there to avoid doing exps as much as possible (the regular bernoulli thing breaks all the time).

Thanks for the comments everyone.

The values are “approximately” equal but the gradients aren’t, as you can see below:

I’m guessing you meant rewriting invBetaSpec() to get rid of the need for exp(). The function is simply a wrapper for the ibeta() function from boost. I’ve had a look at how that is implemented and they refer to this paper: https://dl.acm.org/doi/10.1145/131766.131776

For values of x far from 0 they recommend using symmetries and computing the integral using 1-x instead. However, this doesn’t help with the numerical problems that occur when x ≈ 1. I can’t see a way to substitute log(x) for x right now but I’ll keep thinking.

I managed to make use of log1m_exp() for the Poisson example I had and didn’t experience the same problems. So yeah, it most likely is the use of exp() that is the problem.


Stan already has inc_beta which is a simple wrapper for boost::math::ibeta (and also beta_cdf).
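
And if invBetaSpec were refactored to use the 1-x symmetry mentioned above (I_x(a,b) = 1 - I_{1-x}(b,a)), the complement could be formed stably straight from the log value, without ever materializing x. A sketch, where invBetaSpecComplement is a hypothetical variant taking 1-x:

real L = normal_lcdf3(alpha[n]);
// 1 - exp(L) equals -expm1(L), which stays accurate even when exp(L)
// would round to exactly 1.0
real one_minus_x = -expm1(L);
not_Logistic_alpha[n] = invBetaSpecComplement(one_minus_x);  // hypothetical
// a log-scale variant could use log1m_exp(L) = log(1 - exp(L)) directly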