I’ve been investigating the use of link functions to improve the accuracy of posterior estimation in rare event problems. I have already successfully generated results with Stan for a Poisson regression problem using an inverse gamma link function.
I am now tackling the problem of using an inverse beta link function in a logistic regression problem. Here is my code:
functions {
  real invBetaSpec(real x);
  real normal_lcdf3(real x);
}
data {
  int N;          // Num trials
  int M;          // Num covariates
  int y[N];       // Observations
  matrix[M, N] X; // Covariates
}
parameters {
  row_vector[M] beta;
  real beta0;
}
transformed parameters {
  row_vector[N] alpha;
  vector[N] not_Logistic_alpha;
  alpha = beta * X;
  alpha += beta0;
  for (n in 1:N) {
    not_Logistic_alpha[n] = invBetaSpec(exp(normal_lcdf3(alpha[n])));
  }
}
model {
  beta ~ std_normal();
  beta0 ~ std_normal();
  y ~ bernoulli(not_Logistic_alpha);
}
The problem I get is that at least one of the chains usually gets stuck with a small step size during the warmup adaptation period, especially for large N (>100). This results in sampling times of several hours, followed by lots of max_treedepth warnings and poor mixing. On lucky occasions, and with small N, each chain adapts to a suitably large step size and the sampling finishes within a minute or so without any warnings. Reducing adapt_delta helps in some cases, but the problem still occurs for large N.
To help speed things up I have already written the C++ functions invBetaSpec and normal_lcdf3, which implement the inverse beta and normal log CDF for a specific set of parameters. In both cases the gradient is also defined (for invBetaSpec this is via a lookup table, as an analytic gradient doesn’t exist). However, none of this seemed to improve the situation.
The equivalent model using bernoulli_logit() works without problems on the same data, even for large N. I thought the issue might be with the topology of the problem when using bernoulli(), so I wrote code for the equivalent model without the custom link function:
data {
  int N;          // Num trials for each group
  int M;          // Num covariates
  int y[N];       // Observations
  matrix[M, N] X; // Covariates
}
parameters {
  row_vector[M] beta;
  real beta0;
}
transformed parameters {
  row_vector[N] alpha;
  vector[N] logistic_alpha;
  alpha = beta * X;
  alpha += beta0;
  for (n in 1:N) {
    logistic_alpha[n] = 1 / (1 + exp(-alpha[n]));
  }
}
model {
  beta ~ std_normal();
  beta0 ~ std_normal();
  y ~ bernoulli(logistic_alpha);
}
This model runs without any problems. So it seems that the link function alone is somehow responsible for the issue with step size selection during adaptation. Any ideas why this might be and how I might fix it?
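For reference, a minimal sketch of the bernoulli_logit() model block I’m comparing against (it reuses the same data, parameters, and alpha as above, and works directly on the log-odds scale, so no exp() is needed):

model {
  beta ~ std_normal();
  beta0 ~ std_normal();
  // bernoulli_logit applies the logistic link internally,
  // staying in log space for numerical stability
  y ~ bernoulli_logit(alpha');
}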
Difficulties in step size adaptation may arise if the function is discontinuous or the gradient calculation is not accurate enough. You say the gradient comes from a lookup table, so I guess that’s an obvious suspect here.
I think Stan has some functionality for comparing the autodiffed gradient to finite differences. CmdStan has a diagnose method and PyStan says something about a test_grad argument, but none of that seems to be documented.
Good catch. It’s difficult to comment more as the code for the functions isn’t visible.
Can you post a traceplot of the warmup iterations for some of the parameters?
If the problem is not due to discontinuities, then the diagnostics from campfire - New adaptive warmup proposal (looking for feedback)! - might be able to tell more, or at least the adaptation might work better. If you try campfire, post the diagnostic results here and @bbbales2 can interpret them for you.
Sorry, I should have been clearer about the implementation. Originally the invBetaSpec function didn’t have a defined gradient, and the issue was still present. I implemented the gradient in the hope that it might speed up the sampling enough that I could work around the issue by simply generating more samples.
I’ll have a look at the campfire diagnostics, but as requested here is the traceplot of the warmup:
Have you looked at what kind of values you get with the initial values being in the range [-2, 2]? Can the initial value for beta0 be such that there are numerical problems in invBetaSpec(exp(normal_lcdf3(alpha[n])))?
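One quick way to check is to print the intermediate values from inside the loop, e.g. something like this sketch:

for (n in 1:N) {
  real log_x = normal_lcdf3(alpha[n]);
  not_Logistic_alpha[n] = invBetaSpec(exp(log_x));
  // flag inputs where exp() has rounded to exactly 1
  if (exp(log_x) == 1)
    print("n = ", n, " alpha = ", alpha[n], " log_x = ", log_x);
}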
I had spent quite a bit of time writing a new implementation of the normal_lcdf for the purpose of this problem (see https://github.com/stan-dev/math/pull/1411) and was confident that there would no longer be any numerical issues.
The max value of normal_lcdf3(alpha[n]) was -8.727682e-17 and the minimum was -45.79481. Over the whole warmup period the maximum values for each chain were:
Hmm… so it looks like the floating point arithmetic is running out of precision to distinguish input values close to 0. Do you think this might be the culprit?
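To illustrate, a throwaway snippet: on IEEE doubles the representable values just below 1.0 are spaced roughly 1.1e-16 apart, so the whole range of lcdf values in (-1e-16, 0) collapses onto at most a couple of distinct values of exp():

generated quantities {
  real x1 = exp(-8.727682e-17); // your reported max: within one ulp of 1.0
  real x2 = exp(-1e-18);        // rounds to exactly 1.0
  real x3 = exp(-45.79481);     // ~1.3e-20: no precision problem down here
}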
UPDATE:
I added another variable and attached an appropriate upper limit to it.
functions {
  real invBetaSpec(real x);
  real normal_lcdf3(real x);
}
data {
  int N;          // Num trials
  int M;          // Num covariates
  int y[N];       // Observations
  matrix[M, N] X; // Covariates
}
parameters {
  row_vector[M] beta;
  real beta0;
}
transformed parameters {
  row_vector[N] alpha;
  vector[N] not_Logistic_alpha;
  real<upper=-1e-7> exp_val;
  alpha = beta * X;
  alpha += beta0;
  for (n in 1:N) {
    exp_val = normal_lcdf3(alpha[n]);
    not_Logistic_alpha[n] = invBetaSpec(exp(exp_val));
  }
}
model {
  beta ~ std_normal();
  beta0 ~ std_normal();
  y ~ bernoulli(not_Logistic_alpha);
}
Just tried running it and I’m still waiting for the first 100 warmup samples, so I’m assuming that wasn’t the cause!
I was suspecting exp, and based on the values there is not enough accuracy close to 1. Can you rewrite normal_lcdf3 so that it would take the log value, and possibly use numerically more stable computations inside?
Yeah what Aki said. Doing anything on the probability scale can cause problems. The bernoulli_logit thing is there to avoid doing exps as much as possible (the regular bernoulli thing breaks all the time).
I’m guessing you meant rewriting invBetaSpec() to get rid of the need for exp(). The function is simply a wrapper for the ibeta() function from Boost. I’ve had a look at how that is implemented and they refer to this paper: https://dl.acm.org/doi/10.1145/131766.131776
For values of x far from 0 they recommend using symmetries and computing the integral using 1 - x instead. However, this doesn’t help with the numerical problems that occur when x ≈ 1. I can’t see a way to substitute log x for x right now, but I’ll keep thinking.
I managed to make use of log1m_exp() for the Poisson example I had and didn’t experience the same problems. So yeah, it most likely is the use of exp() that is the problem.
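Along the same lines, here is a rough sketch of how the 1 - x symmetry might still pay off here (invBetaSpecSym is hypothetical, not an existing function; internally it could use ibeta_inv(a, b, x) = 1 - ibeta_inv(b, a, 1 - x)). The key point is that 1 - exp(log_x) can be formed stably with expm1 even when exp(log_x) itself rounds to 1:

functions {
  // hypothetical variant of invBetaSpec that takes 1 - x instead of x
  real invBetaSpecSym(real one_minus_x);
  real normal_lcdf3(real x);
}
...
for (n in 1:N) {
  real log_x = normal_lcdf3(alpha[n]);
  // -expm1(log_x) equals 1 - exp(log_x) but stays accurate
  // even when exp(log_x) is within a few ulps of 1
  not_Logistic_alpha[n] = invBetaSpecSym(-expm1(log_x));
}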