Divergent transitions and low ESS in Bessel Regression using real data

Hello! I am utilizing Bessel Regression to model mean essay scores from a standardized exam across various Brazilian municipalities. The response variable is bounded between 0 and 1, and my covariates are municipal indices (economy, health, and education) that also range from 0 to 1. While the overall municipal index is an average of these three components, I am including each as a separate covariate in the model. Similar to the Beta distribution, the Bessel regression is well-suited for modeling data on the (0, 1) interval. When testing the model with synthetic data, Stan successfully recovers the parameters; however, when applied to my real dataset, the following errors occur:

1: There were 19466 divergent transitions after warmup. See
https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them. 
2: There were 278 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
3: Examine the pairs() plot to diagnose sampling problems
 
4: The largest R-hat is 1.16, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat 
5: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess 
6: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess 

I have included the R and Stan code, along with the dataset, below for reference.

library(rstan)
library(tibble)

dados1 = c(t(dados_enem_2018[,-1]))
N = 188 

X = as.matrix(tibble(
  rep(1, N),
  c(t(Economia_mg_2018[, -1])),
  c(t(Educacao_mg_2018[, -1])),
  c(t(Saude_mg_2018[, -1]))
))
V = X

stan_data1 <- list(
  N = N,
  y = dados1,
  M = 4,
  X = X,
  V = V
)

##################################

fit1 <- stan(file = "Modelo_Bessel.stan", data = stan_data1,
                 iter = 10000, warmup = 5000, chains = 4, cores = 4)

Stan:


functions{
  
  real zeta(real z, real mu) {
    return sqrt(1 + (((z-mu)^2) / (z*(1-z))));
  }
  
  real bessel_lpdf(real z, real mu, real phi){
    return log(mu)+log(1-mu)+log(phi)+phi-log(pi())-(3.0/2.0)*(log(z)+log(1-z))+
    log(modified_bessel_second_kind(1, (phi*zeta(z,mu)))) - log(zeta(z,mu));
  }
  
}

data {
  int<lower=0> N;
  int<lower=0> M;
  vector[N] y;
  matrix[N,M] X;
  matrix[N,M] V;
}

parameters {
  vector[M] kap;
  vector[M] lam;
}

transformed parameters{
  vector<lower = 0, upper = 1>[N] mu;
  vector<lower = 0>[N] phi;
  for (i in 1:N){
    mu[i] = exp(X[i]*kap)/(1+exp(X[i]*kap));
    phi[i] = exp(V[i]*lam);
  }
}

model {
  kap ~ multi_normal(rep_vector(0, M), (10*identity_matrix(M)));
  lam ~ multi_normal(rep_vector(0, M), (10*identity_matrix(M)));
  for (i in 1:N){
    y[i] ~ bessel(mu[i],phi[i]);
  }
}

dados_enem_2018.csv (5.7 KB)
Economia_mg_2018.csv (3.5 KB)
Educacao_mg_2018.csv (3.5 KB)
Saude_mg_2018.csv (3.5 KB)

I’ve never heard of Bessel regression, but the warning messages are telling you that you’re having problems with the step size and varying curvature. Generally, if models don’t work, running for 40K iterations isn’t going to help. You need to tackle the underlying geometry problem. This may be coming from the Bessel function, which I don’t know anything about. I don’t even know if the log of the Bessel function is going to be stable with those arguments.

You can speed this up dramatically. For example, replace the sampling for kap and lam with:

kap ~ normal(0, 10);
lam ~ normal(0, 10);

This gives each member of the container an independent normal(0, 10) prior.

You can also speed up bessel_lpdf and make it more stable. The speed would come from taking the full vector arguments and doing vector operations rather than scalar ones; this compresses the automatic differentiation tree and can yield a big speedup.

The other thing you want to do is replace log(1 - mu) with log1m(mu) for numerical stability. You also don’t want to do things like evaluate zeta(z, mu) more than once as you do in the second line.
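Combining these suggestions, here is a sketch of what a vectorized, stabilized `bessel_lpdf` might look like (the `zeta` computation is inlined and evaluated once per observation; `modified_bessel_second_kind` is not vectorized in Stan, so that term keeps a loop):

```stan
functions {
  real bessel_lpdf(vector z, vector mu, vector phi) {
    int N = num_elements(z);
    // zeta computed once per observation, elementwise
    vector[N] zeta = sqrt(1 + square(z - mu) ./ (z .* (1 - z)));
    real lp = sum(log(mu)) + sum(log1m(mu)) + sum(log(phi)) + sum(phi)
              - N * log(pi())
              - 1.5 * (sum(log(z)) + sum(log1m(z)))
              - sum(log(zeta));
    // no vectorized form of the Bessel function, so loop here
    for (n in 1:N)
      lp += log(modified_bessel_second_kind(1, phi[n] * zeta[n]));
    return lp;
  }
}
```

With this signature the likelihood becomes a single statement, `y ~ bessel(mu, phi);`, instead of a loop of scalar calls.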

The transformed parameters can then be moved into the model block (presumably you don’t need to save them) and vectorized for efficiency, using the built-in inv_logit rather than coding it yourself; the built-in is more numerically stable and faster, and it avoids evaluating the exponential twice.

vector<lower=0, upper=1>[N] mu = inv_logit(X * kap);
vector<lower=0>[N] phi = exp(V * lam);

This winds up using efficient matrix arithmetic rather than having to pull out row vectors with an allocation and non-local copy as is currently going on in transformed parameters.
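Putting those pieces into the model block might look like the sketch below (keeping the original scalar bessel_lpdf; note that local variables in the model block cannot carry constraints, but since mu and phi are deterministic transforms, the bounds were only runtime checks anyway):

```stan
model {
  // efficient matrix arithmetic instead of row-by-row extraction
  vector[N] mu = inv_logit(X * kap);
  vector[N] phi = exp(V * lam);
  kap ~ normal(0, 10);
  lam ~ normal(0, 10);
  for (i in 1:N)
    y[i] ~ bessel(mu[i], phi[i]);
}
```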

You would probably get more stability working with log_mu and log_phi directly, because the density ends up taking logs of mu and phi anyway. We have a stable log1m_exp function that will help.
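Stan’s composed functions can produce these log-scale quantities directly without ever exponentiating and re-logging (variable names here are illustrative):

```stan
vector[N] eta = X * kap;
vector[N] log_mu = log_inv_logit(eta);      // log(mu), stable for large |eta|
vector[N] log1m_mu = log1m_inv_logit(eta);  // log(1 - mu), stable
vector[N] log_phi = V * lam;                // log(phi) with no exp() at all
```

These could then feed a log-scale version of the density, avoiding the exp-then-log round trip in the current code.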

My overall suggestion would be to try something simpler than Bessel regression first and get the regression working, then see if you can swap in the Bessel thing.
