NaN E-BFMI warnings

Hello,

The latest version of my model has started throwing a "2 of 4 chains have a NaN E-BFMI" warning. Can anybody provide any tips to help me track down the problem (let alone fix it!)?

My current modelling pipeline involves fitting the model to the first half of my data, and then using a generated_quantities call to make predictions for the held-out second half of the data.
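
In cmdstanr terms, the pipeline is roughly the following (simplified, with placeholder object names; mod_gq here stands for a second program containing the generated quantities block):

fit <- mod$sample(data = train_data)                                         # fit to the first half of the data
preds <- mod_gq$generate_quantities(fitted_params = fit, data = test_data)   # predict the held-out second half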

If I inspect the fitted model object, I get the following summary for the main parameters:

variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -1.5e+04 -1.4e+04 3.16 0.00 -1.4e+04 -1.4e+04 1 2019 NA
b_a[1] 0.0e+00 0.0e+00 0.05 0.04 -7.0e-02 7.0e-02 1 2268 1249
b_a[2] 1.7e+00 1.7e+00 0.53 0.50 8.3e-01 2.6e+00 1 1205 1398
b_stick[1] 6.9e-01 6.9e-01 0.21 0.21 3.4e-01 1.0e+00 1 720 1194
b_stick[2] 2.9e+00 2.9e+00 0.41 0.39 2.2e+00 3.5e+00 1 1404 1444
rho_delta[1] 1.2e+00 1.2e+00 0.08 0.07 1.1e+00 1.4e+00 1 859 1154
rho_delta[2] 9.1e-01 9.1e-01 0.03 0.03 8.7e-01 9.5e-01 1 1462 1369
rho_psi[1] -2.1e+00 -2.1e+00 0.24 0.24 -2.5e+00 -1.7e+00 1 1136 1292
rho_psi[2] -1.1e+00 -1.1e+00 0.14 0.13 -1.3e+00 -8.8e-01 1 1812 1694
sigma_u[1] 9.0e-02 8.0e-02 0.06 0.06 1.0e-02 1.9e-01 1 594 1010
sigma_u[2] 8.3e-01 8.1e-01 0.15 0.14 6.2e-01 1.1e+00 1 1637 1582
sigma_u[3] 3.0e-01 3.0e-01 0.06 0.06 2.2e-01 4.1e-01 1 1276 1326
sigma_u[4] 8.9e-01 8.6e-01 0.21 0.19 6.2e-01 1.3e+00 1 1241 1485
sigma_u[5] 2.1e+00 2.1e+00 0.45 0.42 1.5e+00 3.0e+00 1 1333 1594
sigma_u[6] 1.6e+00 1.6e+00 0.33 0.30 1.2e+00 2.2e+00 1 1470 1139
sigma_u[7] 9.0e-02 9.0e-02 0.02 0.02 6.0e-02 1.4e-01 1 1650 1557
sigma_u[8] 4.3e-01 4.2e-01 0.14 0.14 2.1e-01 6.8e-01 1 924 763

If I look further down, there are a load of NA values for my variance-covariance matrix, but I am assuming this is normal behaviour?

|L_u[1,1] | 1.0e+00| 1.0e+00| 0.00| 0.00| 1.0e+00| 1.0e+00| NA| NA| NA|
|L_u[2,1] | -1.2e-01| -1.3e-01| 0.29| 0.30| -5.9e-01| 3.7e-01| 1| 400| 650|
|L_u[3,1] | 0.0e+00| 0.0e+00| 0.29| 0.29| -4.7e-01| 4.7e-01| 1| 437| 738|
|L_u[4,1] | -1.3e-01| -1.4e-01| 0.28| 0.30| -5.8e-01| 3.3e-01| 1| 551| 923|
|L_u[5,1] | -9.0e-02| -1.0e-01| 0.27| 0.28| -5.3e-01| 3.7e-01| 1| 537| 750|
|L_u[6,1] | 1.8e-01| 2.0e-01| 0.28| 0.28| -3.1e-01| 6.1e-01| 1| 561| 727|
|L_u[7,1] | 4.0e-02| 5.0e-02| 0.29| 0.32| -4.5e-01| 5.2e-01| 1| 704| 1195|
|L_u[8,1] | -5.0e-02| -5.0e-02| 0.30| 0.31| -5.3e-01| 4.7e-01| 1| 893| 1385|
|L_u[1,2] | 0.0e+00| 0.0e+00| 0.00| 0.00| 0.0e+00| 0.0e+00| NA| NA| NA|
|L_u[2,2] | 9.5e-01| 9.7e-01| 0.07| 0.04| 8.0e-01| 1.0e+00| 1| 790| NA|

If I check the test set results, I find that the accuracy is high and the model is doing a good job at predicting behaviour.

My r-hat statistics are generally good: 99% of them are < 1.01, and the largest value is around 1.02.

My model is a little complex but the main code is below.

As always, I really appreciate this community.

parameters {

  // these are the parameters we wish to fit to the data

  ////////////////////////////////////
  // fixed effects
  ////////////////////////////////////
  array[K] real b_a; // weights for class A compared to B  
  array[K] real b_stick; // stick-switch rates 
  array[K] real<lower = 0> rho_delta; // distance tuning
  array[K] real rho_psi; // direction tuning

  ///////////////////////////////
  // random effects
  ///////////////////////////////
  // random effect variances: 
  // 4*K as we have four fixed effect parameters x K conditions
  vector<lower=0>[4*K] sigma_u;
  // declare L_u to be the Cholesky factor of a correlation matrix
  cholesky_factor_corr[4*K] L_u;
  // random effect matrix
  matrix[4*K,L] z_u; 
  
}

transformed parameters {

  /* 
  combine fixed and random effects
  we do this here so that the code in the model{} block is easier to read
  */

  // this transforms the random effects so that they have the
  // correlation structure specified by the Cholesky factor L_u above
  matrix[4*K, L] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u;

  // create empty arrays for everything
  array[K] vector[L] u_a, u_stick, u_delta, u_psi;
  // calculate
  for (kk in 1:K) {
    u_a[kk]     = to_vector(b_a[kk]       + u[4*(kk-1)+1]);
    u_stick[kk] = to_vector(b_stick[kk]   + u[4*(kk-1)+2]);
    u_delta[kk] = to_vector(rho_delta[kk] + u[4*(kk-1)+3]);
    u_psi[kk]   = to_vector(rho_psi[kk]   + u[4*(kk-1)+4]);
  }
}

model {

  /////////////////////////////////////////////////////
  // Define Priors
  ////////////////////////////////////////////////////

  // priors for random effects
  sigma_u ~ exponential(1);
  L_u ~ lkj_corr_cholesky(1.5); // LKJ prior for the correlation matrix
  to_vector(z_u) ~ normal(0, 1); // centred prior for random effects, so this should always be N(0,1)

  // priors for fixed effects
  for (kk in 1:K) {
    target += normal_lpdf(b_a[kk]       | prior_mu_b_a, prior_sd_b_a);
    target += normal_lpdf(b_stick[kk]   | prior_mu_b_stick, prior_sd_b_stick);
    target += normal_lpdf(rho_delta[kk] | prior_mu_rho_delta, prior_sd_rho_delta);
    target += normal_lpdf(rho_psi[kk]   | prior_mu_rho_psi, prior_sd_rho_psi);
  }

  // create some variables
  vector[n_targets] weights;
  int t, z, x; // trial, person, and condition

  //////////////////////////////////////////////////
  // step through the data row by row and define the log likelihood
  //////////////////////////////////////////////////  
  for (ii in 1:N) {

    t = trial[ii];
    z = Z[t];
    x = X[t];
 
    weights = compute_weights(
      u_a[x, z], u_stick[x, z], u_delta[x, z], u_psi[x, z],
      to_vector(item_class[t]), S[ii], delta[ii], psi[ii],
      found_order[ii], n_targets, remaining_items[ii],
      d0); 

    // get likelihood of item selection
    target += log(weights[Y[ii]]);
   
  }
}

The E-BFMI is calculated as:

    efbmi_per_chain <- apply(energy, 2, function(x) {
      (sum(diff(x)^2) / length(x)) / stats::var(x)
    })

where energy is an iterations x chains matrix of energy__ values (so inside the apply() call, x is the vector of energies for a single chain). That is, it divides the mean squared difference of the energy between consecutive iterations by the variance of the energy. To get NaN I think you’d have to have both the numerator and the denominator be 0 (if just the denominator is 0 then I think R would call it Inf, but if both are 0 it’s NaN).
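
As a quick toy check (made-up numbers, nothing to do with your model), a chain whose energy never changes gives exactly that 0/0:

energy_toy <- cbind(rnorm(100, 0, 2),  # a chain whose energy varies
                    rep(5, 100))       # a chain whose energy is constant
apply(energy_toy, 2, function(x) (sum(diff(x)^2) / length(x)) / stats::var(x))
# the second (constant) chain gives 0/0, which R reports as NaN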

What do you get if you directly compute the variance of the energy__ diagnostic values?

energy <- posterior::extract_variable_matrix(fit$sampler_diagnostics(), "energy__")
apply(energy, 2, var) # each column is a chain

Also, it’s suspicious that lp__ has an ess_tail of NA. My guess is that this is related to the E-BFMI issue. If the ESS for the tail of the log probability distribution can’t be computed reliably, that seems consistent with a situation where the energy values show little to no variation.

I think this probably means that the sampler is having trouble exploring the tails. Maybe the log‐density in the tails is unusually flat or the chains are somehow getting stuck in an area of almost no curvature? I guess it’s possible that the step size is actually too small so that each trajectory results in very small changes (you could potentially verify that by checking if accept_stat__ is quite close to 1, which can indicate the step size is too small). Are you using the default adapt_delta value or did you increase it to avoid divergences? If adapt_delta is high to avoid divergences in high curvature areas that might produce a step size that is too small for the tail geometry.
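
For example, you could look at the per-chain average of accept_stat__ (same kind of extraction as the energy__ code above):

accept_stat <- posterior::extract_variable_matrix(fit$sampler_diagnostics(), "accept_stat__")
apply(accept_stat, 2, mean)  # per-chain means very close to 1 can point to an overly small step size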

Anyway, I’m just speculating here and I could be wrong about what’s going on (I haven’t run into the NaN E-BFMI situation, or at least not recently enough to remember it). Hopefully other people have some ideas too.

Also just speculating, but R will also return NaN for Inf/Inf, which would happen if the energy ever overflows at any iteration. But I don’t think I’ve ever seen this come close to happening, nor is it clear to me whether/how it would be possible for the sampler to experience this without complaining, so perhaps other explanations are more likely.

This is typical if you have a correlation matrix, where the diagonal values are pinned to 1. For Cholesky factors, I’m not sure why L_u[1, 1] is pinned to 1 for all draws, hence the zero standard deviation. That’s where the NA values are coming from.

I don’t know how compute_weights works, but seeing that you’re doing a log of the whole thing at the end, it brings up the question of whether the function could just implement the log weights—it’s usually more stable to always work on the log scale.

P.S. I would suggest not prefixing parameter names with prior_ because it just makes the code harder to read. I would also get rid of the ///// banner comments—they just obscure the actual code. One generally doesn’t need comments like the one for L_u—the declaration already says it’s a Cholesky factor of a correlation matrix. You can declare and define in the same line, so, for example, the definition of u can be a one-liner.
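
For example, the declaration and definition of u collapse into a single line:

matrix[4*K, L] u = diag_pre_multiply(sigma_u, L_u) * z_u;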

Hello,

Thank you both very much for your responses. It sounds like I am going to have to upskill further and learn more about the technical side of things.

> energy <- posterior::extract_variable_matrix(m$sampler_diagnostics(), "energy__")
> apply(energy, 2, var) # each column is a chain
       1        2        3        4 
20.00000 39.91984  0.00000  0.00000 

you could potentially verify that by checking if accept_stat__ is quite close to 1, which can indicate the step size is too small

> m$sampler_diagnostics(format = "df")
# A draws_df: 500 iterations, 4 chains, and 6 variables
   treedepth__ divergent__ energy__ accept_stat__ stepsize__ n_leapfrog__
1            7           0    14600          0.95      0.045          127
2            7           0    14600          1.00      0.045          127
3            7           0    14600          0.99      0.045          127
4            7           0    14600          0.82      0.045          127
5            7           0    14600          0.93      0.045          127
6            7           0    14600          0.98      0.045          127
7            8           0    14600          0.97      0.045          383
8            7           0    14600          0.96      0.045          127
9            7           0    14600          0.86      0.045          127
10           7           0    14600          0.56      0.045          127
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
treedepth__ 6.969500e+00 7.000e+00 0.1886466 0.0000000 7.00e+00 7.0000e+00 1.004962 1808.681292 2022.441
divergent__ 0.000000e+00 0.000e+00 0.0000000 0.0000000 0.00e+00 0.0000e+00 NA NA NA
energy__ 1.460005e+04 1.460e+04 3.8736291 0.0000000 1.46e+04 1.4600e+04 1.000203 2014.542814 2018.530
accept_stat__ 9.155375e-01 9.470e-01 0.0930245 0.0607866 7.33e-01 9.9705e-01 1.002808 1975.383394 2023.318
stepsize__ 4.385000e-02 4.455e-02 0.0016923 0.0008154 4.10e-02 4.5300e-02 Inf 4.065041 NA
n_leapfrog__ 1.334960e+02 1.270e+02 31.5045751 0.0000000 1.27e+02 2.5500e+02 1.003856 2061.588146 2018.530

Are you using the default adapt_delta value or did you increase it to avoid divergences? If adapt_delta is high to avoid divergences in high curvature areas that might produce a step size that is too small for the tail geometry.

We haven’t changed adapt_delta. Are you suggesting that I try a lower value? I’ll give it a shot.

I have managed to recreate the NA lp__ ess_tail issue using some simulated data (although not the E-BFMI warnings). I will have a go at varying adapt_delta and see if that improves things.

This is typical if you have a correlation matrix, where the diagonal values are pinned to 1. For Cholesky factors, I’m not sure why L_u[1, 1] is pinned to 1 for all draws, hence the zero standard deviation. That’s where the NA values are coming from.

Are you suggesting that there is a better way to specify random effects here? I’ve mainly been following some example tutorial code here [Bayesian multilevel models using R and Stan (part 1) | Matteo Lisi] on how to set this up.

I don’t know how compute_weights works, but seeing that you’re doing a log of the whole thing at the end, it brings up the question of whether the function could just implement the log weights—it’s usually more stable to always work on the log scale.

Most of the weight computations are done on the log scale (although it is always possible I have made a dumb mistake). The issue is that I haven’t yet worked out a solution for the last stage:

vector compute_weights(
    real u_a, real u_s, real u_delta, real u_psi,
    vector item_class, vector match_prev_item, vector delta, vector psi,
    int n, int n_targets, vector remaining_items,
    real d0) {

    vector[n_targets] weights;
    
    // set the (log) weight of each target to be its class weight
    weights = log_inv_logit(u_a * to_vector(item_class));

    // multiply the weights by the stick/switch preference (an addition on the log scale)
    weights += log_inv_logit(u_s * match_prev_item); 

    // add the (log) spatial weights
    weights += compute_spatial_weights(
      n, n_targets, 
      u_delta, u_psi, 
      delta, psi,
      d0);
        
    // remove already-selected items, and standardise so that the weights sum to 1 
    weights = standarise_weights(exp(weights), n_targets, remaining_items); 

    return(weights);

  }

vector standarise_weights(vector w, int n_targets, vector remaining_items) {

    /* set weights of found items to 0 and divide by the sum of 
    remaining weights so that they sum to 1 */
    vector[n_targets] w_s = w .* remaining_items;  
    w_s = w_s / sum(w_s);
    return(w_s);
  }

In short, I am modelling a cognitive psychology foraging task in which people select a number of items one at a time (sampling without replacement). I am keeping track of which items have been selected so far with remaining_items. Items that have already been selected cannot be re-selected, so they have 0 probability weight, which doesn’t play nicely with taking the log.

I suppose one way around this would be to simply remove each item as it is selected, so that as we step through the data, the weights vector decreases in length? I would need an additional index variable to keep track of which item is which in my weights array, and that sounds a little fiddly. But maybe the benefit of avoiding the 0s is worthwhile?

Not necessarily. I just don’t understand why you would get an unmoving 1 value if it’s not a correlation matrix diagonal.

It’s unstable to use exp() like this here. A first step would be to subtract the maximum weight.

Also, this is called “normalization” when enforcing a simplex structure. It’s called “standardization” when you shift and scale a distribution so it has mean zero and variance one.

normalized_weights = exp(log_weights) / sum(exp(log_weights));

But this is unstable because the exp() operations can easily overflow or underflow and you lose precision round tripping through exp/log.

The preferable alternative is to just stay on the log scale as follows, which works out to the log of what you did and won’t require an additional log on the outside.

log_normalized_weights = log_weights - log_sum_exp(log_weights);

That operation should be stable and it keeps you on the log scale. The equivalent operation to setting a weight to zero is setting its log weight to negative_infinity(), but you really don’t want to do that explicitly as it will upset autodiff if you try to use the values. So just normalize everything else, set the remaining values to negative_infinity(), and it should be good to go. You are already taking logs of zeros in what you’re doing with the log on the outside—that should be fine as long as it doesn’t infect autodiff.
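
Here’s a rough, untested sketch of what a log-scale replacement for your standarise_weights could look like (the function name is just illustrative, and it assumes remaining_items is 0/1 data):

vector normalise_log_weights(vector log_w, int n_targets, vector remaining_items) {
  // already-selected items keep probability zero, i.e. a log weight of -inf
  vector[n_targets] log_w_out = rep_vector(negative_infinity(), n_targets);
  vector[n_targets] masked = log_w;
  real log_total;

  // mask out already-selected items; exp(-inf) = 0, so they simply drop out of log_sum_exp
  for (i in 1:n_targets) {
    if (remaining_items[i] == 0) {
      masked[i] = negative_infinity();
    }
  }
  log_total = log_sum_exp(masked);

  // normalise the still-available items on the log scale
  for (i in 1:n_targets) {
    if (remaining_items[i] == 1) {
      log_w_out[i] = log_w[i] - log_total;
    }
  }
  return log_w_out;
}

Then compute_weights can skip the exp() and return log weights directly, and the likelihood statement becomes target += weights[Y[ii]]; (with weights now holding log weights) without the outer log().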

Cool. The only other person I’ve seen talk about this task is @richard_mcelreath, who was talking about career trajectories for hunters vs. gatherers at a StanCon ages ago.

lp__ has an unusually skewed distribution, as the mean is far away from the 90% central interval and the sd is much higher than the mad.

Combining this with the information from the lp__ rhat, ess_bulk, and ess_tail indicates that the chains are otherwise mixing, but that two chains additionally have some extremely high lp__ values, which also shows up in the variance of the energy. Maybe use bayesplot::mcmc_trace() to look at the trace of lp__?
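
For example, something like this (using the cmdstanr fit object from your earlier output):

bayesplot::mcmc_trace(m$draws("lp__"))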

How many iterations of warmup and post-warmup were you running?

Super helpful.

I’m glad there’s a way to improve that function.

One question I have is how best to deal with the 0s. Let me explain:

At present, the model computes weights w for every item in the display (these are computed from four parameters that are estimated from the data). If there were five items, I could have something like w = [2, 3, 1, 5, 2]. However, as a trial unfolds, these items get selected by the human participant and so cannot be selected again. I’m tracking this in a binary vector named remaining_items. So, to continue with our toy example, this might be r = [0, 1, 1, 0, 1].

At present, the model takes w .* r to get [0, 3, 1, 0, 2], and then normalises this vector to get the weights. When I do the log transform as the final step, the log(0) values aren’t a problem, as they are never accessed.

I understand the reasoning for doing all of this on the log scale. However, I’m unsure how log_sum_exp(w) is going to work when I have a load of 0 values in w.

One obvious way to avoid these issues is to only ever compute w for the items that are still eligible to be selected. I initially avoided this as I thought keeping w a constant size would be faster than shrinking the length of w by one each time we step through a trial. (We’d also have to track the item IDs to know which item is which.) Perhaps this is the only way to make everything work smoothly, though?

log_sum_exp(w) shouldn’t be bothered by zeros in w since log_sum_exp(w) would be computed as

max <- max(w)
sum <- sum(exp(w - max))
max + log(sum)

so nowhere is the log of the elements of w computed.

Sorry,
I didn’t explain things very well…

I thought the aim of log_sum_exp() was to allow everything to be done on the log scale. I have 0s in my w vector, which then give me -Inf in log(w).

I should have asked how log_sum_exp handles the -Inf values.

It should “ignore” it since exp(-Inf) will be 0. So if x1 is a vector of finite numbers and x2 is the same as x1 but also with some -Inf terms then log_sum_exp(x1) is equal to log_sum_exp(x2).

In the definition of log_sum_exp(w)

max <- max(w)
sum <- sum(exp(w - max))
max + log(sum)

the -Inf wouldn’t contribute anything to the sum term because the exp of it is 0.

Edit: I haven’t tested this with Stan’s implementation (I assume it’s similar), but if you define an R function using the definition above you can play around with it and see how it handles various kinds of inputs.

Also, I should have said that this is used for numerical stability. You can think of it conceptually as just doing log(sum(exp(w))).
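
For example, a quick-and-dirty R version of that definition (just for experimenting; it's not Stan's actual implementation):

log_sum_exp <- function(w) {
  m <- max(w)
  m + log(sum(exp(w - m)))
}
log_sum_exp(c(1, 2, 3))              # 3.407606
log_sum_exp(c(1, 2, 3, -Inf, -Inf))  # same value: the -Inf terms contribute exp(-Inf) = 0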

Huh,

that is all - if it works - surprisingly easy. I clearly still don’t have a good feeling for what is easy and what is hard in Stan :)

I’ll do some tests and report back.

Hello, a brief update:

I have made a whole load of changes based on the advice above. We no longer get the NaN E-BFMI warning, but the lp__ ess_tail is still NA.

The traceplot for lp__ seems very odd:

I can contrast this with the traceplots for the four (x 2) main parameters in my model:

At present, I am using 500 iterations for both warmup and sampling (and the default adapt_delta, etc.).

A few more details:

We are currently testing this model on three datasets (we also have a data simulation pipeline for testing). The lp__ NA ess_tail issue appears when we fit to two of our datasets, but not the third.

For the dataset that doesn’t show any problems, the lp__ traceplot looks like this:

I’ll do my best to find a set of simulation parameters that lead to the issue.

Both lp__ traceplots are strange, as it seems there are two modes and each mode is very narrow. I had a quick look at the code, but it’s complex enough that I couldn’t spot the part that would explain the strong bimodality; I suspect there is some error in there.

Thanks.

I’ll have a hunt. I’ve tested the “simple” version of my model (i.e., without the random effect structure, running on a small simulated dataset) and obtained:

This appears to be a lot better, but I’m not all that familiar with how to judge these plots. I can see that it is a bit more skewed than the textbook examples of “good traceplots”.

This looks fine. The posterior of lp__ is almost always skewed.

OK, good to know. I will see if I can find a minimal working example that generates the odd bimodal behaviour.

It doesn’t seem to be my multi-level model, as that seems to work ok on simulated data.

I suppose there might be some ceiling effects in parameter estimates.