Exception: poisson_log_lpmf: Log rate parameter is -nan, but must be not nan!

With the code below, I get a large number of divergent transitions and the exceptions listed below. The exceptions reported by cmdstanr differ from the warnings reported by rstan. A model with complete pooling ran without divergent transitions, but modelling the effects of the different grouping factors is important for understanding the data. (The data are simulated in Python.)
Other formula versions, such as “count ~ 0 + Intercept + (species||sday) + (species|g1|site), zi ~ species|g1|site”, give the same exceptions.
Do the exceptions with cmdstanr point to a problem with priors? If so, which prior?
I have done extensive diagnostics. For example, priorsense points to prior-likelihood conflict for b_zi_Intercept, sd_site_Intercept, cor_site__Intercept__species, and r_site[2,Intercept] (all sites), among others. I have tried weakening the priors, but the exceptions remain.
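For reference, the priorsense check was essentially the following (a minimal sketch; fit is the brmsfit created by the code further down):

library(priorsense)
# Power-scale the prior and the likelihood; parameters flagged as
# sensitive to both indicate potential prior-likelihood conflict.
powerscale_sensitivity(fit)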
The divergences in the pairs plot are mainly in the centre, not concentrated in any far corner. On the other hand, the R-hats and ESS values are OK, and all Pareto k < 0.7.
A plot comparing the data to posterior draws suggests the model predicts quite well (attached plot: b09-trial).

A simpler model with complete pooling did not give the same exceptions and produced only 4 divergent transitions after warmup. Its formula was: count ~ 0 + Intercept + sday + site + species, zi ~ species

With cmdstanr (sometimes poisson_log_lpmf, other times bernoulli_logit_lpmf):
Run 1
Exception: poisson_log_lpmf: Log rate parameter is -nan, but must be not nan! (in ‘/tmp/RtmpBfWwL1/model-75d152821d17.stan’, line 78, column 6 to line 80, column 54) (in ‘/tmp/RtmpBfWwL1/model-75d152821d17.stan’, line 173, column 6 to column 74)
Run 2
Exception: bernoulli_logit_lpmf: Logit transformed probability parameter is -nan, but must be not nan! (in ‘/tmp/RtmpoL6rYU/model-25537f7a97fb.stan’, line 78, column 6 to line 80, column 54) (in ‘/tmp/RtmpoL6rYU/model-25537f7a97fb.stan’, line 173, column 6 to column 74)

With rstan:
Warning: There were 830 divergent transitions after warmup. See
“Runtime warnings and convergence problems”
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems

The brms code:

zifamily = zero_inflated_poisson(link = "log", link_zi = "logit")
sp_formula = bf(count ~ 0 + Intercept + sday + (species|g1|site), zi ~ species|g1|site)
zip_prior <- c(set_prior("normal(0, 10)", class = "b"),
               set_prior("lkj(2)", class = "cor"),
               set_prior("student_t(3, 0, 5)", class = "sd"),
               set_prior("student_t(3, 0, 5)", class = "Intercept", dpar = "zi"),
               set_prior("student_t(3, 0, 5)", class = "sd", dpar = "zi"))

fit <-
  brm(data = my_data,
      family = zifamily,
      formula = sp_formula,
      prior = zip_prior,
      iter = 2000, warmup = 1000, thin = 1, chains = 4, cores = 4,
      # seed = 9,
      refresh = 0,
      backend = "cmdstanr",
      sample_prior = FALSE)

If it helps, this is the generated Stan code:

// generated with brms 2.20.4
functions {
  /* compute correlated group-level effects
   * Args:
   *   z: matrix of unscaled group-level effects
   *   SD: vector of standard deviation parameters
   *   L: cholesky factor correlation matrix
   * Returns:
   *   matrix of scaled group-level effects
   */
  matrix scale_r_cor(matrix z, vector SD, matrix L) {
    // r is stored in another dimension order than z
    return transpose(diag_pre_multiply(SD, L) * z);
  }
  /* zero-inflated poisson log-PDF of a single response
   * Args:
   *   y: the response value
   *   lambda: mean parameter of the poisson distribution
   *   zi: zero-inflation probability
   * Returns:
   *   a scalar to be added to the log posterior
   */
  real zero_inflated_poisson_lpmf(int y, real lambda, real zi) {
    if (y == 0) {
      return log_sum_exp(bernoulli_lpmf(1 | zi),
                         bernoulli_lpmf(0 | zi) + poisson_lpmf(0 | lambda));
    } else {
      return bernoulli_lpmf(0 | zi) + poisson_lpmf(y | lambda);
    }
  }
  /* zero-inflated poisson log-PDF of a single response
   * logit parameterization of the zero-inflation part
   * Args:
   *   y: the response value
   *   lambda: mean parameter of the poisson distribution
   *   zi: linear predictor for zero-inflation part
   * Returns:
   *   a scalar to be added to the log posterior
   */
  real zero_inflated_poisson_logit_lpmf(int y, real lambda, real zi) {
    if (y == 0) {
      return log_sum_exp(bernoulli_logit_lpmf(1 | zi),
                         bernoulli_logit_lpmf(0 | zi)
                         + poisson_lpmf(0 | lambda));
    } else {
      return bernoulli_logit_lpmf(0 | zi) + poisson_lpmf(y | lambda);
    }
  }
  /* zero-inflated poisson log-PDF of a single response
   * log parameterization for the poisson part
   * Args:
   *   y: the response value
   *   eta: linear predictor for poisson distribution
   *   zi: zero-inflation probability
   * Returns:
   *   a scalar to be added to the log posterior
   */
  real zero_inflated_poisson_log_lpmf(int y, real eta, real zi) {
    if (y == 0) {
      return log_sum_exp(bernoulli_lpmf(1 | zi),
                         bernoulli_lpmf(0 | zi) + poisson_log_lpmf(0 | eta));
    } else {
      return bernoulli_lpmf(0 | zi) + poisson_log_lpmf(y | eta);
    }
  }
  /* zero-inflated poisson log-PDF of a single response
   * log parameterization for the poisson part
   * logit parameterization of the zero-inflation part
   * Args:
   *   y: the response value
   *   eta: linear predictor for poisson distribution
   *   zi: linear predictor for zero-inflation part
   * Returns:
   *   a scalar to be added to the log posterior
   */
  real zero_inflated_poisson_log_logit_lpmf(int y, real eta, real zi) {
    if (y == 0) {
      return log_sum_exp(bernoulli_logit_lpmf(1 | zi),
                         bernoulli_logit_lpmf(0 | zi)
                         + poisson_log_lpmf(0 | eta));
    } else {
      return bernoulli_logit_lpmf(0 | zi) + poisson_log_lpmf(y | eta);
    }
  }
  // zero-inflated poisson log-CCDF and log-CDF functions
  real zero_inflated_poisson_lccdf(int y, real lambda, real zi) {
    return bernoulli_lpmf(0 | zi) + poisson_lccdf(y | lambda);
  }
  real zero_inflated_poisson_lcdf(int y, real lambda, real zi) {
    return log1m_exp(zero_inflated_poisson_lccdf(y | lambda, zi));
  }
}
data {
  int<lower=1> N; // total number of observations
  array[N] int Y; // response variable
  int<lower=1> K; // number of population-level effects
  matrix[N, K] X; // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1; // number of grouping levels
  int<lower=1> M_1; // number of coefficients per level
  array[N] int<lower=1> J_1; // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  vector[N] Z_1_2;
  int<lower=1> NC_1; // number of group-level correlations
  // data for group-level effects of ID 2
  int<lower=1> N_2; // number of grouping levels
  int<lower=1> M_2; // number of coefficients per level
  array[N] int<lower=1> J_2; // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_zi_1;
  vector[N] Z_2_zi_2;
  int<lower=1> NC_2; // number of group-level correlations
  int prior_only; // should the likelihood be ignored?
}
transformed data {
  
}
parameters {
  vector[K] b; // regression coefficients
  real Intercept_zi; // temporary intercept for centered predictors
  vector<lower=0>[M_1] sd_1; // group-level standard deviations
  matrix[M_1, N_1] z_1; // standardized group-level effects
  cholesky_factor_corr[M_1] L_1; // cholesky factor of correlation matrix
  vector<lower=0>[M_2] sd_2; // group-level standard deviations
  matrix[M_2, N_2] z_2; // standardized group-level effects
  cholesky_factor_corr[M_2] L_2; // cholesky factor of correlation matrix
}
transformed parameters {
  matrix[N_1, M_1] r_1; // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_1] r_1_1;
  vector[N_1] r_1_2;
  matrix[N_2, M_2] r_2; // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_2] r_2_zi_1;
  vector[N_2] r_2_zi_2;
  real lprior = 0; // prior contributions to the log posterior
  // compute actual group-level effects
  r_1 = scale_r_cor(z_1, sd_1, L_1);
  r_1_1 = r_1[ : , 1];
  r_1_2 = r_1[ : , 2];
  // compute actual group-level effects
  r_2 = scale_r_cor(z_2, sd_2, L_2);
  r_2_zi_1 = r_2[ : , 1];
  r_2_zi_2 = r_2[ : , 2];
  lprior += normal_lpdf(b | 0, 5);
  lprior += logistic_lpdf(Intercept_zi | 0, 1);
  lprior += normal_lpdf(sd_1 | 0, 5) - 2 * normal_lccdf(0 | 0, 5);
  lprior += lkj_corr_cholesky_lpdf(L_1 | 1);
  lprior += normal_lpdf(sd_2 | 0, 5) - 2 * normal_lccdf(0 | 0, 5);
  lprior += lkj_corr_cholesky_lpdf(L_2 | 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] zi = rep_vector(0.0, N);
    mu += X * b;
    zi += Intercept_zi;
    for (n in 1 : N) {
      // add more terms to the linear predictor
      mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_1_2[J_1[n]] * Z_1_2[n];
    }
    for (n in 1 : N) {
      // add more terms to the linear predictor
      zi[n] += r_2_zi_1[J_2[n]] * Z_2_zi_1[n]
               + r_2_zi_2[J_2[n]] * Z_2_zi_2[n];
    }
    for (n in 1 : N) {
      target += zero_inflated_poisson_log_logit_lpmf(Y[n] | mu[n], zi[n]);
    }
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(to_vector(z_1));
  target += std_normal_lpdf(to_vector(z_2));
}
generated quantities {
  // actual population-level intercept
  real b_zi_Intercept = Intercept_zi;
  // compute group-level correlations
  corr_matrix[M_1] Cor_1 = multiply_lower_tri_self_transpose(L_1);
  vector<lower=-1, upper=1>[NC_1] cor_1;
  // compute group-level correlations
  corr_matrix[M_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
  vector<lower=-1, upper=1>[NC_2] cor_2;
  // extract upper diagonal of correlation matrix
  for (k in 1 : M_1) {
    for (j in 1 : (k - 1)) {
      cor_1[choose(k - 1, 2) + j] = Cor_1[j, k];
    }
  }
  // extract upper diagonal of correlation matrix
  for (k in 1 : M_2) {
    for (j in 1 : (k - 1)) {
      cor_2[choose(k - 1, 2) + j] = Cor_2[j, k];
    }
  }
}

The difference between the cmdstanr exceptions and the rstan warnings is just because the two interfaces wrap the underlying exceptions from the Stan C++ layer slightly differently. The root causes will be the same because they run the same sampling code.

Divergences in general mean that the first-order approximation to the Hamiltonian dynamics that we compute with discrete time steps following gradients can’t keep up in regions of high curvature. We take a gradient step, land way outside where we should be, and things spiral out of control. One way to smooth out these regions of high curvature is a stronger prior, not a weaker prior.

For example, where you have poisson_log_lpmf: Log rate parameter is -nan, but must be not nan!, this can come from a divide-by-zero, which can arise through underflow. I don’t know brms, but what you can do is impose a prior on the log rate that keeps it away from zero. You can figure out where that is by finding the lines referenced in the error message. Here it looks like eta; in the code you pasted, the cited lines appear to be the log_sum_exp inside zero_inflated_poisson_log_logit_lpmf and the likelihood statement, where eta is mu[n].
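A sketch of matching the exception to the program, assuming brms’s stancode() returns the generated Stan code as a string:

# Print the generated Stan program with line numbers so the
# locations cited in the exception (lines 78 and 173 here) can
# be looked up directly.
code <- strsplit(as.character(brms::stancode(fit)), "\n")[[1]]
cat(sprintf("%3d: %s", seq_along(code), code), sep = "\n")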

The problem with our divergence plots is that they plot where a divergent trajectory started, not where it crossed into trouble.
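For example, with bayesplot (a sketch; the two parameter names here are illustrative, not taken from your output):

library(bayesplot)
# Overlay divergent iterations (highlighted) on a scatter of two
# parameters. Keep in mind these points mark where the divergent
# iteration was recorded, not where the trajectory ran into trouble.
np <- nuts_params(fit)
mcmc_scatter(as.array(fit),
             pars = c("b_Intercept", "sd_site__Intercept"),
             np = np)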

Stan can also run into divergences because it runs parameter values out too large in the tails, then underflows when mapping them back. You can often prevent this by moving from Cauchy or Student-t priors to normal priors.
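Translated to the brms code above, that would be something like the following (a sketch; since I don’t know brms, treat the scales as placeholders to check against prior predictive draws):

zip_prior2 <- c(set_prior("normal(0, 2)", class = "b"),
                set_prior("lkj(2)", class = "cor"),
                set_prior("normal(0, 1)", class = "sd"),
                set_prior("normal(0, 1)", class = "Intercept", dpar = "zi"),
                set_prior("normal(0, 1)", class = "sd", dpar = "zi"))

# update() recompiles and refits the model above with the new priors.
fit2 <- update(fit, prior = zip_prior2)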

It looks like brms is already applying a non-centered parameterization, which is where a lot of these problems arise in hand-written Stan programs.

Thanks, Bob.

This is useful advice. I had a case where there was a concentration of divergent transitions in a blob outside the main cloud of samples; I applied a stronger prior and the blob disappeared! I was worried that I was biasing the sampling, but I understand now that it was OK.

I will explore imposing a prior on the log rate that keeps it away from zero.
I increased adapt_delta to 0.99 in another run and there was only 1 divergence, although the exception occurred again. One other point: I applied a thinning of 4. Could that have removed divergent iterations from the output?
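For reference, the re-run was along these lines, with the divergences counted from the sampler diagnostics (a sketch):

# Refit with a higher target acceptance rate.
fit3 <- update(fit, control = list(adapt_delta = 0.99))

# Count divergent iterations recorded in the NUTS diagnostics.
np <- bayesplot::nuts_params(fit3)
sum(subset(np, Parameter == "divergent__")$Value)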