How to address warnings about low ESS and low BFMI?

We can ignore the step_size parameter; it is just the initial step size used during adaptation.

It implies that the maximum tree depth should be increased, as NUTS needs a more extensive tree exploration to capture the curvature of the posterior effectively.

rstan was started with:

zrn_fit <- sampling(  
    smod
  , data    = zrn_dat
  , iter    = 1000
  , chains  = 2
  , seed    = 12
  , control = list(
      adapt_delta   = 0.99
    , max_treedepth = 16
  )
)

However, the following messages were shown:

Warning messages:
1: The largest R-hat is NA, indicating chains have not mixed.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#r-hat 
2: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#bulk-ess 
3: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
https://mc-stan.org/misc/warnings.html#tail-ess

We should increase the number of iterations.
The checks indicate the run looks good. I don't know why rstan shows
a different message than cmdstanr.

> get_bfmi(zrn_fit)
[1] 0.6352896 0.6315200
> check_hmc_diagnostics(zrn_fit)

Divergences:
0 of 1000 iterations ended with a divergence.

Tree depth:
0 of 1000 iterations saturated the maximum tree depth of 16.

Energy:
E-BFMI indicated no pathological behavior.
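For reference, here is a minimal sketch of re-running with more iterations and then looking at which parameters drive the low ESS (assuming smod and zrn_dat from the call above; zrn_fit2 is just an illustrative name):

library(rstan)

# Re-run with more iterations and chains; 2000 iterations per chain is
# often already enough for Stan.
zrn_fit2 <- sampling(
    smod
  , data    = zrn_dat
  , iter    = 2000
  , chains  = 4
  , seed    = 12
  , control = list(
      adapt_delta   = 0.99
    , max_treedepth = 16
  )
)

# Per-parameter effective sample size and R-hat; the rows with the
# smallest n_eff show which parameters cause the ESS warnings.
fit_summary <- summary(zrn_fit2)$summary
head(fit_summary[order(fit_summary[, "n_eff"]), c("n_eff", "Rhat")], 10)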

Got it!! I am using 10,000 iterations and 4 chains to verify the stability of the results. If the results continue to be good, I will attempt to add random slopes to this model.

Here are my updates.
1. When I use 4 chains and run 20,000 iterations, the only warning I receive is 'There were 20 divergent transitions after warmup', while other metrics like Rhat, BFMI, and ESS seem fine. Considering the large number of iterations in my sample, can I ignore this issue?
2. I tried to add random slopes for two groups, and it seems there were no errors. I am currently waiting for the final results from a larger number of iterations, which will take some time.

parameters{
real alphaaq;
real beta1aq;
real beta2aq;
real<lower=0> sigma_alphaaq;
real<lower=0> sigma_beta1aq;
real<lower=0> sigma_beta2aq;
real<lower=0> sigmaaq;
real alphacr;
real beta1cr;
real beta2cr;
real<lower=0> sigma_alphacr;
real<lower=0> sigma_beta1cr;
real<lower=0> sigma_beta2cr;
real<lower=0> sigmacr;
real alphaadt;
real beta1adt;
real beta2adt;
real beta3adt;
real beta4adt;
real<lower=0> sigma_alphaadt;
real<lower=0> sigma_beta1adt;
real<lower=0> sigma_beta2adt;
real<lower=0> sigma_beta3adt;
real<lower=0> sigma_beta4adt;
real<lower=0> sigma_adt;
cholesky_factor_corr[11] Omega_chol1;
cholesky_factor_corr[11] Omega_chol2;
vector<lower=0>[11] sigma_ranef1;
vector<lower=0>[11] sigma_ranef2;
matrix[J,11] gamma1_raw;
matrix[G,11] gamma2_raw;
vector[11] mu1;
vector[11] mu2;
}
transformed parameters{
vector[J] gamma1alphaaq;
vector[J] gamma1alphacr;
vector[J] gamma1alpha;
vector[J] gamma1beta1aq;
vector[J] gamma1beta2aq;
vector[J] gamma1beta1cr;
vector[J] gamma1beta2cr;
vector[J] gamma1beta1adt;
vector[J] gamma1beta2adt;
vector[J] gamma1beta3adt;
vector[J] gamma1beta4adt;
vector[G] gamma2alphaaq;
vector[G] gamma2alphacr;
vector[G] gamma2alpha;
vector[G] gamma2beta1aq;
vector[G] gamma2beta2aq;
vector[G] gamma2beta1cr;
vector[G] gamma2beta2cr;
vector[G] gamma2beta1adt;
vector[G] gamma2beta2adt;
vector[G] gamma2beta3adt;
vector[G] gamma2beta4adt;
vector[N] mu_adt;
vector[N] mu_aq;
vector[N] mu_cr;
matrix[J,11] gamma1 = rep_matrix(mu1', J) + gamma1_raw * diag_pre_multiply(sigma_ranef1, Omega_chol1)'; // random effects for id
matrix[G,11] gamma2 = rep_matrix(mu2', G) + gamma2_raw * diag_pre_multiply(sigma_ranef2, Omega_chol2)';
matrix[11,11] DC1 = diag_pre_multiply(sigma_ranef1, Omega_chol1);
matrix[11,11] DC2 = diag_pre_multiply(sigma_ranef2, Omega_chol2);
for (j in 1:J){
gamma1alphaaq[j] = gamma1[j,1];
gamma1alphacr[j] = gamma1[j,2];
gamma1alpha[j]= gamma1[j,3];
gamma1beta1aq[j]= gamma1[j,4];
gamma1beta1cr[j]= gamma1[j,5];
gamma1beta1adt[j]= gamma1[j,6];
gamma1beta2aq[j]= gamma1[j,7];
gamma1beta2cr[j]= gamma1[j,8];
gamma1beta2adt[j]= gamma1[j,9];
gamma1beta3adt[j]= gamma1[j,10];
gamma1beta4adt[j]= gamma1[j,11];
}
for (g in 1:G) {
gamma2alphaaq[g] = gamma2[g,1];
gamma2alphacr[g] = gamma2[g,2];
gamma2alpha[g]= gamma2[g,3];
gamma2beta1aq[g]= gamma2[g,4];
gamma2beta1cr[g]= gamma2[g,5];
gamma2beta1adt[g]= gamma2[g,6];
gamma2beta2aq[g]= gamma2[g,7];
gamma2beta2cr[g]= gamma2[g,8];
gamma2beta2adt[g]= gamma2[g,9];
gamma2beta3adt[g]= gamma2[g,10];
gamma2beta4adt[g]= gamma2[g,11];
}
for (n in 1:N) {
mu_aq[n] = alphaaq + gamma1alphaaq[Group[n]] + gamma2alphaaq[Group2[n]]+
(beta1aq+gamma1beta1aq[Group[n]]+gamma2beta1aq[Group2[n]]) * c1[n] +
(beta2aq+gamma1beta2aq[Group[n]]+gamma2beta2aq[Group2[n]]) * c2[n];
mu_cr[n] = alphacr + gamma1alphacr[Group[n]] + gamma2alphacr[Group2[n]]+
(beta1cr+gamma1beta1cr[Group[n]]+gamma2beta1cr[Group2[n]]) * c1[n] +
(beta2cr+gamma1beta2cr[Group[n]]+gamma2beta2cr[Group2[n]]) * c2[n];
mu_adt[n] = alphaadt + gamma1alpha[Group[n]] + gamma2alpha[Group2[n]]+
(beta1adt+gamma1beta1adt[Group[n]]+gamma2beta1adt[Group2[n]]) * c1[n] +
(beta2adt+gamma1beta2adt[Group[n]]+gamma2beta2adt[Group2[n]]) * c2[n] +
(beta3adt+gamma1beta3adt[Group[n]]+gamma2beta3adt[Group2[n]]) * aq[n] +
(beta4adt+gamma1beta4adt[Group[n]]+gamma2beta4adt[Group2[n]]) * cr[n];
}}

3. There is an additional question. Since my experiment involves testing both internal and external validity, there may be two different sets of data, but they follow the same model. If the data are different, should I adjust the degrees of freedom or other parameters of the prior distributions?

First, 20,000 iterations is quite a lot. Did you try running only 2,000 iterations? Stan requires far fewer iterations than Gibbs samplers.
You may tighten the priors or switch to a normal distribution, e.g.

 sigma_alphaadt ~ student_t(5, 0, 2);

change to

 sigma_alphaadt ~ normal(0, 1.5);

You will find tons of posts in the forum about divergent transitions.
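For point 1, a small sketch for counting and locating the divergences (using zrn_fit as a stand-in for whatever your new fit object is called; these are standard rstan helpers):

library(rstan)

# Number of post-warmup divergences and saturated tree depths.
get_num_divergent(zrn_fit)
get_num_max_treedepth(zrn_fit)

# Pairs plot of a few scale parameters; divergent iterations are drawn
# in red and tend to cluster where the posterior geometry is difficult.
pairs(zrn_fit, pars = c("sigma_alphaadt", "sigma_beta3adt", "sigma_beta4adt"))

The pairs plot is the quickest way to see whether the 20 divergences concentrate in a particular region, often near small values of the scale parameters.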

Random slopes and intercepts can be correlated, so modeling a correlation matrix between them
makes sense. See the classic book Bayesian Data Analysis by Gelman et al.

Priors encode your beliefs about the parameters a priori. When lots of data are available, less tight
priors are needed. Imposing overly restrictive priors can lead to bias, while non-restrictive priors increase the difficulty for the sampler. From my experience, I'd go for non-restrictive priors first and have a look at the diagnostic plots.
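For the diagnostic plots, something like this (again only a sketch, with zrn_fit standing in for your current fit object):

library(rstan)

# Traceplots of a few scale parameters: well-mixed, overlapping chains
# suggest the sampler explores these scales properly.
stan_trace(zrn_fit, pars = c("sigma_alphaadt", "sigmaaq", "sigmacr"))

# Built-in sampler diagnostics: acceptance statistics, step sizes,
# tree depths and the energy distribution (related to the BFMI warning).
stan_diag(zrn_fit)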

I have another question! If I want to incorporate the covariance between random slopes into the mediation effect, can I use this code?

matrix[J,11] gamma1 = rep_matrix(mu1', J) + gamma1_raw * diag_pre_multiply(sigma_ranef1, Omega_chol1)'; // random effects for id
matrix[G,11] gamma2 = rep_matrix(mu2', G) + gamma2_raw * diag_pre_multiply(sigma_ranef2, Omega_chol2)';

matrix[11,11] DC1 = diag_pre_multiply(sigma_ranef1, Omega_chol1);
matrix[11,11] DC2 = diag_pre_multiply(sigma_ranef2, Omega_chol2);

generated quantities{
real avgIndEffect1_1;
real avgIndEffect1_2;
real avgIndEffect2_1;
real avgIndEffect2_2;
real avgIndEffect3_1;
real avgIndEffect3_2;
real avgIndEffect4_1;
real avgIndEffect4_2;
avgIndEffect1_1 = beta1aq * beta3adt + DC1[4,10]; // for id
avgIndEffect1_2 = beta1aq * beta3adt + DC2[4,10]; // for topic
avgIndEffect2_1 = beta1cr * beta4adt + DC1[5,11];
avgIndEffect2_2 = beta1cr * beta4adt + DC2[5,11];
avgIndEffect3_1 = beta2aq * beta3adt + DC1[7,10];
avgIndEffect3_2 = beta2aq * beta3adt + DC2[7,10];
avgIndEffect4_1 = beta2cr * beta4adt + DC1[8,11];
avgIndEffect4_2 = beta2cr * beta4adt + DC2[8,11];
}
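If you keep those quantities in generated quantities, you can summarise them directly from the fit, and you can also recover the implied id-level covariance matrix from the Cholesky factor and scales; a sketch (zrn_fit is a placeholder for the fitted model):

library(rstan)

# Posterior summaries of the average indirect effects defined above.
print(zrn_fit,
      pars  = c("avgIndEffect1_1", "avgIndEffect1_2",
                "avgIndEffect2_1", "avgIndEffect2_2",
                "avgIndEffect3_1", "avgIndEffect3_2",
                "avgIndEffect4_1", "avgIndEffect4_2"),
      probs = c(0.025, 0.5, 0.975))

# Recover the 11 x 11 covariance matrix of the id-level random effects
# draw by draw from the Cholesky factor and the scales.
draws <- extract(zrn_fit, pars = c("Omega_chol1", "sigma_ranef1"))
S     <- dim(draws$Omega_chol1)[1]
cov1  <- array(NA_real_, dim = c(S, 11, 11))
for (s in seq_len(S)) {
  L           <- diag(draws$sigma_ranef1[s, ]) %*% draws$Omega_chol1[s, , ]
  cov1[s, , ] <- L %*% t(L)
}
round(apply(cov1, c(2, 3), mean), 3)  # posterior mean covariance matrix

Note that DC1 is the scaled Cholesky factor rather than the covariance matrix itself; the covariance between two of the random effects is the corresponding entry of DC1 * DC1', which is what the loop above computes.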

I have some confusion about random slopes~

Hello Professor! When I introduced random slopes, all effects became non-significant (the credible intervals cross 0). I have posted the detailed code in a new thread and hope to receive a response.

Random slopes and intercepts can be correlated, so modeling a correlation matrix between them
makes sense. See the classic book Bayesian Data Analysis by Gelman et al.

It seems that directly correlating random slopes with random intercepts is not effective in this version; none of the coefficients are significant. However, this issue does not arise in the initial centered-parameterization version.

Have you checked the parameter mu? It's part of the parameters block.
I am not sure what the Excel sheet can tell us.

Sorry, I made some mistakes in this Excel sheet. I put both versions of the code in Why The effects unexpectedly became smaller after employing non-central parameterization?.
By the way, in what way should I check parameter mu? I don't know what the problem is.

In what way should I check parameter mu?

Because when gamma1 and gamma2 are non-significant, then likely both they and mu
are non-significant. Also, the web-page code doesn't use mu. Thus you may omit mu,
run the sampler again, and check whether some gamma parameters become relevant.
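For checking mu concretely, a sketch (fit_with_mu is a hypothetical name for the run that still contains mu1/mu2):

library(rstan)

# Posterior summaries of mu1/mu2; wide intervals centred on 0 while
# alphaaq/alphacr/alphaadt carry the intercepts suggest mu is redundant.
print(fit_with_mu, pars = c("mu1", "mu2"), probs = c(0.025, 0.5, 0.975))

# A strong negative correlation between mu1[3] and alphaadt in this plot
# would indicate the two intercept terms are not separately identified.
pairs(fit_with_mu, pars = c("mu1[3]", "alphaadt"))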

Also, the web-page code doesn't use mu.

Why The effects unexpectedly became smaller after employing non-central parameterization?
Do you mean the centered-parameterization version or the non-centered-parameterization version? So I need to remove mu_aq, mu_cr, mu_adt from the transformed parameters section.
And I am so confused why the significance of the effects is so different in the two versions of the model.

I meant these ones.

So I removed

vector[11] mu1;
vector[11] mu2;
mu1 ~ normal(0, 2);
mu2 ~ normal(0, 2);

And changed

matrix[J,11] gamma1 = rep_matrix(mu1', J) + gamma1_raw * diag_pre_multiply(sigma_ranef1, Omega_chol1)'; // random effects for id
matrix[G,11] gamma2 = rep_matrix(mu2', G) + gamma2_raw * diag_pre_multiply(sigma_ranef2, Omega_chol2)';

to

matrix[J, 11] gamma1 = gamma1_raw * diag_pre_multiply(sigma_ranef1, Omega_chol1)';
matrix[G, 11] gamma2 = gamma2_raw * diag_pre_multiply(sigma_ranef2, Omega_chol2)';

Is it right?

I tried removing mu and obtained mostly significant effects (similar to the random-intercept-only version). Does this imply that mu has an impact on gamma and that I should delete it?
result with mu:


result without mu:

Now I see mu is covered by alphaaq, alphacr, and alphaadt, thus we can remove it.
I think a clearer mathematical description and better formatting would be beneficial
to avoid some of the confusion.
