Low ESS and High RHat for Random Intercept & Slope Simulation (rstan and rstanarm)

joeHoover · July 30, 2019, 11:50pm

I have been working on a fairly complicated hierarchical model in Stan and everything was going fine until I introduced a random slope. Upon the introduction of the random slope, I started observing R-Hat > 1.01 and low effective sample sizes.

In an attempt to figure out what was going on, I started reducing the complexity of my simulation and the corresponding model. However, now I’ve wound up all the way back at a basic random intercept and slope simulation, which I’m now trying to fit in rstanarm, and I’m still observing the same issues.

Below, I’ve included my simulation code, code for running it, code for fitting the corresponding lme4 and rstanarm models, and my results. And this file hlm_sim.R (2.3 KB) contains all of the code.

I am hoping someone might be able to provide some insight into why I’m observing less than optimal sampling and mixing.

For observation y_{tj}, where t \in 1, ..., T=50 drawn from region j \in 1, ..., J=50, the DGP is:

y_{tj} \sim N(\mu_{tj}, \sigma_e)
\mu_{tj} = \beta_{0j} + x_{tj}*\beta_{1j}
\beta_{0j} = \gamma_{00} + u_{0j}
\beta_{1j} = \gamma_{10} + u_{1j}

\left( \begin{array} 1 u_{0j} \\ u_{1j} \end{array}\right) \sim \left( \begin{array} 1 \left( \begin{array} 1 0 \\ 0 \end{array}\right), \left( \begin{array} 1 \sigma_{u0j}^2 \\ \rho\sigma_{u0j}\sigma_{u1j} \end{array} \begin{array} 1 \rho\sigma_{u0j}\sigma_{u1j} \\ \sigma_{u0j}^2 \end{array} \right) \end{array}\right)

where \gamma_{00} = .5, \gamma_{10} = 1 \sigma_{u0j} = .2, \sigma_{u1j} = .2, \rho = .3, \sigma_e = .2

Finally, x_{ij} \sim N(0, 1)

This is the function I am using for the simulation:

library(dplyr)

simulate_panel <- function(J, T, 
                           gamma_00, gamma_10, 
                           sigma_u0j, 
                           sigma_u1j,
                           cor_u0j_u1j,
                           sigma_e, 
                           seed){

  
  set.seed(seed)
  
# Calculate covariance for RE
  cov_u0j_u1j = cor_u0j_u1j * sigma_u0j * sigma_u1j

# Generate Sigma 
  Sigma = matrix(c(sigma_u0j^2, 
                              cov_u0j_u1j, 
                              cov_u0j_u1j, 
                              sigma_u1j^2), ncol=2)

# Draw random effects from multivariate normal distribution  
  u = MASS::mvrnorm(n = J, mu = c(0, 0), Sigma = Sigma)
    
  simulated_panel <- data_frame(region = 1:J, # Specify constant region values
                               gamma_00,
                               gamma_10,
                               u_0j = u[,1],
                               u_1j = u[,2], 
                               sigma_e) %>%
    do({   # Replicate region values T times
          row_rep <- function(x, T) {
             x[rep(1:nrow(x), times = T),]}
          
          as.data.frame(row_rep(., T))
          
       }) %>%
    arrange(region) %>%
    mutate(x = rnorm(n(), 0, 1),  # Sample x 
           mu = (gamma_00 + u_0j + x * (gamma_10 + u_1j)),  # Calculate linear predictor
           e = rnorm(n(), 0, sigma_e), # Draw residual 
           y = mu + e) # Calculate y
    
  return(simulated_panel)
}

And, this generates a simulation from the DGP:

simulated_panel <- simulate_panel(J = 50, 
                                  T = 50, 
                                  gamma_00 = .5,
                                  gamma_10 = 1, 
                                  sigma_u0j = .2, 
                                  sigma_u1j = .2,
                                  cor_u0j_u1j = .3,
                                  sigma_e = .2, 
                                  seed = 123)

Using lme4, I can recover the parameters specified in the DGP:

lmer.m <- lmer(y ~ 1 + x + (1 + x | region), data = simulated_panel)
summary(lmer.m)

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 region   (Intercept) 0.03379  0.1838       
          x           0.03272  0.1809   0.34
 Residual             0.04016  0.2004       
Number of obs: 2500, groups:  region, 50

Fixed effects:
            Estimate Std. Error t value
(Intercept)  0.48580    0.02631   18.46
x            1.01698    0.02592   39.24

However, when I fit this model using rstanarm, ESS is low and R-Hat is high.

stan_lmer.m <- stan_lmer(y ~ 1 + x + (1 + x | region), 
data = simulated_panel, chains = 4, cores = 4)


summary(stan_lmer.m, pars = c('(Intercept)', 'x', 'sigma', 'Sigma[region:(Intercept),(Intercept)]',
                              'Sigma[region:x,(Intercept)]', 'Sigma[region:x,x]'), digits=2)

Estimates:
                                        mean   sd   2.5%   97.5%
(Intercept)                           0.48   0.03 0.43   0.54   
x                                     1.02   0.02 0.97   1.06   
sigma                                 0.20   0.00 0.20   0.21   
Sigma[region:(Intercept),(Intercept)] 0.04   0.01 0.02   0.05   
Sigma[region:x,(Intercept)]           0.01   0.01 0.00   0.02   
Sigma[region:x,x]                     0.03   0.01 0.02   0.05   

Diagnostics:
                                      mcse Rhat n_eff
(Intercept)                           0.00 1.01  115 
x                                     0.00 1.02  144 
sigma                                 0.00 1.00 3157 
Sigma[region:(Intercept),(Intercept)] 0.00 1.01  333 
Sigma[region:x,(Intercept)]           0.00 1.01  272 
Sigma[region:x,x]                     0.00 1.01  356

I know that I can crank up the number of iterations and increase warmup; however, I have fit much more complicated models than this and not observed these kinds of issues. Of course, if that’s the only solution, that I can live with that. However, my full model is more complex and I am concerned about how many iterations I’ll need to estimate the full model.

I initially assumed that something was wrong with my simulation (and still haven’t ruled that out); however, I’ve been beating my head against this for a while and haven’t found a bug. And it’s worth noting that the model is recovering the parameters from the DGP. It’s just that the sampling and mixing seems to be less than optimal.

I should also note that I did recently update rstan and rstanarm, so maybe I’m observing the consequences of changes to how Rhat and n_eff is calculated?

UPDATE:

I just ran the same rstanarm model for 10,000 iterations and observed the following diagnostics:

Diagnostics:
                                      mcse Rhat n_eff
(Intercept)                           0.00 1.01   645
x                                     0.00 1.01   857
sigma                                 0.00 1.00 15314
Sigma[region:(Intercept),(Intercept)] 0.00 1.00  1408
Sigma[region:x,(Intercept)]           0.00 1.01  1177
Sigma[region:x,x]                     0.00 1.00  1641

I should also clarify that my concern is based driven by my beliefs that:

Sampling for this model should be more efficient
Rhat > 1.01 shouldn’t just be disregarded.

Maybe these beliefs are unfounded and/or too extreme?

bbbales2 · July 31, 2019, 1:04pm

Can you check pairs(fit) and see if there are any correlations?

joeHoover · July 31, 2019, 3:37pm

@bbbales2, it seems that (1) the intercept and intercept random effects and (2) x and it’s random slopes are all highly correlated.

Also, I’ve found that if I increase \sigma_e from .2 to 1, Rhat drops to 1 and the ESS increase by roughly an order of magnitude. I think I recall reading higher Rhat and lower ESS should be observed for a constant; so, perhaps, conditional on the rest of the parameters, \sigma_e = .2 led to insufficient within-cluster variance?

Though, it’s not entirely clear to me why that would translate into posterior correlations between \gamma_{00} and u_{0j}.

Regardless, I guess the important part is that increasing \sigma_e resolves the problem I was observing. I’m happy with this resolution [EDIT: this isn’t as much of a resolution as I initially thought, see post below] as \sigma_e = .2 was an arbitrary choice; though, it does leave me wondering whether it would be possible to more efficiently model a DGP that happens to have low residual variance.

joeHoover · July 31, 2019, 4:01pm

UPDATE

I just ran the same stan_lmer model on the same simulation, with the exception that \sigma_e = \sigma_{u0j} = \sigma_{u1j} = 1 in the DGP. Under those conditions, I observed the same issues with EFF and Rhat as well as comparable correlations between fixed effects and their corresponding random effects in the pairs() plot. I also reran the simulation and model, but with 100 groups instead of 50 (i.e. J = 50). This did not resolve the problem.

So, to summarize, I’m finding that ESS decreases and Rhat increases as \sigma_{u0j} and \sigma_{u1j} approach \sigma_e. I assume I must be doing something wrong here, but I am not sure what.

UPDATE # 2

I just found an old thread where @stijn wrote:

I guess this could be an issue as I am using the correct model? However, I added an unmodeled variable z with \beta_z = .3 to the simulation and this did not mitigate the issue.

Also, consistent with the low ESS, there is high autocorrelation among the samples for the intercept when the random effect SDs and \sigma_e are equal or close.

However, this completely disappears when the random effect SDs = .2 and \sigma_e = 1.

UPDATE

I just increased the effect of the unmodeled variable z_{tj} to 1 and all of the issues disappeared. So, maybe the issue is that I have been modeling the DGP to closely? FWIW, I only started initially observing these issues when I introduced a random slope to the model.

avehtari · July 31, 2019, 7:03pm

10,000 post warm-up iterations per chain? Anyway it’s good that n_eff increases when the number of iterations increases (see [1903.08008] Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC and the online appendix)

Probably not.

You could try with option metric=‘dense_e’ (see related results in [1905.11916] Selecting the Metric in Hamiltonian Monte Carlo)

Do you get divergences? If not you could try smaller adapt_delta.

Do you see funnels if you plot Sigma vs x or Sigma vs b ?

joeHoover · July 31, 2019, 10:33pm

Yes. However, I just ran three new models with increasing numbers of post-warmup iterations. I’ve included the diagnostics below:

Iter = 2000, warmup = 1000

Diagnostics:
                                      mcse Rhat n_eff
(Intercept)                           0.00 1.00  180 
x                                     0.00 1.02  191 
sigma                                 0.00 1.00 3264 
Sigma[region:(Intercept),(Intercept)] 0.00 1.01  381 
Sigma[region:x,(Intercept)]           0.00 1.02  270 
Sigma[region:x,x]                     0.00 1.03  340

Iter = 4000, warmup = 1000

Diagnostics:
                                      mcse Rhat n_eff
(Intercept)                           0.00 1.01   397
x                                     0.00 1.01   513
sigma                                 0.00 1.00 10084
Sigma[region:(Intercept),(Intercept)] 0.00 1.00   901
Sigma[region:x,(Intercept)]           0.00 1.00   706
Sigma[region:x,x]                     0.00 1.00  1183

Iter = 10,000, warmup = 1000

Diagnostics:
                                      mcse Rhat n_eff
(Intercept)                           0.00 1.01  1122
x                                     0.00 1.00  1524
sigma                                 0.00 1.00 23557
Sigma[region:(Intercept),(Intercept)] 0.00 1.00  2638
Sigma[region:x,(Intercept)]           0.00 1.00  2020
Sigma[region:x,x]                     0.00 1.00  3324

So, yes, increasing the number of post-warmup iterations definitely increases n_eff and seems to decrease Rhat.

No, I haven’t gotten any divergences. I’ve also tried adapt_delta = .999 with 2000 post-warmup iterations per chain and 1000 warmup iterations. I didn’t notice an improvement with that (iter = 2000, warmup = 1000)

Diagnostics:
                                      mcse Rhat n_eff
(Intercept)                           0.00 1.04  157 
x                                     0.00 1.02  277 
sigma                                 0.00 1.00 3368 
Sigma[region:(Intercept),(Intercept)] 0.00 1.02  263 
Sigma[region:x,(Intercept)]           0.00 1.00  320 
Sigma[region:x,x]                     0.00 1.01  428

I guess there is some funneling (iter = 2000, warmup = 1000):

However, there are no divergences.

Overall, it looks like I can crank up the post-warmup iterations to achieve acceptable n_eff and Rhat. And from looking at the autocorrelation plots, it’s apparent that there is substantial autocorrelation for all parameters:

Again, this autocorrelation disappears when \sigma_e is more than twice as large as the SD for the random intercept and slope. It also disappears when I introduce an unmodeled variable z_{it} with an effect of 1 to the simulation.

So, is it fair to assume that the sampling issues that I’m observing are caused by insufficient variation in the sample? If that’s the case, I’m more than happy to just increase \sigma_e. However, if the problem lies elsewhere and inducing more residual variation is just an accidental fix; I’d very much like to know that!

avehtari · August 1, 2019, 7:05am

If n_eff grows proportionally to iter, then it’s more likely that sampling is ok (but it’s not guaranteed)

High adapt_delta is making step size smaller and autocorrelations higher. If you don’t get divergences you could make adapt_delta smaller, e.g. 0.8

Plots for intercept look very suspicious with those “tentacles” coming out of the bulk.

I guess this is related to centered vs. non-centered parameterization. It is known that centered would be better for strong likelihood (compared to prior, ie small \sigma_e, what you call small variation in sample) and non-centered would be better for weak likelihood (compared to prior, etc). rstanarm is using non-centered parameterization by default.

bbbales2 · August 1, 2019, 4:37pm

Yes this is what I’d try too. Here’s an interesting thread on this: Partial non-centered parametrizations in Stan

If you build your model with brms (which has similar syntax to rstanarm) you can generate code/data with make_stancode and make_standata and then do the centering yourself.

joeHoover · August 2, 2019, 12:07am

Thanks @bbbales2 and @avehtari! This makes sense. I have pretty good reason to expect that a non-centered parameterization will perform better on my real data; however, I (stupidly) wasn’t thinking about the potential interactions between centered vs. non-centered parameterizations and my simulations.

Thanks for helping me figure this out!

Topic		Replies	Views
Low effective sample size after running Bayesian cognitive model in Stan Modeling rstan , fitting-issues	8	803	August 18, 2021
Huge RHATs, Extreme Model Performance Variability Depending on Priors Modeling	10	893	September 23, 2020
How to speed up sampling in rstan? Modeling rstan , performance , hierarchical-model	6	6052	July 24, 2020
My first Stan model - hierarchical logistic regression Modeling cognitive-science	28	8334	June 6, 2017
Request for help understanding model behavior Modeling	17	1513	March 24, 2018

Low ESS and High RHat for Random Intercept & Slope Simulation (rstan and rstanarm)

Related topics