Ordinal mixed effects regression: low ESS on intercept sd + very slow sampling

Hi, I’m trying to model micro-longitudinal self-report data with Likert-scale items in brms. Each participant contributed multiple responses, and the response variable was measured with several items, so I used a mixed effects model with random intercepts across participant ID and response item. Because both the response and the single predictor are ordinal, I used a cumulative likelihood and modeled the predictor as monotonic. I’ve used what I think are weakly informative priors on all parameters.
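For reference, the setup looked roughly like this (a sketch: the prior choices shown here are illustrative placeholders, not necessarily my exact priors):

```r
library(brms)

# Illustrative weakly informative priors (placeholders, not exact choices)
priors <- c(
  set_prior("normal(0, 2)", class = "Intercept"),  # ordinal thresholds
  set_prior("normal(0, 1)", class = "b"),          # monotonic predictor effect
  set_prior("exponential(1)", class = "sd")        # group-level sds
)

fit <- brm(
  dfs_score ~ 1 + mo(dfun) + (1 | ID) + (1 | dfs_item),
  data   = data_mixed_training,
  family = cumulative(link = "logit"),
  prior  = priors,
  chains = 4, iter = 2000, warmup = 1000, cores = 4
)
```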

The first time I tried fitting the model, it took a really long time to finish sampling (~6 hours). It also had one divergent transition and low ESS on the standard deviation of the participant-ID intercepts and on the ordinal thresholds/intercepts (see output below). There’s a lot of data (~40,000 obs), which may be one reason the sampling took so long. The model samples faster on a subsample of the data, although it’s still quite slow.

 Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: dfs_score ~ 1 + mo(dfun) + (1 | ID) + (1 | dfs_item) 
   Data: data_mixed_training (Number of observations: 3866) 
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup samples = 4000

Group-Level Effects: 
~dfs_item (Number of levels: 8) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.36      0.13     0.18     0.68 1.01      736     1419

~ID (Number of levels: 50) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     1.53      0.16     1.26     1.88 1.02      319      713

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -4.30      0.33    -4.98    -3.65 1.02      180      620
Intercept[2]    -2.45      0.31    -3.06    -1.87 1.03      142      561
Intercept[3]    -1.09      0.30    -1.69    -0.52 1.03      138      524
Intercept[4]     0.95      0.30     0.36     1.54 1.03      142      471
Intercept[5]     2.86      0.31     2.28     3.45 1.02      147      522
Intercept[6]     5.35      0.31     4.74     5.97 1.02      156      511
modfun           0.87      0.07     0.74     1.00 1.00     1186     2202

Simplex Parameters: 
           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
modfun1[1]     0.30      0.04     0.22     0.37 1.00     1479     1766
modfun1[2]     0.27      0.03     0.22     0.33 1.00     1709     2136
modfun1[3]     0.27      0.04     0.20     0.35 1.00     2056     2036
modfun1[4]     0.16      0.06     0.05     0.27 1.00     1049     1153

Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

I’ve tried dropping/including parameters and found that the problem seems to occur when the random intercepts across participants are included. Somehow they seem to interfere with the ordinal thresholds/intercepts. Based on an mcmc_pairs() plot, the posteriors of all the thresholds are highly correlated (see picture below); I’m not sure what that means. I’ve never seen anything like this before, so maybe it’s something obvious that I’m missing.
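For anyone wanting to reproduce the diagnostic, this is roughly how the threshold correlations can be inspected (assuming the fitted model object is called `fit`; brms names the thresholds `b_Intercept[k]`):

```r
library(bayesplot)

# Pairwise scatter plots of the first few threshold parameters;
# a tight diagonal band indicates strong posterior correlation.
mcmc_pairs(
  as.array(fit),
  pars = c("b_Intercept[1]", "b_Intercept[2]", "b_Intercept[3]")
)
```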

Do you know what might be the issue? Cheers



Update: I found there was high autocorrelation on the intercept parameters. I was able to get a good ESS (~2,000) on all parameters with thinning (I used thin = 2). However, that was when I subsampled the data to 4,000 obs. When I use all of the data (40,000 obs), the viewer says 1,000 transitions will take about 1,500 s.


If you have a problem with sampling, then thinning just hides the problem, it doesn’t really fix it.

I’m not familiar with brms, but usually one needs to handle highly correlated parameters one way or another. The Stan documentation has some examples, but I can’t say how to do it in brms.

Yeah, I don’t really have much experience with thinning, but I didn’t think it would fix everything. Any ideas about what could be going on? I reckon the high autocorrelation could mean that the sampler is moving across the posterior too slowly, which in turn might be caused by a ridge in the posterior created by the high correlation among the intercepts/thresholds. But I can’t for the life of me figure out why the thresholds become correlated after I add random intercepts across participants.

Please use the Interfaces - brms tag for brms related questions. Otherwise, I will likely overlook them.

I think a high correlation between the thresholds and the person intercepts is plausible, since higher values of the person intercepts equivalently imply lower thresholds (and vice versa). I would recommend just running more samples. Doubling the size by running 1000 warmup and 3000 or even 4000 total iterations per chain should be sufficient. I don’t think anything serious is going on here, just a hard-to-fit model.
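Something along these lines, assuming the original fit object is still available (update() reuses the already-compiled model, so only sampling is repeated):

```r
# Refit with more post-warmup draws; everything else stays the same.
fit_long <- update(
  fit,
  iter = 4000, warmup = 1000,
  chains = 4, cores = 4
)
```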

Awesome, thank you Paul! Just to clarify: the thresholds and person intercepts aren’t correlated (at least the few I checked); it’s the thresholds themselves that are all highly positively correlated with each other, e.g. threshold[1] and threshold[2]. I imagine that means the entire set of thresholds can shift up or down together? I don’t understand how that could be caused by adding random intercepts for participants: the mean of those intercepts should be centered at zero and only the sd gets estimated, right? So why would it allow all the thresholds to shift?

UPDATE: I was able to get good ESS (~2,000) by using a much more aggressive normal(0, 0.25) prior on the standard deviation of the random effects. I still get a high estimate for the random-effect sd on ID (~1.5), well outside the bulk of that prior, and the thresholds are still correlated, although less so now.
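Concretely, something like this (a sketch; only the sd prior is changed from the original fit):

```r
# Refit with a much tighter prior on the group-level standard deviations.
fit_tight <- update(
  fit,
  prior = set_prior("normal(0, 0.25)", class = "sd"),
  cores = 4
)
```

One caveat worth noting: with a posterior sd on ID around 1.5, a normal(0, 0.25) prior is actively fighting the likelihood, so it may be regularizing the geometry rather than reflecting genuine prior beliefs.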


The thresholds being correlated is expected in itself because, in the cumulative family, the thresholds are ordered: if one threshold is higher, the other thresholds will tend to be higher as well. If there are more terms potentially fighting for the same variance (i.e., varying intercepts), this effect may become stronger. Remember that the centering of varying effects is only a soft centering via the hierarchical prior, which is usually sufficient to identify the model but not necessarily trivial to fit.


Right, cheers Paul. So what do you think is the best approach? Running the model with more iterations? I’m still struggling with the long sampling time plus the one cheeky divergent transition (I hope it will go away if I increase adapt_delta and let the model run). Or is there maybe some way to anchor one of the thresholds so that the sampling doesn’t misbehave so much? I’m just shooting from the hip here, but I was thinking of something similar to dummy-coded predictors, where one level is used as the reference and set to zero and the others are estimated as contrasts.
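In case it helps anyone reading along, raising adapt_delta in brms looks like this (a sketch; smaller steps mean slower sampling but usually fewer divergences):

```r
# Smaller leapfrog steps via a higher acceptance target,
# combined with more post-warmup draws.
fit_careful <- update(
  fit,
  control = list(adapt_delta = 0.99),
  iter = 4000, warmup = 1000, cores = 4
)
```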

Thank you very much for your time!


A little more sampling should do the trick. Or try out another ordinal family, such as sratio() or acat(), which do not have ordered thresholds.
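For example (a sketch reusing the same formula; the sequential model in sratio() places no ordering constraint on its thresholds):

```r
fit_sr <- brm(
  dfs_score ~ 1 + mo(dfun) + (1 | ID) + (1 | dfs_item),
  data   = data_mixed_training,
  family = sratio(link = "logit"),
  chains = 4, cores = 4
)
```

Model comparison (e.g. via loo()) would then tell you whether the sequential model fits the data about as well as the cumulative one.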