High Pareto-k values with Binomial GLMM — Looking for suggestions/alternatives

I’m modeling some data on children’s understanding and learning of an important pre-algebraic concept. When you give U.S. 7-to-11-year-olds an assessment with relevant items, most children will answer all of the items incorrectly, a decent minority will answer most of the items correctly, and the rest will be in the middle (example distribution below).

[example histogram of summed correct item responses]

Historically, these data have been treated either A) as Normal in ANOVAs/t-tests (which naturally yields terrible posterior predictions) or B) as categorical (e.g., completely incorrect vs. at least 1 correct, which can toss out a lot of information depending on the sample).

More recently, some have gone the binomial GLMM route (with random item intercepts/slopes). I’m finding such models consistently run into trouble with high Pareto-k values. For example, we might ask whether something like working memory capacity (WMC) predicts children’s performance on a set of items with the following model.

example_mod <- stan_glmer(correct ~ zWMC + (1 | id) + (1 | item), family = binomial, data = pretest_data)


My current thinking is that the binomial GLMM approach is a good way to handle these data, but LOO diagnostics pretty much always turn up many high Pareto-k values. I’m guessing the distribution of the (summed) correct item responses is part of the problem? Any insights or recommendations for alternative approaches would be greatly appreciated!
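For concreteness, the diagnostics I’m describing come from something like the following (a sketch; `loo_fit` is a name I’ve made up here, and the specific output will of course vary by dataset):

```r
library(rstanarm)
library(loo)

# PSIS-LOO for the fitted model; observations with Pareto-k > 0.7 get flagged
loo_fit <- loo(example_mod)
print(loo_fit)

# Identify which observations have problematic Pareto-k estimates
pareto_k <- pareto_k_values(loo_fit)
which(pareto_k > 0.7)
```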

BONUS: Experiments in this area often involve pretest and posttest measures, with children randomized to different types of instruction in between. Due to the shape of these data, the most popular way to assess the effectiveness of interventions and other predictors of learning is to analyze the posttest scores of individuals with no/little pretest knowledge (and ignore individuals with partial understanding). An approach that could incorporate these middle cases would be very useful.


I think the first thing I would check is the default priors in stan_glmer, and whether those priors make sense given domain knowledge. Another option would be to code the model up in brms and see what shakes loose, again checking that the default priors in brms make sense.
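A sketch of how to inspect those defaults (the rstanarm call assumes the fitted `example_mod` from above; the brms call assumes the same formula and data, with `bernoulli()` for a 0/1 outcome):

```r
library(rstanarm)
library(brms)

# rstanarm: report the priors the fitted model actually used
prior_summary(example_mod)

# brms: list the default priors for the same formula before fitting anything
get_prior(correct ~ zWMC + (1 | id) + (1 | item),
          data = pretest_data, family = bernoulli())
```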


Fully agree with @Ara_Winter. It’s also worth emphasizing: high Pareto-k values are evidence that you should not trust model comparisons based on PSIS-LOO, WAIC, or related measures. However, they are NOT necessarily evidence that your model is misspecified or otherwise has problems. They indicate that some points have strong influence over the model posterior. This can arise in (at least) three ways.

  1. You could have a misspecified model whose posterior gets pulled around in weird ways by points that it has fundamental difficulties in fitting.
  2. You have a model where some points have extreme leverage.
  3. You have a flexible model with lots of parameters, some of which are not well identified once you start leaving data points out. For example, if only a small number of data points inform some particular element of a random effect, then leaving one of these out might change the posterior considerably.

From what you describe, your high k’s might well be due to this third issue. Model criticism and checking are still important in this case, but you can’t necessarily rely on the Pareto-k’s from PSIS-LOO to provide that for you. For what it’s worth, there are other diagnostics available from loo that can tell you how worried you should be that high Pareto-k’s are indicative of misspecification. For more, see the Cross-validation FAQ
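If you want to check whether the high k’s actually matter for your comparisons, one option (a sketch, assuming the `example_mod` fit from above) is to refit the model without each flagged observation, which rstanarm’s loo method supports via `k_threshold`, or to sidestep importance sampling entirely with exact K-fold cross-validation:

```r
library(rstanarm)

# Exact LOO refits for observations with Pareto-k above the threshold,
# PSIS approximation for everything else
loo_exact <- loo(example_mod, k_threshold = 0.7)

# Alternatively, 10-fold cross-validation avoids PSIS altogether
kf <- kfold(example_mod, K = 10)
```

Both are slower than plain PSIS-LOO because they involve refitting the model, but they give you estimates that don’t depend on the importance-sampling approximation holding.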