Estimating Subject Level Effects for New Subjects in IRT, or multilevel models in general


I am working on an item response theory (IRT) model using brms based on @paul.buerkner 's excellent tutorial paper (, but my question applies to any model with subject/participant level effects. This is a long post but at its core all I’m asking is what people think about approximating estimates for ‘random’ effects for new grouping levels, as this is a common goal in practice.

To provide context for my question, I’ll use the first example from that paper:

# ----------- Code for Section 5.1 ------------
# Analysis of the VerbAgg data set using dichotomous IRT models
data("VerbAgg", package = "lme4")

# get an overview of the data
head(VerbAgg, 10)

# ---------- 1PL models ----------------------
# specify a 1PL model in brms
formula_va_1pl <- bf(r2 ~ 1 + (1 | item) + (1 | id))

# specify some weakly informative priors
prior_va_1pl <- 
  prior("normal(0, 3)", class = "sd", group = "id") + 
  prior("normal(0, 3)", class = "sd", group = "item")

# fit the 1PL model
fit_va_1pl <- brm(
  formula = formula_va_1pl,
  data = VerbAgg,
  family = brmsfamily("bernoulli", "logit"),
  prior = prior_va_1pl,
  seed = 1234,
  file = "fit_va_1pl",
  cores = 4

# extract person parameters
ranef_va_1pl <- ranef(fit_va_1pl)
(person_pars_va_1pl <- ranef_va_1pl$id)

# ---------- End Paul Code ----------------------

In this case, Paul fit a simple model and was able to extract the person effects. In practice, however, one may want to estimate the person level effect for out of sample participants. Using the example above, it may mean that we only had the first 100 subjects in the study, and then subject 101 comes along and we want to estimate their ‘trait’.

# We can fit the same model for the first 100 ids
VerbAgg100=as_tibble(VerbAgg) %>%
  filter(as.integer(id)<101) %>%
  mutate(id=factor(id,levels = 1:100))

fit_va_1pl_100 <- brm(
  formula = formula_va_1pl,
  data = VerbAgg100,
  family = brmsfamily("bernoulli", "logit"),
  prior = prior_va_1pl,
  seed = 1234,
  file = "fit_va_1pl_100",
  cores = 4

# extract person parameters
ranef_va_1pl_100 <- ranef(fit_va_1pl_100)
(person_pars_va_1pl_100 <- ranef_va_1pl_100$id)

# Then, pretend we want to estimate id 101
new_data=as_tibble(VerbAgg) %>%
  filter(as.integer(id)==101) %>%
  mutate(id=factor(id,levels = 101))

# the effect for id 101 from the full model would be

Obviously, this is not strictly possibly, as the new participant (or new level of a grouping factor, more generally) was not present in the data. However, I want to know what people think about the following approaches to (kind of) estimate it:

  1. The simplest approach would be to simply add the new participant’s data to the existing model data and use update() to refit the model with data including our patient of interest. The benefits here are obvious, with the downside being that we have to refit the model for every new participant and the fact that the estimates for our original cohort are different than the published/original values.
Here is an example:
# Adding the case and refitting

fit_va_1pl_100_new <- update(fit_va_1pl_100,newdata=VerbAgg100_new)

# extract person parameters
ranef_va_1pl_100_new <- ranef(fit_va_1pl_100_new)
(person_pars_va_1pl_100_new <- ranef_va_1pl_100_new$id)
  1. Another option would be that we use predict() and allow_new_levels() to generate a lot of predictions for our new participant. However, instead of just sampling from the ‘space of potential levels’ once per draw (using sample_new_levels = “uncertain”), we generate a large number of predictions for each sample. Then we filter these to only keep the prediction, for each sample, that most closely matched the observed data. This narrows the uncertainty around the prediction to values that fit our data better, but is obviously not fully Bayesian and likely will have narrower estimates than #1.
Here is an example:
## Option 2: We draw from the posterior for a range of person effects and then select those that fit the data best
predict_new_level=function(model,new_data,n_draws=100 ){
  for (draw in 1:n_draws) {
    temp_draw=prepare_predictions(fit_va_1pl_100,newdata = new_data,allow_new_levels = T,sample_new_levels = "uncertainty",nsamples = n_samples,subset = 1:n_samples)
    intercept=tibble(intercept=temp_draw$dpars$mu$fe$b[,1]) %>% mutate(sample=1:n())
    person_effect=tibble(person=temp_draw$dpars$mu$re$r$id[,1]) %>% mutate(sample=1:n())
    item_effects=as_tibble(temp_draw$dpars$mu$re$r$item) %>% mutate(sample=1:n()) %>% gather(name,item,-sample)
    temp_results=full_join(item_effects,person_effect,by="sample") %>% full_join(intercept,by="sample") %>% mutate(draw=draw)
    results[draw,1]=nest(temp_results,data = everything())


results_prediction=results %>% 
  group_by(sample,draw) %>% 
  mutate(real=ifelse(new_data$r2=="Y",1,0)) %>% 
  mutate(prediction=inv_logit_scaled(intercept+item+person)) %>%

results_final=results_prediction %>%
  summarise(error_bin=sum(prediction_bin!=real), error_con=sum(abs(prediction-real))) %>%
  ungroup() %>%
  group_by(sample) %>%
  filter(error_bin==min(error_bin)) %>%

# We can either keep all of the results with equal prediction accuracy
person_results1=results_final %>%
  ungroup() %>%
  left_join(results %>% 
              select(-name,-item) %>% 
              distinct(),by=c("sample","draw")) %>% 
  select(-draw) %>% 
  distinct()  %>% 

# Or sample a random one for each draw from the original model posterior
person_results2=results_final %>% 
  ungroup() %>% 
  group_by(sample) %>% 
  sample_n(1) %>% 
  left_join(results %>% 
              select(-name,-item) %>% 
              distinct(),by=c("sample","draw")) %>%
  select(-draw) %>% 
  distinct() %>% 

# using this method, the effect for 101 would be
predict_101=c("Estimate"=median(person_results2$person),quantile(person_results2$person,probs = c(0.025,0.975)))
  1. A final option would be that we extract the intercept and item effect coefficients from our model (or more generally, everything except the grouping level of interest specific effects) and sum these to get the expected value for each sample and item level without considering the person effects (call it FixedTerm). We can then merge this with our new data (which will be expanded nsamples times). Then we fit a new model of the form r2 ~ 0 + sample + FixedTerm, with the ‘prior’ for FixedTerm set as constant(1). This equates, in my mind, to finding credible values for the intercept for each sample, given the FixedTerm for that level of sample from the original model & our new outcome variables. I have not seen anyone do something like this, so it may be completely insane!
Here is an example:
# Extract the values that are 'fixed'
effect_item=ranef(fit_va_1pl_100,summary = F)$item %>% as_tibble() %>% mutate(sample=1:n()) %>% gather(Item,InterceptItem,-sample)
effect_intercept=fixef(fit_va_1pl_100,summary = F) %>% as_tibble() %>% mutate(sample=1:n())
effect_constant=full_join(effect_intercept, effect_item, by="sample") %>% mutate(FixedTerm=Intercept+InterceptItem) %>% mutate(Item=str_remove(Item,"\\.Intercept"))
fixed_data=new_data %>% mutate(Item=as.character(item)) %>% full_join(effect_constant, by="Item") %>% mutate(sample=as.character(sample))

# Doing it for all samples takes forever, so lets sample some samples!
fixed_data_500=fixed_data %>% filter(sample %in% as.character(sample(1:4000,500,replace = F))) %>% mutate(sample=factor(sample))

               family = brmsfamily("bernoulli","logit"),
               prior = set_prior(prior = "constant(1)",class = "b", coef = "FixedTerm")+set_prior(prior = "normal(0,2)",class = "b"),
               file = "fixed_model_500_3")

# extract sample parameters
fixef_fixed_model_500_3 <- fixef(fixed_model_500_3,summary=F)
(person_fixed_model_500_3 <- fixef_fixed_model_500_3[,1:500])
fixed_3_101=c("Estimate"=median(inv_logit_scaled(as.vector(person_fixed_model_500_3))),quantile(inv_logit_scaled(as.vector(person_fixed_model_500_3)),probs = c(0.025,0.975)))

In our hypothetical example of estimating the trait for participant #101, we get the following:

I think the most valid/accepted option would likely be the add the new data and refit the model. However, in some cases this will take an enormous amount of time for each case. What do you think about the other options? What have other people done, if something completely different?

Thanks so much


Please also provide the following information in addition to your question:

  • Operating System: Windows 10
  • brms Version: 2.13.11
1 Like

Sorry for taking to long to answer - your question is relevant and well written.

You are AFAIK correct. Unfortunately how to actually avoid the need to refit is AFAIK an open research problem. I think all of the other options you give are not great - as you can see from how much they differ from the full model approach.

Some additional approaches to consider:

  • One can use the posterior of the parameters from the first fit as a prior for the parameters in the new fit which would then have only the new subjects as data (and should thus be fast to fit). There is a bunch of traps along the way, but it is something I was able to get to (at least roughly) actually work (see my StanCon 18 submission). In particular, if you happen to have a lot of data for the initial model, than the posterior might be sufficiently close to multivariate normal to let you do this. I think a good summary of the important points is at Composing Stan models (posterior as next prior) and Using posteriors as new priors - #6 by betanalpha . I also remember (but cannot find the link) someone mentioning fitting a neural network to the posterior, passing the neural network params to build a prior for the next model. The problem is that passing a global distribution with between-parameter correlations to brms would require some hacking around (you could likely achieve that via stanvars though).

  • Since the one new data point should not change the posterior very much, you could somewhat speed up refitting of the whole model by passing the adaptation information from the original model (step size, mass matrix) - I am not sure this is easily achievable via brms, but might be worth some investigation. You could also use posterior means of the original fit as inits. If you can pull this off, you could most likely use much shorter warmup (or even avoid warmup completely).

  • The “predict and filter” approach seems somewhat related to importance sampling. Now, this is a wild guess (I’ve never done this or seen it done), but maybe using the old posterior as a proposal distribution and the use the likelihood of the full model for weighting could work quite well. Unfortunately, brms AFAIK doesn’t let you just compute the likelihood, so you would have to either rewrite the likelihood yourself or hack around brms to expose it…

  • Also [1412.4869] Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data seems related, but I don’t understand the method and once again I don’t think there is a ready-made implementation.

All of the approaches I describe would IMHO likely suffer from some sort of drift so you would need to refit the whole model occasionally anyway, so all in all, I am not sure avoiding the refitting is worth the hassle. Focusing on speeding up the refitting (e.g. by using the recently added support for within-chain paralellization in brms or by testing whether you can get away with shorter warmup/sampling) could possibly be a better investment.

Best of luck with your project!

1 Like

That is extremely helpful and thorough and will helpfully guide others if they have the same questions!

1 Like