Brms: mi() for discrete outcomes in an IRT model

Just for transparency and clarification, the initial post was about using mi() for missing outcome variables in a non-linear model. That ultimately is not an issue; instead, it’s that the missing outcome is a discrete variable, which Stan takes issue with. The original question has been revised to reflect more relevant issues.

I have item-level data for several different cognitive tests that I’d like to fit with IRT models. As with any dataset, there are missing data. Some of those missing data are from people just not being given the test, but a smaller proportion are from people refusing to answer a particular item. In the case of people not administered a test, I don’t really care about their missing data; however, for those who refused to answer an item, I’d like to be able to use the rest of their item responses to impute what their response might have been.

Similarly, there are some missing data points corresponding to someone answering “don’t know” to an item. Traditionally, these missing points are treated as errors, but I suspect that they’re better accounted for by a separate data generation process that I’d like to ideally model alongside the imputation of refusals and model-fitting for those with complete data.

The simplest case for the model formula with some missing outcomes would be the following:

formula <- bf(response | mi() ~ beta + exp(logalpha)*theta,
              theta ~ 0 + (1 | ID),
              beta ~ 1 + (1 |i| Item),
              logalpha ~ 1 + (1 |i| Item),
              nl = TRUE)

In essence, the goal is to estimate the parameters of the IRT model and use that as a way of predicting what the missing data for a particular person would be; however, the missing data are discrete variables (0/1) and thus not Stan-friendly with the out-of-the-box mi() function. I know there are examples for estimating discrete parameters on this forum and in the Stan documentation; however, the mathematical details of specifying a model that marginalizes out the discrete parameters are beyond my applied statistics skills.

Related to my thoughts on modeling a separate data generation process for missing due to “don’t know” versus refusals, I don’t have a good sense for how to approach the problem. My thought was some kind of mixture model, but I don’t know how to signal to Stan/brms that I want to impute missing data conditionally on the kind of “missingness”.

If it helps at all, what I hypothesize is that those who have “don’t know” responses represent different groups of individuals. Some of those are genuine “no clue, can’t tell you” answers that should be counted as errors, but some are that are “I don’t want to think too hard about this and just want to get out of here” answers that might not reflect true ability. If I can specify this as a mixture model, then I’ve got a multinomial/categorical likelihood for the mixtures (responds, refuses and doesn’t know, and refuses but could know), and I have some predictors of those group memberships (e.g., depression, subjective cognitive complaints, dementia worry, etc).

Desired Solutions
Hopefully this detail is enough to understand the modeling problem that I’m facing, and I’m open to any and all solutions that the community may have. A solution that works within brms is ideal, but I’m fine as well just moving to straight Stan. I’m trying to avoid learning BUGS or JAGS just for this case, though perhaps the discrete parameters will render that an impossibility

1 Like

are you sure the formula you are showing is not working? I think the thread you linked to referrerd to mi() occurring on right hand sides of formulas not on left hand sides.

I haven’t actually tried the formula yet since I was still working on wrapping my head around the modeling problem. While I was looking through documentation on the mi() function, I just saw that it didn’t work in non-linear models, but I wasn’t paying any attention to left versus right sides of the formulas.

From the prior discussion I linked to, I would’ve thought that the mi() function would work for my case since I’m asking for imputation from a linear element, but again, I also wasn’t paying attention to which side of the formula I was looking at and just kept seeing that non-linear was the problem.

I’ve got another model running right now, but I’ll try it once that finishes up. Seems like maybe that’s not an issue then (though I’ll confirm). Any thoughts on the mixture piece by chance?

Well, mi() doesn’t work but just not for the reason I was anticipating. The error I get when attempting is that mi() is not supported for Bernoulli/binomial likelihoods

The main thing to be aware of is that mi() treats the missing data as a parameter to be estimated, and Stan (well, HMC) does not support discrete parameters

True: I forget about that issue until situations like this. Is that still true when a different estimator is used (e.g., using CmdStanR’s variational Bayes)?

In this case, the imputation is the product of a logistic regression, so I suppose that rounding the resulting logit could be used to guess the missing values. I recall running into this same problem with a latent profile model I was trying to write in Stan a year or so ago. I seem to recall the need to use log_sum_exp(...) over the various class distributions (or maybe it was log_sum(...) that I had use - I don’t recall now). I’m not sure whether that solution would work for imputing missing values in a binary variable, though

I’m 99% sure that the following code is not going to work(it does at least compile), but I’m trying to wrap my head around how to modify the model to do what I’m looking for. The primary inspiration for imputing the missing values comes from this post on the discussion board. The meat of that solution to marginalizing out the missing binary outcome variable was including a loop over the missing values like this:

for(n in 1:N_missing) {
  target += log_mix(Ymi[n], bernoulli_logit_lpmf(1 | mu),
                            bernoulli_logit_lpmf(0 | mu));

Where Ymi is a vector of the probabilities that the missing data are 1 and (in my case) mu is the result of the item response model (i.e., beta + exp(logalpha) * theta). My hope is that I can use the IRT parameters estimated from those with observed data can be used to estimate the probability that a missing response is 0/1. I think this is helped by the fact that no one is missing responses to all the items, so there is still an opportunity to learn something about the latent trait from the partial item responses and then use that information to inform estimation of the missing responses. I understand that the mirt package is able to generate estimates for IRT models with missing data using full-information maximum likelihood, so I know there’s a way of approaching this problem – just not sure how to implement a Bayesian and Stan-friendly version of it.

Additionally, instead of trying to build in a mixture model to reflect my beliefs about different causes of missingness, I wanted to try to build in the prediction of the Ymi proportion via a beta regression. I’ve never written a beta regression in Stan, so I’m not anywhere close to confident in my implementation here. Additionally, I don’t know that a beta regression is going to really improve any estimation, but my desire is to test whether certain covariates (e.g., depression) might be related to the missingness and inform performance estimation. That’s the idea there, at least.

// generated with brms 2.15.0
functions {
  /* compute correlated group-level effects
  * Args: 
    *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns: 
    *   matrix of scaled group-level effects
    matrix scale_r_cor(matrix z, vector SD, matrix L) {
      // r is stored in another dimension order than z
      return transpose(diag_pre_multiply(SD, L) * z);

data {
  int<lower=1> N;  // total number of observations
  int Y[N];  // response variable
  int<lower=0> Nmi;  // number of missings
  int<lower=1> K_beta;  // number of population-level effects
  matrix[N, K_beta] X_beta;  // population-level design matrix
  int<lower=1> K_logalpha;  // number of population-level effects
  matrix[N, K_logalpha] X_logalpha;  // population-level design matrix
  int<lower=1> K;  // number of population-level effects for beta regression component
  matrix[N, K] X;  // population-level design matrix for beta regression component
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_theta_1;
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_beta_1;
  vector[N] Z_2_logalpha_2;
  int<lower=1> NC_2;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?

transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc; // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i-1];

parameters {
  vector<lower=0,upper=1>[Nmi] Ymi; // probability that missing response is 1
  vector[K_beta] b_beta;  // population-level effects
  vector[K_logalpha] b_logalpha;  // population-level effects
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  vector[N_1] z_1[M_1];  // standardized group-level effects
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  matrix[M_2, N_2] z_2;  // standardized group-level effects
  cholesky_factor_corr[M_2] L_2;  // cholesky factor of correlation matrix
  vector[Kc] b;  // population-level effects for beta regression component
  real Intercept;  // temporary intercept for centered predictors of beta regression component
  real<lower=0> phi;  // precision parameter for beta regression component

transformed parameters {
  vector[N_1] r_1_theta_1;  // actual group-level effects
  matrix[N_2, M_2] r_2;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_2] r_2_beta_1;
  vector[N_2] r_2_logalpha_2;
  r_1_theta_1 = (sd_1[1] * (z_1[1]));
  // compute actual group-level effects
  r_2 = scale_r_cor(z_2, sd_2, L_2);
  r_2_beta_1 = r_2[, 1];
  r_2_logalpha_2 = r_2[, 2];

model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] nlp_theta = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] nlp_beta = X_beta * b_beta;
    // initialize linear predictor term
    vector[N] nlp_logalpha = X_logalpha * b_logalpha;
    // initialize non-linear predictor term
    vector[N] mu;
    // initialize linear predictor term
    vector[Nmi] beta_mu = Intercept + Xc * b;
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_theta[n] += r_1_theta_1[J_1[n]] * Z_1_theta_1[n];
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_beta[n] += r_2_beta_1[J_2[n]] * Z_2_beta_1[n];
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_logalpha[n] += r_2_logalpha_2[J_2[n]] * Z_2_logalpha_2[n];
    for (n in 1:N) {
      // compute non-linear predictor values
      mu[n] = nlp_beta[n] + exp(nlp_logalpha[n]) * nlp_theta[n];
    for(n in 1:Nmi) {
      // computing mixing ratios
      beta_mu[n] = inv_logit(beta_mu[n]);
    target += beta_lpdf(Ymi | beta_mu * phi, (1 - mu) *phi);
    for(n in 1:Nmi) {
      // handle missing outcomes
      target += log_mix(Ymi[n], bernoulli_logit_lpmf(1 | mu),
                                bernoulli_logit_lpmf(0 | mu));
    target += bernoulli_logit_lpmf(Y | mu);
  // priors including constants
  target += student_t_lpdf(sd_1 | 3, 0, 2.5)
  - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(z_1[1]);
  target += student_t_lpdf(sd_2 | 3, 0, 2.5)
  - 2 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(to_vector(z_2));
  target += lkj_corr_cholesky_lpdf(L_2 | 1);
  target += student_t_lpdf(Intercept | 3, 0, 2.5);
  target += gamma_lpdf(phi | 0.01, 0.01);

generated quantities {
  // actual population-level intercepts for beta regression component
  real b_Intercept = Intercept - dot_product(means_X, b);
  // compute group-level correlations
  corr_matrix[M_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
  vector<lower=-1,upper=1>[NC_2] cor_2;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_2) {
    for (j in 1:(k - 1)) {
      cor_2[choose(k - 1, 2) + j] = Cor_2[j, k];

I want to make sure that conceptually what I’ve outlined above for handling the missing outcome variables is reasonable or sensible within Stan and using IRT data. I have a sense that what I want is possible, but I’m an applied researcher and the technical realities of identifiability and implementation are often hard for me to recognize until I’ve played with the models.

I’d also love to know what in my code is wrong. I tried to adapt everything to be in line with brms language and standards, but some of the things that brms does to improve efficiency are just well beyond my own Stan understanding. Ideally, I’d like to be able to run this with a QR decomposition as well, but my first priority is getting the general framework up and running. Personally, I think that being able to handle missing item responses in IRT applications would be a great option for those of us interested in Bayesian IRT with Stan, so I hope that a solution to this problem will be of use to people other than myself.

1 Like

I fear you are trying to achieve an unreasonable goal. If you have missing data in response, I don’t think you can do better than to not use those rows for fitting and “impute” the responses afterwards via model predictions. See Imputing missing responses in multivariate ordinal data for a discussion of this. Feel free to ask for clarifications here,if the thread is hard to follow.

Best of luck with your model!

1 Like

I think I generally understand the problem from the thread, and I initially was planning to do exactly as you said: run the data in long-format and let brms drop the rows with NA for the response variable with the posterior always being available to me to impute missing responses. As I thought about the problem and learned more about the data (this is a 3rd party dataset, so I’ve been reaching out and asking some clarification things to learn more about the documentation), I started to think that the types of missingness were non-ignorable/non-equivalent and that I might have some additional information within the data to inform the missing not at random process.

I know that I ran through these thoughts at the top of the discussion, but maybe it’s possible for someone to tell me where in this logic my intuition is off. Basically, I have item responses for 30 questions on a test of general cognitive ability from 4 different countries. Responses to items are recorded as either correct (1), incorrect (0), not administered (.na), refusal to attempt (.r), and refusal to guess (.dk). For clarification, items were not administered (.na) if the person had a physical limitation to completing the item (e.g., unable to read, unable to hold a pencil). A refusal to attempt an item meant the person just outright refused to do anything for the item while a refusal to guess was coded when a person responded with a “don’t know” response and refused to provide an alternative response.

From a data generation perspective, as I thought about the different response processes, I increasingly felt like relying on the estimated item parameters and participant’s partial response patterns was missing these other factors. I considered adding in person-level covariates related to estimating the type of response (i.e., correct vs incorrect vs missing due to X reason), but that would ultimately require that missing values are included in the model and a different likelihood specification than the Bernoulli. Additionally, I don’t particularly want to know what predicts the kind of missingness per se; instead, I’d prefer to know what someone’s response to an item might be given the kind of missingness they have plus other information we know about the person.

From there, I started thinking about whether I could use other information from my clinical experience to make better guesses about the ability of individuals to answer some items that are coded as missing due to different reasons. Clinically, we expect “don’t know” responses more commonly in people with certain background variables (e.g., higher depression scores, more fatigue, older subjective age, etc), and when we interpret these cognitive tests, we are often thinking whether or not someone who says “don’t know” actually has the ability to answer that item correctly or not. IRT provides a very direct way of informing the probability of a correct response (i.e., the theta parameter), so it seemed like a natural way of “imputing” the missing response.

Still, that doesn’t utilize any information about the kinds of missing responses that I have in the data. As I noted in my initial post, I thought about using a mixture model to essentially get some kind of latent classes (e.g., a class of “don’t know” and refusers who are expected to answer correctly versus another class expected to answer incorrectly). When I found the other post on handling missing discrete responses, I thought maybe a beta regression predicting the log_mix ratio could be a more efficient way of getting the same information. Basically, I’d plug in these clinical variables that we believe to be informative about the kinds of responses to get the log_mix ratio and then get a more direct estimate of the missing response via the IRT parameters estimated from the sample + the participant’s partial response pattern.

As I’ve thought about this, if items were just missing because they weren’t administered due to physical limitations, then I’d feel fine just using the posterior to “impute” the responses (if I was ever interested in those values). The fact that there are then the .r and .dk missing responses is where the wrench got thrown in since my clinical experience tells me that this is not missing at random and that I have information that can be used to flesh out that missing data generation. Perhaps the issue then is that I’m trying to make a model that will reflect the data generation process a little better but at no additional benefit to inference, validity, or utility?

To me this looks as you don’t really have missing data. The data are complete, they just have a more complex structure than just the response.

That actually makes a lot of sense to me. You could either fit a categorical regression taking all the 5 categories as options, or (maybe even better) fit a 4-category categorical regression to predict the type of response (guessed, .na, .r, .dk) and then a model with binary response, just for the subset where the subjects actually guessed. You should be able to combine those into a multivariate model and enforce a correlation structure between various predictors (feel free to ask if you are unsure how to achieve those in brms).

I think a correlation structure between predictors could achieve this goal. E.g. if you put a (1 | gr(patient_id, id = p)) as a predictor for both parts of the model, you will get an estimate of the correlation between a varying intercept for each category and varying intercept for correct response, e.g. a correlation between “tendency of person to respond correctly” and “tendency of person to refuse to guess”.

Would that make sense?


I wasn’t aware that these kinds of simultaneous models were possible in brms: it feels like a whole other world of opportunities just opened up in my head. I’d seen the gr(...) documentation in phylogenic model vignettes, but I don’t do that kind of research and never thought I had a reason to use it or explore it any further. I was under the impression that the only way to model correlation/relationship among multiple models was by bf(...) + bf(...) + set_rescor(TRUE), and since set_rescor() is supported for just normal and student t families, it wasn’t occurring to me to try modeling these two things together.

I want to make sure that I have an appropriate conceptual understanding of what is being done. The fact that the random effects are being correlated across the two models makes me think of joint survival and longitudinal models. Is that a fair approximation of what is happening?

As far as the brms syntax, would it be something like the following:

resp.type ~ bf(response_type ~ x1 + x2 + (1 | gr(ID, id = p)),
               family = categorical(link = "logit"))

2pl.model ~ bf(response | subset(guessed) ~ beta + exp(logalpha) * theta,
               theta ~ 0 + (1 | gr(ID, id = p)),
               beta ~ 1 + (1 |i| Item),
               logalpha ~ 1 + (1 |i| Item),
               nl = TRUE, family = bernoulli(link = "logit"))

fit <- brm(resp.type + 2pl.model, ...)

Since my data come from multiple countries, I’ve also been considering extending the random effects to include that participants are nested in countries (I may even go so far as 3 levels: country + region + person when some additional countries’ data become available). Would it be correct to say that the grouping term then becomes this: gr(ID, by = Country, id = p)?

1 Like

:-D yes, brms let’s you do wonderful things. It however also let’s you build models that are hard to debug and understand, so it’s a mixed bag :-)

I think the comparison to joint survival + longitudinal models is not very close. In joint surv + long models, there is a bit of a tighter coupling as the longitudinal estimate serves as a predictor in the survival model. Here, we are just saying that some coefficients might be correlated. Or, to be precise, that the hyperprior is not univariate normal, but multivariate normal.

So assuming (for simplicity) we have just two linear regressions bf(u ~ (1 | gr(ID, id = p))) + bf(v ~ (1 | gr(ID, id = p)) we are saying that:

\mu_{u,i} = \beta_{u,1} + \alpha_{u, ID[i]} \\ \mu_{v,i} = \beta_{v,1} + \alpha_{v, ID[i]} \\ \left( \begin{matrix} \alpha_{u,k} \\ \alpha_{v,k} \end{matrix} \right) \sim MVN(0, \Sigma ) \\ \Sigma = \left(\begin{matrix} \sigma_{\alpha,u} & \rho \\ \rho & \sigma_{\alpha,v} \end{matrix} \right)

where \sigma_{\alpha,u}, \sigma_{\alpha,v}, \rho are hyperparameters to be estimated. (in your case you have multiple intercepts and so the multivariate normal will have more dimensions and you will be estimating more correlation hyperparameters).

Contrast to the usual bf(u ~ (1 | ID)) + bf(v ~ (1 | ID)) which would differ in:

\alpha_{u,k} \sim N(0, \sigma_{\alpha,u})\\ \alpha_{v,k} \sim N(0, \sigma_{\alpha,v})

Unless you have a lot of data, the correlations are unlikely to be very tightly constrained and so the implied dependency can be relatively weak. If you had better understanding of the relationships between the individual model components, it is possible, that you could devise a better dependency structure, but the advantage of the correlation is that it is not IMHO making some strong assumptions.

The model as you wrote it looks the way I would write it. I hope it will be possible to fit it to your data. Obviously, checking the model fit afterwards will be important.

With by = Country you are fitting separate covariance matrix (i.e. the \sigma and \rho parameters as shown above) for each level of Country, but will AFAIK not allow the mean of your linear predictor vary by country. This might and might not be reasonable. Note that you can also do gr(ID + Country, id = p) this way you get extra intercept per country (which then lets the mean of the linear predictor to vary by country), but the correlations will be the same for all countries. So both variants add different type of flexibility to your model. Which is more sensible should depend on your background knowledge and/or the type of misfit you get when ignoring Country (if any).

1 Like

I wasn’t thinking about the fact that survival + longitudinal JM goes that extra step of using one as a predictor for the other. That makes sense, though.

I do think that the correlation structure is preferred here since I don’t have any other information about what that structure ought to be. I should end up with n ~ 10,500 (hopefully closer to 11,000 but I’m being conservative), so I’d hope that any true covariance structure there will be recoverable by the model fitting. I don’t, however, have any intuition for the sample size needed to get good estimates in this case since this is a new model for me.

That’s good to know. I’ll have to do some thinking on this to figure out what makes the most sense, but my impulse is that this is the preferable option here.

@martinmodrak thank you immensely for the clarification and help with this! This has been very helpful and eye-opening for me

1 Like

You are very welcome :-)