Brms: mi() for discrete outcomes in an IRT model

Just for transparency and clarification, the initial post was about using mi() for missing outcome variables in a non-linear model. That ultimately is not an issue; instead, it’s that the missing outcome is a discrete variable, which Stan takes issue with. The original question has been revised to reflect more relevant issues.

I have item-level data for several different cognitive tests that I’d like to fit with IRT models. As with any dataset, there are missing data. Some of those missing data are from people just not being given the test, but a smaller proportion are from people refusing to answer a particular item. In the case of people not administered a test, I don’t really care about their missing data; however, for those who refused to answer an item, I’d like to be able to use the rest of their item responses to impute what their response might have been.

Similarly, there are some missing data points corresponding to someone answering “don’t know” to an item. Traditionally, these missing points are treated as errors, but I suspect that they’re better accounted for by a separate data generation process that I’d like to ideally model alongside the imputation of refusals and model-fitting for those with complete data.

The simplest case for the model formula with some missing outcomes would be the following:

formula <- bf(response | mi() ~ beta + exp(logalpha)*theta,
              theta ~ 0 + (1 | ID),
              beta ~ 1 + (1 |i| Item),
              logalpha ~ 1 + (1 |i| Item),
              nl = TRUE)

In essence, the goal is to estimate the parameters of the IRT model and use that as a way of predicting what the missing data for a particular person would be; however, the missing data are discrete variables (0/1) and thus not Stan-friendly with the out-of-the-box mi() function. I know there are examples for estimating discrete parameters on this forum and in the Stan documentation; however, the mathematical details of specifying a model that marginalizes out the discrete parameters are beyond my applied statistics skills.

Related to my thoughts on modeling a separate data generation process for missing due to “don’t know” versus refusals, I don’t have a good sense for how to approach the problem. My thought was some kind of mixture model, but I don’t know how to signal to Stan/brms that I want to impute missing data conditionally on the kind of “missingness”.

If it helps at all, what I hypothesize is that those who have “don’t know” responses represent different groups of individuals. Some of those are genuine “no clue, can’t tell you” answers that should be counted as errors, but some are that are “I don’t want to think too hard about this and just want to get out of here” answers that might not reflect true ability. If I can specify this as a mixture model, then I’ve got a multinomial/categorical likelihood for the mixtures (responds, refuses and doesn’t know, and refuses but could know), and I have some predictors of those group memberships (e.g., depression, subjective cognitive complaints, dementia worry, etc).

Desired Solutions
Hopefully this detail is enough to understand the modeling problem that I’m facing, and I’m open to any and all solutions that the community may have. A solution that works within brms is ideal, but I’m fine as well just moving to straight Stan. I’m trying to avoid learning BUGS or JAGS just for this case, though perhaps the discrete parameters will render that an impossibility

1 Like

are you sure the formula you are showing is not working? I think the thread you linked to referrerd to mi() occurring on right hand sides of formulas not on left hand sides.

I haven’t actually tried the formula yet since I was still working on wrapping my head around the modeling problem. While I was looking through documentation on the mi() function, I just saw that it didn’t work in non-linear models, but I wasn’t paying any attention to left versus right sides of the formulas.

From the prior discussion I linked to, I would’ve thought that the mi() function would work for my case since I’m asking for imputation from a linear element, but again, I also wasn’t paying attention to which side of the formula I was looking at and just kept seeing that non-linear was the problem.

I’ve got another model running right now, but I’ll try it once that finishes up. Seems like maybe that’s not an issue then (though I’ll confirm). Any thoughts on the mixture piece by chance?

Well, mi() doesn’t work but just not for the reason I was anticipating. The error I get when attempting is that mi() is not supported for Bernoulli/binomial likelihoods

The main thing to be aware of is that mi() treats the missing data as a parameter to be estimated, and Stan (well, HMC) does not support discrete parameters

True: I forget about that issue until situations like this. Is that still true when a different estimator is used (e.g., using CmdStanR’s variational Bayes)?

In this case, the imputation is the product of a logistic regression, so I suppose that rounding the resulting logit could be used to guess the missing values. I recall running into this same problem with a latent profile model I was trying to write in Stan a year or so ago. I seem to recall the need to use log_sum_exp(...) over the various class distributions (or maybe it was log_sum(...) that I had use - I don’t recall now). I’m not sure whether that solution would work for imputing missing values in a binary variable, though

I’m 99% sure that the following code is not going to work(it does at least compile), but I’m trying to wrap my head around how to modify the model to do what I’m looking for. The primary inspiration for imputing the missing values comes from this post on the discussion board. The meat of that solution to marginalizing out the missing binary outcome variable was including a loop over the missing values like this:

for(n in 1:N_missing) {
  target += log_mix(Ymi[n], bernoulli_logit_lpmf(1 | mu),
                            bernoulli_logit_lpmf(0 | mu));

Where Ymi is a vector of the probabilities that the missing data are 1 and (in my case) mu is the result of the item response model (i.e., beta + exp(logalpha) * theta). My hope is that I can use the IRT parameters estimated from those with observed data can be used to estimate the probability that a missing response is 0/1. I think this is helped by the fact that no one is missing responses to all the items, so there is still an opportunity to learn something about the latent trait from the partial item responses and then use that information to inform estimation of the missing responses. I understand that the mirt package is able to generate estimates for IRT models with missing data using full-information maximum likelihood, so I know there’s a way of approaching this problem – just not sure how to implement a Bayesian and Stan-friendly version of it.

Additionally, instead of trying to build in a mixture model to reflect my beliefs about different causes of missingness, I wanted to try to build in the prediction of the Ymi proportion via a beta regression. I’ve never written a beta regression in Stan, so I’m not anywhere close to confident in my implementation here. Additionally, I don’t know that a beta regression is going to really improve any estimation, but my desire is to test whether certain covariates (e.g., depression) might be related to the missingness and inform performance estimation. That’s the idea there, at least.

// generated with brms 2.15.0
functions {
  /* compute correlated group-level effects
  * Args: 
    *   z: matrix of unscaled group-level effects
  *   SD: vector of standard deviation parameters
  *   L: cholesky factor correlation matrix
  * Returns: 
    *   matrix of scaled group-level effects
    matrix scale_r_cor(matrix z, vector SD, matrix L) {
      // r is stored in another dimension order than z
      return transpose(diag_pre_multiply(SD, L) * z);

data {
  int<lower=1> N;  // total number of observations
  int Y[N];  // response variable
  int<lower=0> Nmi;  // number of missings
  int<lower=1> K_beta;  // number of population-level effects
  matrix[N, K_beta] X_beta;  // population-level design matrix
  int<lower=1> K_logalpha;  // number of population-level effects
  matrix[N, K_logalpha] X_logalpha;  // population-level design matrix
  int<lower=1> K;  // number of population-level effects for beta regression component
  matrix[N, K] X;  // population-level design matrix for beta regression component
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_theta_1;
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_beta_1;
  vector[N] Z_2_logalpha_2;
  int<lower=1> NC_2;  // number of group-level correlations
  int prior_only;  // should the likelihood be ignored?

transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc; // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i-1];

parameters {
  vector<lower=0,upper=1>[Nmi] Ymi; // probability that missing response is 1
  vector[K_beta] b_beta;  // population-level effects
  vector[K_logalpha] b_logalpha;  // population-level effects
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  vector[N_1] z_1[M_1];  // standardized group-level effects
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  matrix[M_2, N_2] z_2;  // standardized group-level effects
  cholesky_factor_corr[M_2] L_2;  // cholesky factor of correlation matrix
  vector[Kc] b;  // population-level effects for beta regression component
  real Intercept;  // temporary intercept for centered predictors of beta regression component
  real<lower=0> phi;  // precision parameter for beta regression component

transformed parameters {
  vector[N_1] r_1_theta_1;  // actual group-level effects
  matrix[N_2, M_2] r_2;  // actual group-level effects
  // using vectors speeds up indexing in loops
  vector[N_2] r_2_beta_1;
  vector[N_2] r_2_logalpha_2;
  r_1_theta_1 = (sd_1[1] * (z_1[1]));
  // compute actual group-level effects
  r_2 = scale_r_cor(z_2, sd_2, L_2);
  r_2_beta_1 = r_2[, 1];
  r_2_logalpha_2 = r_2[, 2];

model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] nlp_theta = rep_vector(0.0, N);
    // initialize linear predictor term
    vector[N] nlp_beta = X_beta * b_beta;
    // initialize linear predictor term
    vector[N] nlp_logalpha = X_logalpha * b_logalpha;
    // initialize non-linear predictor term
    vector[N] mu;
    // initialize linear predictor term
    vector[Nmi] beta_mu = Intercept + Xc * b;
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_theta[n] += r_1_theta_1[J_1[n]] * Z_1_theta_1[n];
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_beta[n] += r_2_beta_1[J_2[n]] * Z_2_beta_1[n];
    for (n in 1:N) {
      // add more terms to the linear predictor
      nlp_logalpha[n] += r_2_logalpha_2[J_2[n]] * Z_2_logalpha_2[n];
    for (n in 1:N) {
      // compute non-linear predictor values
      mu[n] = nlp_beta[n] + exp(nlp_logalpha[n]) * nlp_theta[n];
    for(n in 1:Nmi) {
      // computing mixing ratios
      beta_mu[n] = inv_logit(beta_mu[n]);
    target += beta_lpdf(Ymi | beta_mu * phi, (1 - mu) *phi);
    for(n in 1:Nmi) {
      // handle missing outcomes
      target += log_mix(Ymi[n], bernoulli_logit_lpmf(1 | mu),
                                bernoulli_logit_lpmf(0 | mu));
    target += bernoulli_logit_lpmf(Y | mu);
  // priors including constants
  target += student_t_lpdf(sd_1 | 3, 0, 2.5)
  - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(z_1[1]);
  target += student_t_lpdf(sd_2 | 3, 0, 2.5)
  - 2 * student_t_lccdf(0 | 3, 0, 2.5);
  target += std_normal_lpdf(to_vector(z_2));
  target += lkj_corr_cholesky_lpdf(L_2 | 1);
  target += student_t_lpdf(Intercept | 3, 0, 2.5);
  target += gamma_lpdf(phi | 0.01, 0.01);

generated quantities {
  // actual population-level intercepts for beta regression component
  real b_Intercept = Intercept - dot_product(means_X, b);
  // compute group-level correlations
  corr_matrix[M_2] Cor_2 = multiply_lower_tri_self_transpose(L_2);
  vector<lower=-1,upper=1>[NC_2] cor_2;
  // extract upper diagonal of correlation matrix
  for (k in 1:M_2) {
    for (j in 1:(k - 1)) {
      cor_2[choose(k - 1, 2) + j] = Cor_2[j, k];

I want to make sure that conceptually what I’ve outlined above for handling the missing outcome variables is reasonable or sensible within Stan and using IRT data. I have a sense that what I want is possible, but I’m an applied researcher and the technical realities of identifiability and implementation are often hard for me to recognize until I’ve played with the models.

I’d also love to know what in my code is wrong. I tried to adapt everything to be in line with brms language and standards, but some of the things that brms does to improve efficiency are just well beyond my own Stan understanding. Ideally, I’d like to be able to run this with a QR decomposition as well, but my first priority is getting the general framework up and running. Personally, I think that being able to handle missing item responses in IRT applications would be a great option for those of us interested in Bayesian IRT with Stan, so I hope that a solution to this problem will be of use to people other than myself.

1 Like

I fear you are trying to achieve an unreasonable goal. If you have missing data in response, I don’t think you can do better than to not use those rows for fitting and “impute” the responses afterwards via model predictions. See Imputing missing responses in multivariate ordinal data for a discussion of this. Feel free to ask for clarifications here,if the thread is hard to follow.

Best of luck with your model!

1 Like

I think I generally understand the problem from the thread, and I initially was planning to do exactly as you said: run the data in long-format and let brms drop the rows with NA for the response variable with the posterior always being available to me to impute missing responses. As I thought about the problem and learned more about the data (this is a 3rd party dataset, so I’ve been reaching out and asking some clarification things to learn more about the documentation), I started to think that the types of missingness were non-ignorable/non-equivalent and that I might have some additional information within the data to inform the missing not at random process.

I know that I ran through these thoughts at the top of the discussion, but maybe it’s possible for someone to tell me where in this logic my intuition is off. Basically, I have item responses for 30 questions on a test of general cognitive ability from 4 different countries. Responses to items are recorded as either correct (1), incorrect (0), not administered (.na), refusal to attempt (.r), and refusal to guess (.dk). For clarification, items were not administered (.na) if the person had a physical limitation to completing the item (e.g., unable to read, unable to hold a pencil). A refusal to attempt an item meant the person just outright refused to do anything for the item while a refusal to guess was coded when a person responded with a “don’t know” response and refused to provide an alternative response.

From a data generation perspective, as I thought about the different response processes, I increasingly felt like relying on the estimated item parameters and participant’s partial response patterns was missing these other factors. I considered adding in person-level covariates related to estimating the type of response (i.e., correct vs incorrect vs missing due to X reason), but that would ultimately require that missing values are included in the model and a different likelihood specification than the Bernoulli. Additionally, I don’t particularly want to know what predicts the kind of missingness per se; instead, I’d prefer to know what someone’s response to an item might be given the kind of missingness they have plus other information we know about the person.

From there, I started thinking about whether I could use other information from my clinical experience to make better guesses about the ability of individuals to answer some items that are coded as missing due to different reasons. Clinically, we expect “don’t know” responses more commonly in people with certain background variables (e.g., higher depression scores, more fatigue, older subjective age, etc), and when we interpret these cognitive tests, we are often thinking whether or not someone who says “don’t know” actually has the ability to answer that item correctly or not. IRT provides a very direct way of informing the probability of a correct response (i.e., the theta parameter), so it seemed like a natural way of “imputing” the missing response.

Still, that doesn’t utilize any information about the kinds of missing responses that I have in the data. As I noted in my initial post, I thought about using a mixture model to essentially get some kind of latent classes (e.g., a class of “don’t know” and refusers who are expected to answer correctly versus another class expected to answer incorrectly). When I found the other post on handling missing discrete responses, I thought maybe a beta regression predicting the log_mix ratio could be a more efficient way of getting the same information. Basically, I’d plug in these clinical variables that we believe to be informative about the kinds of responses to get the log_mix ratio and then get a more direct estimate of the missing response via the IRT parameters estimated from the sample + the participant’s partial response pattern.

As I’ve thought about this, if items were just missing because they weren’t administered due to physical limitations, then I’d feel fine just using the posterior to “impute” the responses (if I was ever interested in those values). The fact that there are then the .r and .dk missing responses is where the wrench got thrown in since my clinical experience tells me that this is not missing at random and that I have information that can be used to flesh out that missing data generation. Perhaps the issue then is that I’m trying to make a model that will reflect the data generation process a little better but at no additional benefit to inference, validity, or utility?

To me this looks as you don’t really have missing data. The data are complete, they just have a more complex structure than just the response.

That actually makes a lot of sense to me. You could either fit a categorical regression taking all the 5 categories as options, or (maybe even better) fit a 4-category categorical regression to predict the type of response (guessed, .na, .r, .dk) and then a model with binary response, just for the subset where the subjects actually guessed. You should be able to combine those into a multivariate model and enforce a correlation structure between various predictors (feel free to ask if you are unsure how to achieve those in brms).

I think a correlation structure between predictors could achieve this goal. E.g. if you put a (1 | gr(patient_id, id = p)) as a predictor for both parts of the model, you will get an estimate of the correlation between a varying intercept for each category and varying intercept for correct response, e.g. a correlation between “tendency of person to respond correctly” and “tendency of person to refuse to guess”.

Would that make sense?

1 Like