Guidelines for Practical Imputation with Stan?

I’ve been trying to wrap my head around how to best go about multiple imputation with Stan code, and I’ve read a number of different posts and vignettes on this forum, StackExchange, and various GitHubs. While all helpful, I have found that there is a fair amount of variability in recommendations and resources, but these recommendations tend to be very use specific while tutorials often focus on cases where only one variable has missing data or where all the missing data can be modeled via a multivariate normal distribution.

My hope in creating this thread is to get a sense for what the community thinks of with respect to implementing imputation models in Stan from problem conceptualization to implementation and verification. If such resources for a workflow already exist, then I’d greatly appreciate any links or recommended readings as I haven’t found anything quite as dedicated to the topic as I seemingly need in order to feel comfortable write a Stan model to accommodate a realistically complicated missing data scenario.

To ground this somewhat, I can share my general situation, though I don’t necessarily mean for this thread to be beholden to this singular context. I’m running item response theory models where there are a number of person-level and item-level covariates being used to explain/predict the person and item parameters. The missingness is strictly in predictors with all but a few covariates of interest missing at least some observations. The covariates that are missing include variables that could be modeled via continuous distributions (likely a simple normal or skew normal in some cases) as well as discrete distributions (including multinomial, ordered, and binary). There are auxiliary variables that can be used to help impute missing values in the covariates of interest as well with similar properties of missingness and distributional considerations.

From my general searches, the major recommendations that I’ve seen are listed below. For each approach, I’m including some thoughts/questions/comments to hopefully give some extra context on what isn’t necessarily clear to a very much so applied data analyst with only a very practical appreciation for concepts like marginalizing out a variable or Jacobian adjustments.

  1. Use other software like mice to complete the imputation and then pass the data to the Stan model –

This is a very intuitive solution in my mind as other resources for handling missing data typically have a lot of documentation about how to do so and what arguments to adjust for specific kinds of problems. When a model can be run with a package like brms, this is a fine plug-and-play, just-power-through kind of approach since there are built-in functions to help with all the data cleaning on the backend. Even then, how does one conduct appropriate posterior predictive checks when there are technically multiple datasets being used to fit the model? Another concern here is just the time cost of Bayesian analyses: how can this be made efficient so that models that may take many hours to fit on a single dataset aren’t taking days (or longer) to run multiple times over imputed datasets?

  1. Treat missing variables as parameters to be estimated and impute values for the missing values of each variable –

This seems to be the more common of the general tutorials and guidances that I have been able to find, and the documentation for these approaches is pretty good overall. When there are tutorials for estimating missing discrete variables, they tend to focus on the binary case, and it isn’t always clear (at least to me) how those methods would generalize to categorical or ordered cases. Additionally, when missing values are imputed essentially one variable at a time, it would seem highly inefficient. In essence, a multivariable formula is being used to estimate the linear component of each variable with missingness that is then used for the sampling statement over the combined observed and missing observations vector. When there are multiple covariates with missing data and one is using auxiliary variables in these imputations as well, this seems to very rapidly increase the number of parameters and equations being estimated. Is there a point at which this very straightforward approach is more efficient than something more complicated – e.g., some number of observations by some number of predictors vs auxiliary variables?

  1. Treat missing variables as parameters but impute values from a multivariate model such as with copulas –

This particular topic seemingly has gained a lot of attention on the forum, and the documentation by @spinkney and the development of these methods for the next brms update seemingly have driven this. At present, I’ve seen implementation of a multivariate gaussian copula with mixed data types: normal, binomial, and Poisson (github/spinkney). On the forum, it also seems that an ordered marginal has been added to this functionality (Speed-up for Mixed Discrete-Continuous Gaussian Copula - Reduce Sum?). It seems relatively trivial to model any particular marginal via a gaussian copula when the marginal distribution has a *_cdf definition in Stan, but many of the discrete distributions lack the CDF, meaning that users need to “do the math” to make the copula work. There are also vine copulas, which seem to be restricted to bivariate copulas but offer additional flexibility in terms of the specified relationship between the variables. In the case of multivariate missing data, is there a multivariate extension of the vine copula or does one need to simply estimate each pairwise bivariate vine copula? Since most discrete distributions can be treated as arising from a continuous latent variable, is it more efficient/ever appropriate to model missingness via a multivariate regression on this latent scale and then use the results to get the discrete marginals?

  1. Use gaussian process models to impute the multivariate missing data –

This is relatively new to me as I’ve not really ever used GP models before, but I have seen some literature on its use for data imputation (e.g., here and here). This also seems to have been raised on the forum before (Scalable Gaussian process inference - #8 by tillahoffmann). GPs seem to be very flexible, but are they a reasonable default approach to imputing missing data in Stan? They strike me as computationally intensive and a little black-boxy, though this could be my general lack of familiarity with them. I also don’t believe that I’ve ever seen an implementation of GPs for missing data in Stan, though this also might be a bias in terms of the places I tend to look and the kinds of problems that I encounter in neuropsychology versus big data analysis.

Looking forward to the discussion and recommendations that will hopefully come from this!

5 Likes

I don’t think we have anything specific about this written up in any depth. The general idea is covered in a User’s Guide section, 3.1 Missing data | Stan User’s Guide, but with a very simple example. You can look at what packages like mi do in R ( https://cran.r-project.org/web/packages/mi/index.html )—this is Andrew Gelman and Ben Goodrich’s multiple imputation package, but it’s not based on Stan.

This is a very common situation and covers most if not all of the cases that the mi package handles.

If you do this, you wind up getting inference based on what the BUGS probability programming language called “cut”. The fundamental problem is that this is not a Bayesian method as it “cuts” the flow of information from the Stan model back to the imputation.

This is the full Bayesian approach.

Given a joint model on all of the predictors, missingness is relatively easy to deal with. To deal with categorical or ordered cases, the joint model needs to include components that are categorical or ordered (e.g., multi-logit and ordered logit are described in the User’s Guide).

I thought this is what you meant with the last issue. You definitely need this to be multivariate if you are going to use information about the other predictors in filling in the missingness. Otherwise, the inference has to all be passed through outcomes. Sometimes that can work, but a joint model should be better.

That’s just one particular choice of joint model. It’s convenient in that you can link like in a GLM to transform (e.g., apply exp() to a dimension to constrain to positive or inv_logit to constrain to a probability), and then you can have a data generating process on top of that (e.g., Bernoulli or categorical or ordered logit). The real problem with GPs is scaling—they’re no more a black box than any other regression in that you can control the covariance function.

For overall recommendation, I’d suggest thinking of it in terms of a joint model of all the things that could be missing, but you can start by implementing that as a product of missingess models, one for each data type. That’ll let the outcome filter back to the missingness, but not other predictors. Then see how it works at imputation in cases you simulate and know the answer. The simulations need to look like your real data, though, to be informative.

3 Likes

I really appreciate the time to answer these questions! I’m wondering whether there might be some more concrete examples of potential approaches to dealing with the missingness. Toward this end, I’m going to consider just a relatively straightforward linear regression:

data {
  /* Basic Multivariable Linear Regression Data
   * Values-
   *   N: number of observations/sample size
   *   K: number of predictors used in the model
   *   X: design matrix
   *   Y: outcome/dependent variable
   */
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] Y;
}

parameters {
  /* Simple Linear Regression Parameters
   * Values-
   *   beta: regression coefficient estimates
   *   alpha: regression intercept
   *   sigma: predictive error
   */
  vector[K] beta;
  real alpha;
  real<lower=0> sigma;
}

model {
  //priors
  beta ~ std_normal();
  alpha ~ student_t(3, 0, 1);
  sigma ~ student_t(3, 0, 1);

  //likelihood
  y ~ normal_id_glm(X, alpha, beta, sigma);
}

Now, let’s say that there’s two elements of the design matrix with missing observations: one is pseudo-continuous (e.g., [beta-]binomial with enough trials to be reasonably well approximated with as normal or skew normal) and the other is an ordered factor. As far as I have understood the multivariate approach to a situation like this, the code would become something like the following:

data {
  ...
  int<lower=1, upper=K> miss_con;   //column of design matrix with missing continuous data
  int<lower=1, upper=K> miss_ord;   //column of design matrix with missing ordered data
  int<lower=3> levels;              //number of levels in the missing ordered data
}

transformed data {
  ...
   use for loops to create indices of observed and missing observations
     \-> array[# missing continuous values] int con_miss;
     \-> array[# non-missing continuous values] int con_obsv;
     \-> array[# missing ordered values] int ord_miss;
     \-> array[# non-missing ordered values] int ord_obsv;
  ...
}

parameters {
  ...
  /* Multivariate Normal for Missing Data
   * Values-
   *   latents: predictors on a common latent scale
   *   mu_latent: vector of latent averages
   *   sigma_latent: vector of latent standard deviations
   *   L_latent: correlation of the latents
   */
  matrix[N, K] latents;
  vector[K] mu_latent;
  vector<lower=0>[K] sigma_latent;
  cholesky_factor_corr[K] L_latent;

  //accommodate the missing ordered factor
  simplex[levels - 1] c;
}

transformed parameters {
  //design matrix with the missing continuous values filled
  matrix[N, K] X_continuous_filled = X;
  X_continuous_filled[con_miss, miss_con] = latents[con_miss, miss_con];
}

model {
  ...
  //additional priors for the latents
  mu_latent ~ std_normal();
  sigma_latent ~ student_t(3, 0, 1);
  L_latent ~ lkj_cholesky_corr(2);

  //apply the multivariate normal
  latents ~ multi_normal_cholesky(mu_latent, diag_pre_multiply(sigma_latent, L_latent));
 
  //apply observed ordered data
  X[ord_obsv, miss_ord] ~ ordered_logistic(inv_logit(latents[ord_obsv, miss_ord]), c);

  //likelihood
  for ( n in 1:N ) {

    // marginalize the missing discrete ordered predictor
    if (X_continuous_filled[n, miss_ord] == -999) {
      vector[levels] lp_ord;
      for ( i in 1:levels ) lp_ord[i] = ordered_logistic_lpmf( i | inv_logit(latents[n, miss_ord]), c);
      target += log_sum_exp(lp_ord);

    } else {

       y[n] ~ normal(alpha + X_continuous_filled[n, ] * beta, sigma);

    } // end if statement
  } // end for loop over N
}

I don’t know that this code would work, but I think it’s a general illustration of what I was meaning here:

This is a very naive implementation of the multivariate latents for missing data as I understand from the copula posts that I shared initially that a truncated multivariate normal is more appropriate. It seems that the truncated normal is the case in which special functions are needed, which leads to needing a way of specifying the marginals of the multivariate normal copulas for each given discrete distribution that would be of interest. Is this accurate or is there another pathology underlying just a simple MVN as implemented above that makes it inappropriate as a generic method for getting around missing data?

One issue that I see with the MVN example code is that it is making a relatively weird assumption as written: namely, it’s treating the representation of the ordered factor as a single value in the design matrix, which implies no contrast coding is being applied to it. I don’t know that this is a common situation. I think a gentler approach that would be easier for all of us applied folk is something along the lines of what mice does where a set of imputed datasets is produced that can be handled with all the familiar transformation methods of whatever statistical environment is being used rather than handling data transforms completely through Stan.

I wonder whether there is some way of mitigating this issue in some way. The only thing that I can think of involves running the model twice, which can have issues obviously. The gist of this thought, though, is that the first model run is just the standard missing data model with a generated quantities block that returns the predicted missing data values. These generated quantities could then be converted to complete datasets like mice returns. If the model is written to accommodate a set of fitting data and (optionally) a set of test data, then the posterior predicted completed datasets could be used as the “test” data.

I feel like I saw an example not too long ago of a Stan model where the data to both fit the model and then test the fitted model were passed together so that a single model estimation gave both results. I guess this is similar except that one set of data has the missing values used to get the estimates of the model parameters and then the other set of data is the posterior predicted complete datasets from an earlier run of the missing values model with a likelihood like:

... missing data likelihood ...

for ( n in 1:post_estimated_draws ) {
  lp_post_est += (1 / post_estimated_draws) * normal_lpdf( y | alpha + X_post_estimated[n, , ] * beta, sigma);
}

I’m sure that there’s some nuance needed to make sure that there isn’t a crossing of the information and other double-dipping worries, but something to this effect seems like it could be an option for using posterior realizations of missing data (like would be returned from mice) without cutting off the information by fitting the imputed data entirely separately. I’m sure there’s something about this specification, though, that makes it problematic as it seems like this would have been implemented elsewhere if it was this straightforward to replace a dependency on the mice package.

I think that from a

have you seen this paper: Modeling racial/ethnic differences in COVID-19 incidence with covariates subject to non-random missingness

here’s the slides from a talk for a general audience about this paper: https://github.com/mitzimorris/epistan_talk_2023_08_08/blob/main/slides.pdf

the experiments in the paper compare the a fully hierarchical model to current approaches to MI. from a workflow perspective, the question is how to decide which model works best giving the nature and amount of missingness in the data.

1 Like

Thanks for sharing the work! I think that I have seen the paper a few times while searching for methods on this topic, but I hadn’t thought of it as a general approach since the math in the paper is beyond my skill. As I follow, the overall workflow itself is indeed fairly general, but the implementation in the paper seems to be more specific to the data problem at hand (e.g., the direct implementation of the conjugate beta prior to the Poisson likelihood). I think that this kind of work really speaks to the motivation in starting this thread.

There’s definitely different types of Stan users, and I put myself pretty firmly in the “applied” camp where my expertise lies in a particular field and Stan is a highly flexible resource for answering questions within this expertise. Wherever possible, I’ll use brms or rstanarm for my analyses, but eventually, I run into situations where I have to write directly in Stan. In my mind, the code I write is very blunt as I’m effectively just trying to break down the generative process into little modules that I know how to code rather than being able to get everything streamlined. For things like “marginalizing out” a discrete variable, I am heavily dependent on example code, which means that I’m doing a lot of trial and error and Googling whenever I run into a situation that isn’t like what I’ve done before.

As a result, where I can generally follow along with the gist of a paper like yours in the sense of getting more or less what the model is doing mechanistically, there is a large competency gap between unpacking the examples you provide and then implementing appropriate solutions for my own problems. Toward that end, I wonder whether you could share how the approach y’all use in the paper would be written in other situations such as with different likelihoods or varying numbers/types of missing covariates?

1 Like