Recommendations for pooling penalized model with multiple imputation

I have a dataset with missing observations so I used mice to perform my multiple imputation.

I ran a model specifying a lasso prior to do penalized regression with the bayesreg package, which is based on rstan.

The recommendation by the author of the package is to:

Currently, missing data is not supported. However, it is possible to impute the data before fitting the model, for example using the mice package. The model can then be fitted to each of the imputed data sets separately and the resulting posterior draws can be combined.

This seemed simple enough until I realized I wasn't sure of the best way to do the pooling.

So I wrote a for loop to fit a model to each of my imputed datasets:

data_sets = 10 # create 10 imputed datasets

anes_2012 = mice::mice(anes_2012, m = data_sets, method = 'rf') # perform the mice imputation

anes_2012 = miceadds::mids2datlist(anes_2012) # convert the mice mids object to a list of data frames for the for loop

anes_2012_models = list() # initialize an empty list called anes_2012_models

for(d in 1:data_sets){
   model = bayesreg(wid ~ rboff + unpast + ecfamily + inflwhite + govbias + whitediscrim + pid + female + age + edu, data = anes_2012[[d]], prior = 'lasso') # for each imputed dataset in anes_2012, fit this model with a lasso shrinkage prior
   anes_2012_models[[d]] = model # save each fit as a new element of anes_2012_models
}

But from here, I wasn’t sure what the best approach would be. Should I extract my estimates of beta and sigma2, take the median within each model, and then average those medians across models? And how would I then present something like pooled credible intervals?

Is there a way to do all of this with the tidybayes package?

If it helps, I’ve included a small subset (n = 50) of my original dataset, with the missingness, so that you can quickly run the code and see what the anes_2012_models list object looks like.

dput(anes_2012)
structure(list(wid = c(3L, 1L, 2L, 1L, 4L, 1L, 5L, 1L, 3L, 1L, 
2L, 5L, 2L, NA, 3L, NA, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 2L, 5L, 1L,
4L, 4L, 2L, 4L, 5L, 4L, 2L, 4L, 1L, 2L, 1L, 3L, 3L, 2L, 3L, NA,
3L, 3L, 3L, 2L, 1L, 2L, NA, 5L), female = c(0L, 0L, 0L, 1L, 1L,
1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L,
1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L,
1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L), rboff = c(1L,
1L, 1L, -1L, -1L, 0L, -1L, 0L, 2L, -1L, 1L, -2L, -2L, 1L, 0L,
1L, -1L, -2L, 1L, -1L, -1L, -2L, -1L, -1L, -1L, 2L, 0L, 1L, 1L, 
0L, -1L, -1L, -2L, 2L, -1L, 0L, 0L, 0L, 1L, -1L, 1L, 1L, 1L,
0L, 0L, -1L, -1L, -1L, 1L, 1L), pboff = c(1L, 0L, 0L, 0L, 0L,
-1L, 1L, 0L, 2L, 2L, 2L, 0L, -1L, 0L, 0L, 0L, 2L, 0L, 0L, 2L,
NA, 0L, 0L, 1L, -1L, 0L, NA, 1L, 1L, 0L, 1L, 1L, 2L, 2L, 0L,
0L, 1L, 1L, 0L, 0L, NA, 0L, 0L, 2L, 0L, 2L, -1L, 0L, 1L, 1L),
    ideo = c(6L, 4L, 5L, 3L, 4L, 6L, 6L, 6L, NA, NA, 6L, NA,
    3L, NA, 5L, 4L, 6L, 5L, NA, 5L, 3L, 6L, 6L, 5L, 6L, 5L, 6L,
    NA, 4L, 2L, 7L, 1L, 6L, 4L, 2L, 4L, 5L, 5L, NA, NA, 3L, NA,
    2L, 2L, 3L, 6L, 7L, 6L, NA, 5L), epast = c(-1L, 0L, 1L, 0L,
    0L, -1L, 0L, -2L, 1L, -1L, 0L, -2L, 0L, 0L, -1L, 0L, -1L,
    -2L, 0L, 0L, 1L, 0L, -1L, -2L, -2L, 1L, 1L, 0L, 1L, 1L, -2L,
    -1L, 0L, 0L, 1L, 0L, 0L, -2L, -1L, -1L, 1L, 1L, 0L, 0L, 0L,
    -1L, -2L, 0L, -1L, 2L), efuture = c(0L, 1L, 1L, 1L, -1L,
    1L, 1L, 0L, 2L, 0L, 0L, -2L, -2L, 1L, 0L, 0L, 0L, -1L, 0L,
    1L, 1L, 0L, 0L, 0L, -2L, 1L, NA, 1L, 1L, 1L, -2L, 0L, 0L,
    -1L, 0L, NA, 0L, 0L, -1L, 0L, NA, 0L, 0L, 0L, 1L, 0L, -2L,
    1L, -1L, 1L), unpast = c(-1L, 1L, 1L, -1L, -1L, -2L, -1L,
    -2L, 1L, -2L, 0L, -1L, -1L, -1L, 2L, -1L, -1L, -2L, -2L,
    0L, 0L, -2L, -1L, -1L, -1L, 0L, 0L, 0L, 1L, 1L, -2L, -2L,
    0L, -2L, 0L, 1L, 1L, -1L, -2L, -1L, 1L, 1L, 1L, 0L, 1L, -1L, 
    -2L, 0L, -2L, 1L), unnext = c(1L, 0L, 1L, 2L, 1L, 1L, 0L,
    2L, 0L, 1L, 0L, 0L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 0L, NA, 1L,
    1L, 0L, 2L, 1L, NA, 0L, 0L, 0L, 2L, 0L, 1L, 1L, 1L, 1L, 1L,
    1L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 2L, 2L, NA, 0L, 2L), pid = c(6L,
    5L, 5L, 2L, 4L, 7L, 6L, 6L, 1L, 6L, 7L, 3L, 2L, 1L, 4L, 2L,
    5L, 7L, 4L, 3L, 1L, 5L, 6L, 7L, 5L, 6L, 6L, 4L, 2L, 1L, 7L,
    1L, 7L, 5L, 1L, 4L, 6L, 5L, 4L, 7L, 2L, 4L, 1L, 7L, 2L, 5L,
    7L, 7L, 3L, 2L), age = c(49L, -2L, 47L, 64L, 56L, 59L, 61L,
    51L, 61L, 34L, 25L, 34L, 29L, 85L, 77L, 51L, 53L, 83L, 56L,
    53L, 51L, 64L, 32L, 30L, 62L, 30L, 42L, 30L, 65L, 77L, 39L,
    49L, 49L, 24L, 80L, 23L, 43L, 53L, 29L, 40L, 18L, 52L, 63L,
    -2L, -2L, 28L, 56L, 84L, 51L, 34L), edu = c(5L, 4L, 5L, 3L,
    3L, 4L, 5L, 5L, 2L, 1L, 3L, 2L, 2L, 1L, 1L, 3L, 3L, 3L, 1L,
    3L, 4L, 4L, 2L, 4L, 4L, 3L, 3L, 3L, 4L, 5L, 1L, 1L, 4L, 3L,
    NA, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 5L, 4L, 4L, 3L, 2L, 3L, 1L,
    1L), findjob = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 5L,
    NA, NA, NA, NA, NA, NA, NA, 2L, NA, NA, NA, NA, NA, NA, NA,
    NA, NA, NA, NA, NA, 4L, NA, NA, NA, NA), losejob = c(3L, 
    2L, 1L, NA, NA, 1L, 1L, 1L, NA, NA, 1L, NA, 2L, NA, NA, 1L,
    3L, 1L, 1L, 3L, NA, 5L, 2L, NA, NA, 1L, 2L, NA, NA, NA, 3L,
    NA, NA, 2L, NA, 1L, 3L, NA, 1L, 2L, 2L, 1L, 2L, NA, 2L, NA,
    3L, NA, 3L, 1L), offwork = c(1L, 0L, 0L, NA, NA, 0L, 0L,
    0L, NA, NA, 0L, NA, 0L, NA, NA, 0L, 0L, 0L, 0L, 1L, 0L, NA,
    0L, NA, NA, 0L, 0L, NA, NA, NA, 0L, NA, NA, 0L, NA, 0L, 0L,
    NA, 0L, 0L, 1L, 0L, 0L, NA, 0L, NA, 0L, NA, 0L, 0L), fairjblacks = c(-1L,
    -2L, -1L, NA, -2L, -2L, NA, -1L, -2L, 2L, -2L, -2L, 2L, NA,
    2L, NA, -2L, -2L, -2L, -2L, -1L, -2L, 2L, -2L, -2L, 1L, -1L,
    2L, NA, -2L, -2L, 2L, -2L, -1L, 2L, 1L, -1L, -2L, -1L, -2L,
    2L, NA, 2L, 1L, 1L, -1L, -2L, -1L, NA, NA), immtakejobs = c(2L,
    3L, 2L, 2L, 1L, 3L, 3L, 3L, 2L, 4L, 1L, 4L, 3L, NA, 1L, NA,
    3L, 1L, 3L, 2L, 3L, 2L, 2L, 1L, 3L, 2L, 1L, 4L, 1L, 2L, 4L,
    1L, 2L, 2L, 4L, 2L, 2L, 4L, 2L, 2L, 3L, NA, 1L, 2L, 2L, 3L,
    2L, 3L, NA, 2L), ecfamily = c(1L, -1L, -2L, -1L, 1L, 0L,
    0L, 1L, 2L, 1L, 0L, 0L, 2L, NA, -2L, NA, -2L, 0L, 1L, 1L,
    2L, 1L, -1L, 2L, 0L, -2L, 0L, -1L, -1L, -2L, 0L, 0L, -1L,
    -1L, -2L, -1L, 1L, 0L, 0L, 0L, 1L, NA, 0L, -1L, 2L, -1L,
    0L, 0L, NA, 2L), linked = c(0L, 1L, 1L, NA, 0L, 1L, 1L, 1L, 
    0L, 0L, 0L, 1L, 0L, NA, 0L, NA, 1L, 0L, 0L, 1L, 1L, 1L, 1L,
    0L, 1L, NA, 1L, NA, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, NA, 1L,
    NA, 0L, 0L, NA, 1L, 0L, 1L, 1L, 0L, 1L, NA, NA), tradbreak = c(1L,
    1L, 1L, 1L, 1L, 2L, NA, 2L, 1L, 0L, -2L, 1L, 1L, NA, 2L,
    NA, 2L, 2L, 1L, 0L, 1L, 2L, 0L, 2L, 2L, -1L, 2L, 2L, 2L,
    -1L, 1L, -2L, -2L, -2L, 0L, 1L, -1L, 2L, 0L, 0L, 1L, NA,
    -2L, -1L, -1L, 1L, 2L, 1L, NA, 2L), toofar = c(-1L, 2L, 2L,
    -1L, -1L, 1L, NA, 2L, -2L, -2L, 1L, 1L, 0L, NA, 0L, NA, -1L,
    0L, 1L, -2L, -2L, 1L, 0L, 2L, 2L, -2L, 1L, -1L, -1L, -1L,
    2L, -2L, -1L, 0L, 1L, -2L, 1L, -1L, 0L, 0L, 2L, NA, -2L,
    1L, -2L, 0L, 2L, -1L, NA, -1L), govbias = c(0L, 0L, 0L, 0L,
    0L, 0L, NA, 1L, 0L, 0L, 0L, 0L, 0L, NA, 0L, NA, 0L, 0L, 0L,
    0L, 0L, 0L, 1L, 0L, 1L, 0L, NA, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
    0L, 0L, 0L, 1L, 0L, 0L, 0L, NA, 0L, -1L, 0L, 0L, 1L, NA,
    NA, 0L), inflwhite = c(0L, 0L, -1L, 0L, 1L, 0L, 0L, 0L, 0L,
    0L, 0L, 0L, 0L, NA, 0L, NA, 0L, 0L, -1L, 1L, 0L, 0L, -1L,
    0L, -1L, 1L, 0L, 0L, 0L, 0L, -1L, 0L, 0L, 0L, 0L, 0L, 0L,
    0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L, NA, 0L), inflblack = c(0L,
    0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, NA, 0L, NA, 
    0L, -1L, 0L, -1L, 0L, 1L, 1L, 0L, 1L, -1L, 0L, 0L, -1L, 0L,
    1L, 0L, -1L, 0L, 0L, -1L, 0L, 1L, 0L, 0L, 0L, NA, 0L, 0L,
    0L, 0L, 1L, 0L, NA, 0L), whitediscrim = c(-2L, -2L, 1L, -1L,
    0L, -1L, -2L, 0L, 0L, -2L, -1L, 2L, -1L, NA, -1L, NA, 0L,
    -2L, 0L, -1L, -1L, 0L, 0L, -1L, 2L, -1L, -1L, -2L, -1L, -2L,
    1L, -2L, -1L, -1L, -1L, -1L, -1L, 0L, -2L, -1L, -2L, NA,
    -2L, -1L, -1L, 0L, -2L, 0L, NA, -2L)), class = "data.frame", row.names = c(NA,
-50L))

I’m super new to rstan and to penalized regression, so my apologies if this is a dumb question.


The quote from the package author describes the standard approach:

That is, impute the data and, for each imputed data set, generate posterior draws. Then just throw all those draws into one big pile for inference. So you should be able to read the draws in as data frames and rbind() them. But that may not leave things in a form that our posterior analysis packages can handle.
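In base R terms, the "one big pile" idea is just concatenating the draws across models; a minimal sketch with made-up numbers (toy draws, not from the real model):

```r
set.seed(1)
# pretend we fit 3 imputed-data models, each yielding 1000 posterior draws
# of a single coefficient (fake normal draws standing in for MCMC output)
draws_per_model <- lapply(1:3, function(m) rnorm(1000, mean = 0.2, sd = 0.1))

pooled <- do.call(c, draws_per_model)   # one big pile of 3000 draws

# the pooled median and a 95% credible interval come straight from quantiles
quantile(pooled, c(0.025, 0.5, 0.975))
```

The pooled quantiles automatically reflect both within-imputation and between-imputation uncertainty, which is why simply stacking the draws works.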

No idea. But @mjskay wrote the package, so he should know.


I’ll follow @Bob_Carpenter’s suggestion to throw all the draws together, and I’ll use the example code you provided to show how to do this with the beta parameters.

First, it helps to see how to do it with one model. The betas from one model look like this:

> str(anes_2012_models[[1]]$beta)
 num [1:10, 1:1000] 0.0398 -0.0173 0.087 -0.1759 0.31 ...

The dimension for iterations (1:1000) is last, which is the opposite of how {tidybayes} and {posterior} expect matrices of draws to be structured. So, we have to transpose this matrix with t() to move that dimension to the front. Given the transposed matrix, you can use either tidybayes::tidy_draws() or posterior::as_draws_df() to convert into a data frame of draws; e.g.:

> posterior::as_draws_df(t(anes_2012_models[[1]]$beta))
# A draws_df: 1000 iterations, 1 chains, and 10 variables
     ...1    ...2     ...3   ...4   ...5    ...6    ...7   ...8
1   0.040 -0.0173  0.08698 -0.176  0.310  0.0821 -0.0243 -0.055
2  -0.073 -0.0982 -0.04095 -0.229 -0.669  0.1062 -0.0385 -0.115
3   0.091 -0.0411  0.04597 -0.182  0.301  0.2498 -0.1360 -0.059
4  -0.077 -0.0164  0.04660  0.072  0.089 -0.0632  0.0055 -0.342
5  -0.067  0.0141 -0.00810  0.213  0.803 -0.0118 -0.0289 -0.048
6  -0.071 -0.0266  0.09155  0.046 -0.148 -0.0174  0.0114 -0.066
7  -0.045 -0.0702  0.00095 -0.197 -0.200  0.0243  0.0332 -0.044
8  -0.098  0.0047  0.04084 -0.520  0.262 -0.0028 -0.0028  0.012
9  -0.189 -0.0459 -0.06672 -0.195 -0.238  0.2645 -0.1209 -0.346
10  0.100 -0.0017  0.08825  0.156 -0.169  0.0696  0.0050 -0.159
# ... with 990 more draws, and 2 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}

As for throwing all the draws together from each model, there are plenty of ways to do it, depending on your comfort with different data structures. If you are comfortable with manipulating matrices in R, I’d probably just bind all the matrices together upfront and then convert to a data frame. Something like this:

> betas_matrix = do.call(cbind, lapply(anes_2012_models, `[[`, "beta"))
> str(betas_matrix)
 num [1:10, 1:10000] 0.0398 -0.0173 0.087 -0.1759 0.31 ...

(If you haven’t seen something like lapply(anes_2012_models, `[[`, "beta") before, it says to create a new list that is the result of doing x[["beta"]] for every x in anes_2012_models.)
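A toy illustration of that extraction idiom, using a made-up list of "models" (just named lists here, not real bayesreg fits):

```r
# two fake "model" objects, each a list with beta and sigma2 components
models <- list(list(beta = 1:3, sigma2 = 0.5),
               list(beta = 4:6, sigma2 = 0.7))

# pull out the beta component from every element of the list
lapply(models, `[[`, "beta")
# equivalent to: lapply(models, function(m) m[["beta"]])
```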

Given betas_matrix, you then do tidybayes::tidy_draws(t(betas_matrix)) or posterior::as_draws_df(t(betas_matrix)) to get the data frame version.
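One optional tweak: bayesreg does not appear to name the rows of $beta (hence the ...1, ...2, etc. column names in the output above), so you could assign row names before transposing to get meaningful variable names in summaries. A sketch with a toy stand-in matrix, assuming the rows of $beta follow the order of terms in the model formula:

```r
# predictor names in the order they appear in the model formula
# (an assumption -- check this against your fitted model)
predictors <- c("rboff", "unpast", "ecfamily", "inflwhite", "govbias",
                "whitediscrim", "pid", "female", "age", "edu")

# toy stand-in for betas_matrix: 10 coefficients x 20 pooled draws of fake numbers
demo_betas <- matrix(rnorm(10 * 20), nrow = 10,
                     dimnames = list(predictors, NULL))

# after transposing, the draws matrix carries real coefficient names,
# which {posterior} and {tidybayes} will use as variable names
colnames(t(demo_betas))
```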

Alternatively, you could convert to a list of data frames first, and then bind them together. If you do that, I would use posterior::bind_draws() instead of rbind() to do it, because you need to re-number the .draw column in order for the result to be usable by some packages (e.g., {bayesplot}). So while you could do do.call(rbind, betas_dfs) as the second step and have it mostly work, the slightly more complicated second step below is safer:

> betas_dfs = lapply(anes_2012_models, function(m) posterior::as_draws_df(t(m$beta)))
> betas_df = do.call(posterior::bind_draws, c(betas_dfs, list(along = "draw")))
> betas_df
# A draws_df: 10000 iterations, 1 chains, and 10 variables
     ...1    ...2     ...3   ...4   ...5    ...6    ...7   ...8
1   0.040 -0.0173  0.08698 -0.176  0.310  0.0821 -0.0243 -0.055
2  -0.073 -0.0982 -0.04095 -0.229 -0.669  0.1062 -0.0385 -0.115
3   0.091 -0.0411  0.04597 -0.182  0.301  0.2498 -0.1360 -0.059
4  -0.077 -0.0164  0.04660  0.072  0.089 -0.0632  0.0055 -0.342
5  -0.067  0.0141 -0.00810  0.213  0.803 -0.0118 -0.0289 -0.048
6  -0.071 -0.0266  0.09155  0.046 -0.148 -0.0174  0.0114 -0.066
7  -0.045 -0.0702  0.00095 -0.197 -0.200  0.0243  0.0332 -0.044
8  -0.098  0.0047  0.04084 -0.520  0.262 -0.0028 -0.0028  0.012
9  -0.189 -0.0459 -0.06672 -0.195 -0.238  0.2645 -0.1209 -0.346
10  0.100 -0.0017  0.08825  0.156 -0.169  0.0696  0.0050 -0.159
# ... with 9990 more draws, and 2 more variables
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}

Then you should be able to use posterior::summarise_draws() on the result:

> posterior::summarise_draws(betas_df)
# A tibble: 10 × 10
   variable      mean    median      sd     mad       q5     q95  rhat ess_bulk ess_tail
   <chr>        <dbl>     <dbl>   <dbl>   <dbl>    <dbl>   <dbl> <dbl>    <dbl>    <dbl>
 1 ...1     -0.00789  -0.00482  0.0911  0.0709  -0.164   0.138    1.00    1946.    9397.
 2 ...2     -0.00818  -0.00400  0.0899  0.0687  -0.158   0.134    1.00    3010.    8798.
 3 ...3      0.0328    0.0227   0.0865  0.0703  -0.0962  0.186    1.00    1007.    7968.
 4 ...4     -0.121    -0.0857   0.265   0.213   -0.609   0.267    1.00    8863.    9647.
 5 ...5     -0.0142   -0.00835  0.285   0.224   -0.491   0.444    1.01     562.    7651.
 6 ...6      0.0745    0.0544   0.116   0.0962  -0.0832  0.296    1.00    1742.    7226.
 7 ...7     -0.0325   -0.0234   0.0549  0.0451  -0.137   0.0434   1.01    3517.    8856.
 8 ...8     -0.212    -0.166    0.255   0.225   -0.695   0.121    1.00    1011.    6938.
 9 ...9     -0.000628 -0.000369 0.00484 0.00384 -0.00911 0.00709  1.00    9439.    7883.
10 ...10    -0.0246   -0.0166   0.0822  0.0661  -0.170   0.100    1.00     640.    9614.

Or pass it into {bayesplot} functions:

bayesplot::mcmc_areas(betas_df)

Or make some visualizations with tidybayes (here we go through gather_variables() to get a long-format data frame):

betas_df |>
  gather_variables() |>
  ggplot(aes(x = .value, y = .variable)) +
  # normalize density within groups b/c variance is quite different across variables
  # otherwise you can't see the densities for anything but `...9`
  stat_halfeye(normalize = "groups")


This is great! Thank you @Bob_Carpenter and @mjskay.

I’m somewhat new to Bayesian models and wasn’t sure about the proper way to do this.

This is all great. I really appreciate everyone’s time to help me out with this.
