Fitting a hierarchical model to a large dataset: can I trust the variational inference algorithm?



I am trying to fit a hierarchical model of the type A. Gelman often suggests for MRP to a relatively large dataset (>1.5 million observations; two datasets, actually).

I have found it impossible to run an HMC sampler on it: I killed the job after seeing little progress for more than a day. I have tried the meanfield algorithm offered in stan_glmer and it works remarkably well: it is fast, it converges, etc.

I wondered, however, whether I can trust its estimates, since I have seen all kinds of warnings about variational inference. I have compared its results with less complicated models/smaller samples and they seem to be OK in general, but for the larger model I have no benchmark to compare against. Can I present these estimates as sound? Are there any diagnostics I can perform? Any workaround you can suggest?


In my experience, VB gives poor estimates whenever I use it, but it provides relatively good initial points for HMC.

If the main issue is the size of the dataset, I'd suggest reducing it by a factor of 10 to 100, running an HMC fit on the reduced set, and comparing that to a VB fit on the same reduced set. If VB gives similar results on your sub-sample, then you have some reason to believe it might also be working well on your full sample.
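A minimal sketch of that sub-sampling step in R. The data frame and model formula below are toy placeholders, not the poster's actual data, and the fitting calls are left as comments since they depend on the real model:

```r
set.seed(123)

# Toy stand-in for the full dataset (the real one has >1.5 million rows).
df <- data.frame(Y    = rnorm(5000),
                 EDAD = sample(c("25-29", "30-34"), 5000, replace = TRUE))

# Reduce by a factor of 50. Stratified sampling by group would be even
# better, so that small cells are still represented in the sub-sample.
df_small <- df[sample(nrow(df), size = nrow(df) / 50), ]
nrow(df_small)  # 100 rows instead of 5000

# Then fit the same model twice on df_small and compare posterior summaries:
# fit_hmc <- rstanarm::stan_glmer(Y ~ 1 + (1 | EDAD), data = df_small)
# fit_vb  <- rstanarm::stan_glmer(Y ~ 1 + (1 | EDAD), data = df_small,
#                                 algorithm = "meanfield")
```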

In general, I’d be very clear in any writeup that the accuracy of the fit is a concern.


The algorithm = "meanfield" argument to stan_glmer is less bad when the outcome is discrete and when you specify QR = TRUE.
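For what it's worth, a sketch of such a call. The data and variable names here are hypothetical placeholders, and the fit itself is left commented since it requires rstanarm and real data (note that QR = TRUE needs more than one population-level predictor):

```r
set.seed(7)

# Hypothetical discrete-outcome data; x1, x2, group, and dat are placeholders.
dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200),
                  group = sample(letters[1:5], 200, replace = TRUE))
dat$y <- rbinom(200, 1, plogis(0.5 * dat$x1 - 0.3 * dat$x2))

# Meanfield VB with the QR reparameterization of the design matrix:
# fit <- rstanarm::stan_glmer(y ~ x1 + x2 + (1 | group), data = dat,
#                             family = binomial(link = "logit"),
#                             QR = TRUE, algorithm = "meanfield")
```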


We don’t really have a good workflow for diagnosing these models, I’m afraid.

The best you can do is fit with simulated data that you think is like your data. And for that, you can’t really validate coverage/calibration because we know variational methods underestimate uncertainty systematically.
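A base-R sketch of that simulated-data check. All names are hypothetical, and the fit is left commented since it should mirror whatever call is used on the real data:

```r
set.seed(42)

# Known truth: 20 group intercepts drawn from N(0, 0.5^2).
sigma_group <- 0.5
alpha <- rnorm(20, mean = 0, sd = sigma_group)

sim <- data.frame(g = rep(seq_len(20), each = 50))
sim$Y <- rnorm(nrow(sim), mean = alpha[sim$g], sd = 1)

# Refit with the same meanfield call used on the real data and check whether
# the known alpha values fall inside the reported intervals. Because VB
# underestimates uncertainty, expect the intervals to under-cover:
# fit <- rstanarm::stan_glmer(Y ~ 1 + (1 | g), data = sim,
#                             algorithm = "meanfield")
# rstanarm::posterior_interval(fit, prob = 0.9)
```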

How many iterations were you trying to run for? It may also depend on the way the problem’s coded. There are a lot of ways efficiency can be improved but of course we have no basis for making suggestions without seeing your program.



I tried some of the suggestions and finally found a way forward. I looked at how Yajuan Si, Rob Trangucci, Jonah Sol Gabry, and Andrew Gelman proceed in their recent paper; as you can see (p. 21), they work in two steps:

  • They first estimate cell-specific means, and record each cell's size, variance, etc.
  • They then fit the model using the aggregate estimates as inputs.

This is definitely compatible with my approach, since it drastically reduces the dimensionality of my data (from 1e6 observations to between 3,000 and 6,000 cells).
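The first of those two steps can be sketched in base R as follows. The raw data frame and its column names here are toy placeholders for the individual-level data:

```r
set.seed(1)

# Toy individual-level data standing in for the ~1e6 observations.
raw <- data.frame(PROV = sample(c("Alava", "Madrid"), 10000, replace = TRUE),
                  EDAD = sample(c("25-29", "30-34"), 10000, replace = TRUE),
                  y    = rbinom(10000, 1, 0.3))

# Step 1: collapse to one row per cell, recording the cell mean (Y),
# standard deviation (sd_cell), and size (n).
cells <- do.call(rbind, lapply(split(raw, list(raw$PROV, raw$EDAD)), function(d) {
  data.frame(PROV = d$PROV[1], EDAD = d$EDAD[1],
             Y = mean(d$y), sd_cell = sd(d$y), n = nrow(d))
}))

# Step 2 then fits the hierarchical model to `cells` instead of `raw`,
# passing n and sd_cell in through the structured prior.
```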

However, to introduce the cell size and cell variance in the second step, they use the prior_covariance argument to define the random effects, via a function rstanarm::mrp_structured, which I believe is not available in my version of rstanarm (version 2.15.3, packaged 2017-04-29 06:18:44 UTC, which I believe is the latest one).

How could I define a similar covariance matrix? I have tried looking at the rstanarm manual, but could not understand how the prior_covariance argument is defined.

prior_covariance = mrp_structured(
  cell_size = dat_rstanarm$n,
  cell_sd = dat_rstanarm$sd_cell,
  group_level_scale = 1,
  group_level_df = 1
)

Below is what my data currently look like:

Observations: 2,907
Variables: 8
$ PROV         <fctr> Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Ála...
$ EDAD         <fctr> 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 30-34, 30-34, 30-...
$ gender_woman <int> 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,...
$ edu_cis      <fctr> 1.primary_or_less, 2.secondary_early, 3.secondary_upper, 4.upper, 1.prima...
$ sd_cell      <dbl> 0.0000000, 0.4495387, 0.5030770, 0.3661515, 0.3779645, 0.4961389, 0.426020...
$ n            <int> 3, 87, 82, 252, 7, 40, 56, 273, 103, 108, 306, 69, 109, 351, 5, 166, 142, ...
$ Y            <dbl> 0.00000000, 0.27586207, 0.50000000, 0.15873016, 0.14285714, 0.40000000, 0....
$ cell_id      <chr> "Álava25-2901.primary_or_less", "Álava25-2902.secondary_early", "Álava25-2...

And here is what my first sketch of code looks like without the prior_covariance argument.

st_1115 <- stan_glmer(formula = Y ~ 1 +
                        (1 | EDAD),  # further group-level terms would go here
                      data = df, iter = 1000, chains = 2, cores = 2,
                      prior_aux = cauchy(0, 5),
                      prior_intercept = normal(0, 100, autoscale = FALSE))

Thank you in advance!



Correct. You would have to install that branch of rstanarm from GitHub.


Thanks for the prompt answer!

How should I proceed? Should I install the development version as indicated here?



It is not in the development version. You need to call

devtools::install_github("stan-dev/rstanarm", ref = "structured_prior_merge",
                         args = "--preclean", build_vignettes = FALSE)


Hi Ben,

Sorry for being such a pain in the ass, but I am getting the following error message:

Installation failed: Could not find build tools necessary to build rstanarm 

Is there any dependency I need to install previously? I have devtools installed and also the latest version of rstanarm… Thanks again!


Sounds like Windows. Can you execute programs via rstan::stan?


It is weird. I have tried the same thing on my Mac mini (I am writing from a MacBook Pro now) and it worked.

I am able to call standard stan scripts from rstan, yes.


Try it from the Terminal application rather than RStudio.


Thanks! I found out why it was not working: I needed to agree to the Xcode license, which it seems I had not done. In any case, thanks for your patience; I am leaving this running tonight and hope it will work.

Thanks again!