Fitting a large data hierarchical model-can I trust variational inference algorithm?

lguirola · September 4, 2017, 6:59pm

Hi,

I am trying to fit a model with a (two, actually) relatively large (>1.5 million) dataset with a hierarchical model of the type that A. Gelman often suggests to do MRP.

I have found it impossible to run a HMC sampler on it-I killed the job after seeing little progress for more than 1 day. I have tried to use the meanfield algorithm proposed in stan_glmer and it works remarkably well-it’s fast, and it converges, etc.

I wondered, however, if I can trust its estimates -I have seen all kinds of warnings. I have tried to compare its results with less complicated models/smaller samples and it seems to be ok, in general. But for the larger model, I have no benchmark to compare. I can I present these estimates as ok? Are there any diagnostics I can perform? Any workaround you can suggest?

dlakelan · September 4, 2017, 9:50pm

In my experience vb gives poor estimates whenever I use it, but it gives relatively good initial points for HMC.

If the main issue is the size of the dataset, I’d suggest as a way to compare things, to reduce your dataset by a factor of 10 to 100 and run an HMC fit on this, and compare to the vb fit on the same reduced set. If vb gives similar results in your sub-sample, then you have maybe some reason to believe it might be working well in your full sample.

In general, I’d be very clear in any writeup that accuracy of the fit is a concern.

bgoodri · September 5, 2017, 12:24am

The algorithm = "meanfield" argument to stan_glmer is less bad when the outcome is discrete and when you specify QR = TRUE.

Bob_Carpenter · September 10, 2017, 11:44am

We don’t really have a good workflow for diagnosing these models, I’m afraid.

The best you can do is fit with simulated data that you think is like your data. And for that, you can’t really validate coverage/calibration because we know variational methods underestimate uncertainty systematically.

How many iterations were you trying to run for? It may also depend on the way the problem’s coded. There are a lot of ways efficiency can be improved but of course we have no basis for making suggestions without seeing your program.

lguirola · September 24, 2017, 2:34am

Hello,

I tried some of the suggestions and finally found a way forward in this problem. I looked at how Yajuan Si, Rob Trangucci, Jonah Sol Gabry, and Andrew Gelman proceed in their recent paper http://www.stat.columbia.edu/~gelman/research/unpublished/modelweighting.pdf and as you can see (pg21) they proceed in 2 steps:

They first estimate cell-specific means, and record the size of the cell, the variance, etc.
They then fit the model using aggregate estimates as imputs.

This is definitely compatible with my approach, since it drastically reduces the dimensionality of my data (from 1e6observations, to between 6000 and 3000 cells).

However, to introduce the cell size and cell variance in the second step, they use the prior_covariance argument to define random effect, using a function rstanarm::mrp_structured which I believe is not available in my version of rstanarm, Version 2.15.3, packaged: 2017-04-29 06:18:44 UTC, which I believe is the latest one.

How could I define a similar covariance matrix? I have tried to look at the rstanarm manual, but could not understand how the prior_covariance argument is defined.

prior_covariance =
rstanarm::mrp_structured(
cell_size = dat_rstanarm$n,
cell_sd = dat_rstanarm$sd_cell,
group_level_scale = 1,
group_level_df = 1
)

Below I include what my code currently looks like:

Observations: 2,907
Variables: 8
$ PROV         <fctr> Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Álava, Ála...
$ EDAD         <fctr> 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 25-29, 30-34, 30-34, 30-...
$ gender_woman <int> 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,...
$ edu_cis      <fctr> 1.primary_or_less, 2.secondary_early, 3.secondary_upper, 4.upper, 1.prima...
$ sd_cell      <dbl> 0.0000000, 0.4495387, 0.5030770, 0.3661515, 0.3779645, 0.4961389, 0.426020...
$ n            <int> 3, 87, 82, 252, 7, 40, 56, 273, 103, 108, 306, 69, 109, 351, 5, 166, 142, ...
$ Y            <dbl> 0.00000000, 0.27586207, 0.50000000, 0.15873016, 0.14285714, 0.40000000, 0....
$ cell_id      <chr> "Álava25-2901.primary_or_less", "Álava25-2902.secondary_early", "Álava25-2...

And here is what my first sketch of code looks like without the prior_covariance argument.

st_1115 <- stan_glmer(formula = Y~ 1 +
                                    (1|EDAD) +
                                    (1|PROV)+
                                    (1|gender_woman)+
                                    (1|edu_cis),
                      data = df,iter = 1000, chains =2, cores = 2,
                      prior_aux = cauchy(0,5),
                      prior_intercept = normal(0,100,autoscale = FALSE)
                          )

Thank you in advance!

lmg

bgoodri · September 24, 2017, 4:12am

Correct. You would have to install that branch of rstanarm from GitHub.

lguirola · September 24, 2017, 4:17am

Thanks for the prompt answer!

How should I proceed? Should I install the development version as indicated here https://github.com/stan-dev/rstanarm ?

Best

bgoodri · September 24, 2017, 4:28am

It is not in the development version. You need to call

devtools::install_github("stan-dev/rstanarm", ref = "structured_prior_merge",
                         args = "--preclean", build_vignettes = FALSE)

lguirola · September 24, 2017, 4:51am

Hi Ben,

Sorry for being such a pain in the ass, but I am getting the following error message:

Installation failed: Could not find build tools necessary to build rstanarm

Is there any dependency I need to install previously? I have devtools installed and also the latest version of rstanarm… Thanks again!

bgoodri · September 24, 2017, 4:52am

Sounds like Windows. Can you execute programs via rstan::stan?

lguirola · September 24, 2017, 4:58am

It is weird. I have tried the same thing on my mac mini (i am writing from a macbook pro now) and it worked.

I am able to call standard stan scripts from rstan, yes.

bgoodri · September 24, 2017, 5:04am

Try it from the Terminal application rather than RStudio.

lguirola · September 24, 2017, 5:32am

Thanks! I found out why it was not working. I needed to agree to the xcode license, which it seems I had not. In any case, thanks for your patience, I am leaving this running tonight and hope it will work.

Thanks again!

lmg

lzhang28 · October 7, 2021, 4:03pm

Hi,

I am trying to install the branch of rstanarm from GitHub using

devtools::install_github("stan-dev/rstanarm", ref = "structured_prior_merge",
                         args = "--preclean", build_vignettes = FALSE)
#> Downloading GitHub repo stan-dev/rstanarm@structured_prior_merge
#> 
#>          checking for file 'C:\Users\liliz\AppData\Local\Temp\RtmpARaLoj\remotes415c62563ed5\stan-dev-rstanarm-b46c14f/DESCRIPTION' ...  v  checking for file 'C:\Users\liliz\AppData\Local\Temp\RtmpARaLoj\remotes415c62563ed5\stan-dev-rstanarm-b46c14f/DESCRIPTION'
#>       -  preparing 'rstanarm':
#>    checking DESCRIPTION meta-information ...     checking DESCRIPTION meta-information ...   v  checking DESCRIPTION meta-information
#> -  cleaning src
#>       -  running 'cleanup.win'
#>       -  checking for LF line-endings in source and make files and shell scripts (2.8s)
#>       -  checking for empty or unneeded directories
#>       -  building 'rstanarm_2.15.4.tar.gz'
#>   Warning:     Warning: file 'rstanarm/cleanup' did not have execute permissions: corrected
#>      
#> 
#> Installing package into 'C:/Users/liliz/OneDrive/Documents/R/win-library/4.0'
#> (as 'lib' is unspecified)
#> Warning in i.p(...): installation of package 'C:/Users/liliz/AppData/Local/Temp/
#> RtmpARaLoj/file415ca91274/rstanarm_2.15.4.tar.gz' had non-zero exit status
#> Warning: 1 components of `...` were not used.
#> 
#> We detected these problematic arguments:
#> * `args`
#> 
#> Did you misspecify an argument?

Can you help me figure it out? Thanks!

Topic		Replies	Views
Formulating a hierarchical multivariate Gaussian process using the Hilbert space approximation Modeling cmdstanr , gaussian-process	8	440	February 7, 2025
Question: Stan Modelspecification for Hierarchical Linear Model (Gelman's Radon example) RStan rstan	13	1221	November 19, 2019
Cross-classified Hierarchical Model, Heterogenous Variances Modeling	19	1752	August 11, 2018
Help with understanding my results Modeling fitting-issues , specification , performance	5	1271	February 8, 2019
My first Stan model - hierarchical logistic regression Modeling cognitive-science	28	8744	June 6, 2017

Fitting a large data hierarchical model-can I trust variational inference algorithm?

Related topics