WAIC or model comparison for a big (hierarchical) model: memory-efficient methods?

Hi,

Is there a way to use add_criterion() in brms with a memory-efficient option, something comparable to the loo.function() method? I have a large number of data points (around 300,000) and quite a few parameters (around 700). The model covers firms in countries across time, with random effects (at the country:year level) as well as fixed effects.
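(For reference, a sketch of the calls in question. The brms documentation for its loo()/waic() methods describes a pointwise argument that computes the log-likelihood one observation at a time, which is slower but needs far less memory; this assumes a fitted brmsfit object `fit`:)

```r
library(brms)

# Standard call: builds the full S-draws-by-300,000-observations
# log-likelihood matrix in memory at once.
fit <- add_criterion(fit, criterion = "waic")

# Memory-efficient alternative: pointwise = TRUE (passed through to the
# criterion function) computes the log-likelihood observation by
# observation, trading speed for working memory.
fit <- add_criterion(fit, criterion = "loo", pointwise = TRUE)
```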

Alternatively, do you think it is 'fine' for model comparison purposes to either:

  • Use a sub-sample of the fitted model and calculate the WAIC on that; or
  • Divide the fitted model into 3 sub-periods, say 1994–2001, 2002–2007, and 2008–2017, and then compare the WAIC for these periods across the different models.

I know k-fold is being recommended for hierarchical model comparisons, but given the size of this model I am not sure it is computationally feasible.
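(For what it's worth, a hedged sketch of how grouped k-fold could look in brms for a model like this — `fit` is a fitted brmsfit and `country` a hypothetical grouping variable; per the brms kfold() documentation, Ksub refits only a subset of the folds to reduce cost:)

```r
library(brms)

# Hold out whole groups (here: countries) per fold rather than random
# rows, and refit only 3 of the 10 folds to keep computation manageable.
kf <- kfold(fit, K = 10, folds = "grouped", group = "country", Ksub = 3)
```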

Many thanks,

Hi Ilan,

First, I’m not sure of the answer to your add_criterion() question, but I have a couple of questions that you might want to think about.

700 parameters is a lot. Given that you could estimate these parameters as random effects and use far fewer degrees of freedom, is there a particular reason you’re estimating them as fixed effects?

Regarding time, it depends on your data. If your data points are spread evenly over the 25-year period and you make time categorical (with 3 periods), you’re throwing away good data. Why not include time as a fixed effect with a random slope of time? (It’s also not clear what your variable is within firm: is it continuous or categorical, and is it your DV or your IV?)
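(A random slope of time could be sketched along these lines in brms formula syntax — all variable names here are hypothetical placeholders:)

```r
library(brms)

# Continuous year as a fixed effect, plus a country-level random
# intercept and a country-level random slope of year.
f <- bf(outcome ~ year + predictor + (1 + year | country))
```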

If you’re looking to do k-fold, I haven’t found a way to do it for multiple time series outside of brms, and I also don’t know how to do it in brms (I’m only now learning brms).

loo and brms will soon support subsampling for LOO, as described in Bayesian leave-one-out cross-validation for large data. (We don’t recommend WAIC due to the lack of a good diagnostic for its reliability.)
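(Editor's note for later readers: this has since landed in the loo package as loo_subsample(). A rough sketch of its function-method interface, with hypothetical data and draws objects — not a runnable recipe for the 300,000-observation model above:)

```r
library(loo)

# Pointwise log-likelihood function for the function method: takes one
# row of the data plus the posterior draws, and returns one
# log-likelihood value per draw.
llfun <- function(data_i, draws) {
  dnorm(data_i$y, mean = draws$mu, sd = draws$sigma, log = TRUE)
}

# Subsampled PSIS-LOO over, say, 1000 of the observations; 'mydata' and
# 'post' are placeholders for your data frame and posterior draws.
loo_ss <- loo_subsample(llfun, data = mydata, draws = post,
                        observations = 1000)
```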


Thanks for this @avehtari. I see in the ‘supplementary pdf’ the following note:

The functions are implemented based upon the loo package structure as the functions quickloo(), approx_psis() and psis_approximate_posterior(). An example of how to run the code can be found in the documentation for quickloo(). Author lists, versions and dates have been changed to preserve anonymity. If accepted, the code will be published open source.

I could not find the documentation, though. The 'code' section of the paper took me to the loo homepage, but I couldn’t find anything further there. I also tried reinstalling the loo package from GitHub. Would you be able to share the provisional functions? I am using brms objects. Many thanks.

I wrote “loo and brms will soon support subsampling for LOO”. There will be documentation in loo after that support has been added. I will post here when more information is available.


Yes - it was more wishful thinking on my part(!), since I hope to use this for my dissertation (due soon). Thank you for the update; I look forward to hearing more.

There is a branch supporting this, but it is just a beta version and has not been merged yet. See https://github.com/stan-dev/loo/pull/113. Today I opened a branch of brms to support subsampling LOO, but it is a very early version. See https://github.com/paul-buerkner/brms/tree/ssloo
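(For anyone who wants to experiment with the brms branch linked above, installing from a GitHub branch usually looks like this — unmerged beta code, so interfaces may change without notice:)

```r
# install.packages("remotes")  # if not already available
remotes::install_github("paul-buerkner/brms", ref = "ssloo")
```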


Exciting - thank you, Paul!