Model comparison - large model


I am trying to compare two models using the loo package.
Number of data points = 600,000, post-warmup iterations = 2000, # chains = 10
To compute the log likelihood from all samples, I need a matrix of size 600K x 20K. This would take a very long time and a lot of memory.
Any recommendations to make this more efficient?
Can I use only a small number of iterations instead of all 2000? Any other suggestions?



The “pass a function that evaluates the log-likelihood of the i-th observation” method described at

For models fit to very large datasets we recommend the loo.function method, which is much more memory efficient than the loo.matrix method.
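
To give a sense of the scale involved, here is a rough back-of-the-envelope sketch in Python (it is not loo code). It assumes 8-byte double-precision values and ignores overhead and working copies. The sizes come from the question above; the function names are made up for illustration. With a per-observation function, the log-likelihood values held at any one time are roughly one observation's worth across all draws, rather than the whole matrix.

```python
# Rough memory estimate, assuming 8-byte doubles and no overhead.
N_OBS = 600_000          # data points (from the question)
N_DRAWS = 2_000 * 10     # post-warmup iterations x chains
BYTES_PER_VALUE = 8      # assumption: double precision


def full_matrix_bytes(n_draws, n_obs, width=BYTES_PER_VALUE):
    """Storage for the whole draws-by-observations log-likelihood matrix."""
    return n_draws * n_obs * width


def one_observation_bytes(n_draws, width=BYTES_PER_VALUE):
    """Storage for one observation's log likelihood across all draws."""
    return n_draws * width


matrix_gb = full_matrix_bytes(N_DRAWS, N_OBS) / 1e9
column_kb = one_observation_bytes(N_DRAWS) / 1e3
print(f"full matrix: {matrix_gb:.0f} GB")               # full matrix: 96 GB
print(f"one observation at a time: {column_kb:.0f} KB")  # one observation at a time: 160 KB
```

The posterior draws of the parameters still have to fit in memory either way; only the log-likelihood storage changes.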


Ben’s suggestion is good, too. Here are a couple of other suggestions.

Take a smaller random sample of data points (whatever size is fast enough for you), compute the log likelihood for those, compute elpd_loo for this smaller sample, and use the usual statistical inference to estimate what elpd_loo would be for the whole n=600K. We use this kind of approach successfully in projpred to speed up computation when n is large.
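
One way to read "the usual statistical inference" here is the standard survey-sampling estimate of a population total from a simple random sample without replacement. The Python sketch below is only an illustration under that assumption, not the projpred or loo implementation. `estimate_total_elpd` and `pointwise_elpd` are hypothetical names, and `pointwise_elpd` is a made-up stand-in for the real PSIS-LOO pointwise values.

```python
import math
import random
import statistics


def estimate_total_elpd(pointwise_sample, n_total):
    """Scale the mean pointwise elpd of a simple random sample
    (drawn without replacement) up to all n_total observations.

    Returns (estimate, standard_error), where the standard error
    covers only the error introduced by subsampling.
    """
    m = len(pointwise_sample)
    mean = statistics.fmean(pointwise_sample)
    var = statistics.variance(pointwise_sample)  # sample variance (m - 1)
    fpc = 1.0 - m / n_total                      # finite population correction
    se = n_total * math.sqrt(fpc * var / m)
    return n_total * mean, se


def pointwise_elpd(i):
    """Hypothetical placeholder for the elpd_loo of observation i."""
    return -1.0 - (i % 5) * 0.1


n_total = 600_000
rng = random.Random(1)
idx = rng.sample(range(n_total), 1_000)  # random subset of observations
est, se = estimate_total_elpd([pointwise_elpd(i) for i in idx], n_total)
print(f"elpd_loo estimate: {est:.0f} +/- {se:.0f}")
```

The standard error shrinks as the subsample grows and reaches zero when every observation is included, so the subsample size can be chosen by how much extra uncertainty is acceptable when comparing the two models.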

You can also use fewer iterations (for example by thinning), but check N_eff; I recommend having N_eff>1000 for PSIS-LOO.
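
As a minimal illustration of thinning, the Python sketch below keeps every k-th draw of each chain. The chains hold placeholder draw indices rather than real posterior draws, and `thin` is a hypothetical helper. The number of retained draws is not the same thing as N_eff, which still has to be checked with the usual diagnostics.

```python
def thin(chain_draws, k):
    """Keep every k-th draw of one chain, starting with the first."""
    return chain_draws[::k]


# Placeholder: 10 chains of 2000 post-warmup draws, as in the question.
chains = [list(range(2_000)) for _ in range(10)]

thinned = [thin(chain, 4) for chain in chains]
total_draws = sum(len(chain) for chain in thinned)
print(total_draws)  # 5000
```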


Thanks @bgoodri and @avehtari.
Without rerunning the model with fewer iterations, can I take a smaller random sample of posterior draws, estimate the log likelihood on the full data set, and compare the two models?


You would have to manually throw away draws, which just makes the estimates less precise. Use the loo.function method.


Okay, thanks! I will try that way.