Model comparison - large model

loo

#1

Hi,
I am trying to compare two models using the loo package.
Number of data points = 600,000, post-warmup iterations = 2000, # chains = 10
To compute the log likelihood for all draws, I need a matrix of size 600K x 20K, which would take a very long time and a lot of memory.
Any recommendations to make this more efficient?
Can I use only a small number of iterations instead of all 2000? Any other suggestions?

Thanks!


#2

I'd suggest the “pass a function that evaluates the log-likelihood of the i-th observation” method described at
http://mc-stan.org/loo/reference/loo.html

For models fit to very large datasets, we recommend this loo.function method, which is much more memory efficient than the loo.matrix method.
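The memory saving comes from never materializing the full N x S log-likelihood matrix: the per-observation function builds only a length-S vector at a time. A language-agnostic sketch of that pattern (Python/numpy with a toy normal model and made-up sizes, not the loo API; real loo additionally applies Pareto-smoothed importance sampling per observation):

```python
import numpy as np

def log_mean_exp(v):
    # numerically stable log(mean(exp(v)))
    m = v.max()
    return m + np.log(np.mean(np.exp(v - m)))

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=1000)         # toy data (stand-in for the 600K points)
mu_draws = rng.normal(0.0, 0.05, size=500)  # toy posterior draws of the mean

# "function method" idea: loop over observations, holding only one
# length-S vector of log-likelihoods in memory, never an N x S matrix.
lpd = 0.0
for yi in y:
    loglik_i = -0.5 * np.log(2 * np.pi) - 0.5 * (yi - mu_draws) ** 2
    lpd += log_mean_exp(loglik_i)
print(lpd)
```

Peak memory is O(S) per observation instead of O(N * S), at the cost of an explicit loop over the data.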


#3

Ben’s suggestion is good, too. Here are a couple of other suggestions.

Take a smaller random sample of the data points (whatever size is fast enough for you), compute the log likelihood for those, compute elpd_loo for this smaller random sample, and use the usual statistical inference to estimate what elpd_loo would be for the whole n = 600K. We use this kind of approach successfully in projpred to speed up computation when n is large.

You can also use fewer iterations (for example, by thinning), but check N_eff; I recommend having N_eff > 1000 for PSIS-LOO.
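Thinning here just means keeping every k-th posterior draw (row of the log-likelihood matrix) before running loo, then verifying the effective sample size of what remains. A minimal sketch with toy sizes (generic numpy, not the loo API):

```python
import numpy as np

rng = np.random.default_rng(2)
# toy log-likelihood matrix: 10 chains x 2000 draws = 20K rows,
# 100 observations (stand-in for the 600K in the thread)
log_lik = rng.normal(size=(20_000, 100))

# keep every 4th draw -> 5000 draws; after thinning, check that the
# effective sample size (N_eff) of the retained draws is still > 1000
k = 4
log_lik_thinned = log_lik[::k, :]
print(log_lik_thinned.shape)
```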


#4

Thanks @bgoodri and @avehtari.
Without rerunning the model with fewer iterations, can I take a smaller random sample of the posterior draws, estimate the log likelihood on the full data set, and compare the two models?


#5

You would have to manually throw away draws, which just makes the estimates less precise. Use the loo.function method.


#6

Okay, thanks! I will try it that way.