Can Stan’s optimizing function be used as a computationally pragmatic way of running k-fold cross validation on models that are fitted to very large datasets?
An example:
I have several binomial models that I wish to fit to a dataset containing approximately 420,000 observations. The models vary in complexity: some contain only a couple of parameters, while others also contain multiple random effects and could be estimating as many as 1,000 parameters in total. I've had to increase adapt_delta (to 0.99), lower the stepsize (to 0.1), and raise max_treedepth to 15 to minimise divergent transitions. As you'd expect, the models are somewhat slow, taking from a day to as much as two weeks to complete 2,000 iterations (average tree depth approximately 11 or 12). This is after already using all the usual tricks (non-centered parameterisation, vectorising, etc.).
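For reference, the sampler settings above are passed roughly like this (the model file name and data list are placeholders, not my actual code):

```r
library(rstan)

# Illustrative call only; "binomial_model.stan" and stan_data are placeholders.
fit <- stan(
  file = "binomial_model.stan",
  data = stan_data,
  chains = 4,
  iter = 2000,
  control = list(
    adapt_delta = 0.99,   # up from the default 0.8
    stepsize = 0.1,       # smaller initial step size
    max_treedepth = 15    # up from the default 10
  )
)
```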
The aim is to use the cross validation procedure to choose the "best" predictive model according to the average logloss.
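For concreteness, by logloss I mean (roughly) the negative mean Bernoulli log-likelihood of the held-out observations; a minimal sketch, with the exact form living in the model's generated quantities:

```r
# Bernoulli logloss: negative mean log-likelihood of held-out 0/1 outcomes y
# given predicted probabilities p; eps guards against taking log(0).
logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
```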
My question is this: is there any major problem with simply evaluating logloss (my measure of predictive performance) at the maximum likelihood estimate for each fold, using Stan's optimizing() function?
I understand that this restricts me to point estimates, but given how intensive the computation is, it is an option considered only out of pragmatism and potential statistical naivety. I've run tests where I fitted a model using stan() and again using optimizing(), and the results for logloss and log_gamma were similar. However, the standard deviation for the census random effect seemed to be severely underestimated. I saw a similar pattern (but more severe) when I tried Stan's vb() function.
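Concretely, the per-model workflow I have in mind looks roughly like this (rstan sketch; the model file, dat, and make_stan_data() helper are placeholders, and logloss_heldout is assumed to be computed in the model's generated quantities, which optimizing() evaluates at the returned optimum):

```r
library(rstan)

mod <- stan_model("binomial_model.stan")   # compile once, reuse across folds

K <- 10
fold_id <- sample(rep(1:K, length.out = nrow(dat)))   # dat: full data frame (placeholder)
fold_logloss <- numeric(K)

for (k in 1:K) {
  train   <- dat[fold_id != k, ]
  heldout <- dat[fold_id == k, ]
  stan_data <- make_stan_data(train, heldout)   # placeholder helper building the data list

  # Point estimate (penalised MLE; no Jacobian adjustment by default)
  opt <- optimizing(mod, data = stan_data, algorithm = "LBFGS", seed = 1234)

  fold_logloss[k] <- opt$par["logloss_heldout"]
}

mean(fold_logloss)   # average held-out logloss for this model
```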
OPTIMIZING RESULTS (took about 1 minute)
fit variable
1 -3.522142e+00 log_gamma
2 6.936385e-08 sigma_log_census_err
3 1.000000e+00 census_err[1]
4 1.000000e+00 census_err[2]
5 1.000000e+00 census_err[3]
6 3.973856e-01 logloss_heldout
STAN FIT (took about 48 hours)
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
log_gamma -3.4034 0.0442 0.5280 -3.9855 -3.6131 -3.5024 -3.3448 -1.9677 143 1.0223
sigma_log_census_err 0.5318 0.0443 0.6758 0.1126 0.2106 0.3335 0.5427 2.3883 233 1.0171
census_err[1] 1.1612 0.0294 0.4148 0.2530 1.0065 1.1733 1.3132 1.8968 198 1.0098
census_err[2] 0.9306 0.0236 0.3324 0.2018 0.8032 0.9416 1.0513 1.5309 198 1.0098
census_err[3] 0.8000 0.0203 0.2863 0.1741 0.6900 0.8077 0.9047 1.3145 199 1.0097
logloss_heldout 0.3954 0.0000 0.0000 0.3954 0.3954 0.3954 0.3954 0.3955 2357 0.9994
lp__ -152130.1249 0.1006 1.9768 -152134.7349 -152131.1996 -152129.8003 -152128.7560 -152127.1008 386 1.0170
Based on these results, I'm guessing that I shouldn't rely on the optimizing function to accurately estimate logloss, and that the approach isn't useful for hierarchical models (i.e. those with random effects) because of the problem of estimating their variances.
If this is not the approach to take, and one wants to undertake true cross validation rather than approximate it, are there faster alternatives that don't involve reducing the dataset? Does Stan still plan to implement RHMC anytime soon?
Note: the plan would still be to use full Bayes (i.e. stan()) on the "best" model for examining parameter effects and general model inference.