Hi guys, is there a canonical or recommended way that I can produce something like WAIC or PSIS-Loo statistics when using the optimizing function? Any help here would be appreciated.
The recommended approach is described in Bayesian leave-one-out cross-validation for large data. Github version of rstanarm supports this already (there has been some serious problems to get new rstanarm cran release). rstan will possibly get the support during the autumn. We’ll make vignettes, but meanwhile you can check the code in rstanarm.
thanks @avehtari, I had a look at your psis-loo function on github. Say, I’ve fitted some simple model using optimizing and L-BFGS. It produces some MLE point values for \theta that I treat as my posterior parameters. For each point y in my dataset I calculate log(p(y\mid \theta)), the array of which is I feed into your psisloo function.
Is that right? I presume the procedure will be the same for the psis-loo for big data from a user’s perspective?
See also the paper I linked. Point estimate is not enough for PSIS-LOO, you need draws from the approximate posterior.
Hey thanks for the response. I had a read of the paper, and the general bits about WAIC in BDA3. It seems to me that these methods cannot be used with the optimizing function because it’s a point MLE rather than a posterior distribution.
Could you confirm that is correct? If so, is there a recommended approach when using optimizing (say with BFGS)? If not, would you mind just briefly describing how to produce the data needed for PSIS-Loo from MLE estimates?
It seems to me that k-fold CV with log posterior sum is my best option?
What are you trying to achieve? If you want to evaluate the performance of a model on withdrawn data from the MAP solution, you could use k-fold cross-validation, as long as the MAP solution is obtained in each training set. If you also want to get an indication of influence of each observation (something that WAIC and PSIS-LOO can give you), then I really think you need the full posterior and not only a point estimate.
Thanks @mcol I have a k-mixture model fitted with BFGS and I’d like to decide the value of k by comparing the results for k=1,2,…10. Stan doesn’t support discrete parameters so I can’t optimise for k. It would seem my only options are to use AIC for in-sample, but I fear the structure of my model will make AIC very inaccurate, or k-fold but with a sizeable computation cost.
Do you have any advise?
rstanarm with algorithm=‘optimizing’ Hessian at the mode is computed and used to form a Gaussian approximation of the posterior and thus it’s not just a point estimate. In this it is possible to use PSIS-LOO. This same could be done with
rstan, and the code showing how to do it can be found in the
rstanarm repo. The reason why
rstan doesn’t compute a Gaussian approximation with optimizing by default is that there are cases where optimizing is used for general optimization and also because it’s quite likely that the posterior is not well approximated by Gaussian. In case of
rstanarm, Gaussian approximation is computed only for models for which we know that Gaussian approximation is likely to be good if there is enough data (and we have diagnostics to check when this is true). Since you asked how to produce WAIC or PSIS-LOO, I assumed you want the Bayesian solution which is what I answered.
If you think MLE (ie no priors) is good solution, then it’s likely that Gaussian approximation would be good, too, and you could do better than just a point estimate.
If instead of WAIC or PSIS-LOO (which are for Bayesian case with posterior or posterior approximation), you want just LOO for point estimate, you could use, for example, A Swiss Army Infinitesimal Jackknife and autodiff to compute LOO approximation. I don’t know if there is code available.
My recommendation is to compute Hessian and Gaussian approximation and use PSIS-LOO.
Depends whether 1) Gaussian approximation works well, or 2) you can implement the Gaussian approximation or have time to wait that someone else implements it.
Note that unless you know the true shape of the mixture components, then k is not identifiable. If k is not identifiable you could just use the largest k which is computationally feasible, from which we get to the problems of computation for mixture models as discussed, e.g., in @betanalpha’s case study
Thank you kindly. That’s everything that I needed. I really appreciate the help.