# Four questions about information criteria/cross-validation and HMC in relation to a manuscript review

Hi, these questions were originally directed to @avehtari, and he asked me to post them to the community at large here so that the answers could benefit multiple people.

I am currently in the review process for a manuscript summarizing results from a project that used Stan (preprint here: https://www.biogeosciences-discuss.net/bg-2020-23/bg-2020-23.pdf). I have some questions related to the reviewer queries that I would like to clarify:

1. I had mentioned and cited the demonstrated stability and reliability issues with AIC and DIC in the manuscript. However, a reviewer wanted to see AIC and DIC compared alongside the LOO and WAIC results. I will mention in my response that I would be hard-pressed to calculate AIC from Stan output, as it requires a maximum likelihood estimate, which is not what Stan's sampler produces and is not meaningful in a Bayesian setting with non-uniform priors. With respect to DIC, I was going to say that including DIC alongside WAIC is redundant, because DIC is calculated from the posterior mean of the parameters and could be thought of as an approximation of WAIC, with WAIC in turn being an approximation of LOO. Is that something I can justifiably assert about DIC?

2. I am trying to develop a better intuition for the PSIS-LOO methodology so that I can answer more detailed questions about its computation. From the 2017 Vehtari et al. paper and my understanding of importance sampling, IS is necessary to obtain p(\theta | y_{-i}): for each draw \theta, we can evaluate p(y_i | \theta), and IS then lets us reweight draws from the full posterior to approximate p(\theta | y_{-i}) (do correct me if I'm wrong in anything I've said so far). I just wanted a hint on the math steps that lead from Equation (7) in that 2017 paper to Equation (8), which gives the LOO lpd for the held-out point y_i (the equations are on page 3). And for a layman's summary of the function of IS here: IS helps us estimate the leave-one-out posterior, and hence the predictive density for the held-out point, using only draws from the full posterior?

3. Is the log pseudomarginal likelihood (LPML) that appears in Christensen et al. 2011, “Bayesian Ideas and Data Analysis,” the same as LOO without overfitting/effective parameter count penalty? (a formula for LPML is also available here – https://www4.stat.ncsu.edu/~reich/st740/ModelComparison2.pdf)
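Regarding question 2, here is a small numeric sketch of my understanding, using a toy conjugate normal model (not the manuscript's model; all names and numbers here are made up for illustration). If I substitute the ratios r_s = 1/p(y_i | \theta_s) from Equation (7) into the self-normalized estimator of Equation (8), the p(y_i | \theta_s) terms cancel in the numerator, leaving the harmonic mean of the per-draw likelihoods, which can be checked against exact leave-one-out refits:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, sigma = 2.0, 1.0                 # prior sd and (known) observation sd
y = rng.normal(1.0, sigma, size=20)   # toy data
n = len(y)

def norm_logpdf(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)

def post(y_sub):
    """Conjugate posterior mean and variance for theta given a data subset."""
    prec = 1 / tau**2 + len(y_sub) / sigma**2
    return y_sub.sum() / sigma**2 / prec, 1 / prec

# draws from the FULL posterior p(theta | y)
m, v = post(y)
theta = rng.normal(m, np.sqrt(v), size=200_000)

elpd_is, elpd_exact = 0.0, 0.0
for i in range(n):
    log_lik = norm_logpdf(y[i], theta, sigma)       # log p(y_i | theta_s)
    # Eq (7)->(8): with r_s = 1/p(y_i|theta_s) the self-normalized estimate
    # collapses to the harmonic mean of the per-draw likelihoods
    elpd_is += -np.log(np.mean(np.exp(-log_lik)))
    # exact LOO predictive from an analytic refit without point i
    m_i, v_i = post(np.delete(y, i))
    elpd_exact += norm_logpdf(y[i], m_i, np.sqrt(sigma**2 + v_i))

print(elpd_is, elpd_exact)   # the two estimates agree closely
```

(This plain harmonic-mean form is what PSIS then stabilizes by smoothing the largest ratios, as I understand it.)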

Thank you all for your help


Yup, AIC requires reducing inferences to a single model configuration, so it isn't appropriate for a Bayesian analysis.

Yup. See for example Section 3.3.3 of https://arxiv.org/abs/1506.02273 and the references therein.

The reviewer is strongly misinformed about the role of adapt_delta. I’m not exactly sure which reference would be best because none of them directly address this particular misunderstanding…

If there are no divergences then changing adapt_delta does not change the exploration of the Hamiltonian Markov chains, just how much that exploration costs per iteration. The default adapt_delta of 0.8 tunes the step size so that the cost is approximately optimal. Setting adapt_delta to larger values forces the adaptation toward smaller step sizes, and hence more expensive numerical Hamiltonian trajectories, but the same exploration. These results are derived in https://arxiv.org/abs/1411.6669 – note that all of the theorems depend on stable integrators and no divergences.

Divergences indicate not only that the Hamiltonian Markov chains are not fully exploring but also that the relationship between adapt_delta and the cost is no longer valid, so the default value of 0.8 is irrelevant. All we can do in that case is increase adapt_delta to hopefully lower the step size from the unstable value enough that the numerical integrator becomes stable again. The relationship between divergences and exploration is discussed in Sections 5.3 and 6.2 of https://arxiv.org/abs/1701.02434 and more formally in https://arxiv.org/abs/1601.08057.

I’d be happy to speak with the editor if the reviewer continues to have issues. Good luck!


AIC and DIC should not be considered approximations of WAIC, but rather valid information criteria in their own right when predictions are made from point estimates of the parameters; see http://dx.doi.org/10.1214/12-SS102

You can also use other approaches to obtain p(\theta | y_{-i}), including MCMC (e.g., by refitting with Stan using y_{-i}).

Maybe this BDA Lecture 9.1 on PSIS-LOO and K-fold-CV helps? If not, ask again.

It’s the same as LOO with the log score, which is called elpd_loo in https://arxiv.org/abs/1507.04544. LOO doesn’t need a separate overfitting/effective-parameter-count penalty.
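For reference, in the notation of the linked notes, the LPML is

\mathrm{LPML} = \sum_{i=1}^{n} \log \widehat{\mathrm{CPO}}_i, \qquad \widehat{\mathrm{CPO}}_i = \left( \frac{1}{S} \sum_{s=1}^{S} \frac{1}{p(y_i \mid \theta^{(s)})} \right)^{-1},

i.e., each \widehat{\mathrm{CPO}}_i is the harmonic-mean (plain importance sampling) estimate of p(y_i \mid y_{-i}). elpd_loo estimates the same quantity but stabilizes the importance ratios with Pareto smoothing, so the two can differ when the ratios are heavy-tailed.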

When I said you could have the questions in one thread, I didn’t notice that this one was completely different, but since you tagged Mike, you got the answer here, too.


Much appreciation for the clarification about divergences and the offer to speak to the editor, @betanalpha. Hopefully it doesn’t come to that, but I am grateful for your offer.

So basically, this means AIC cannot be applied to non-nested models, like two entirely different ODE systems, right? Is that a consequence of it being an MLE-derived calculation, and if so, why would MLE require nested models?

Awesome explanation, thanks a lot for this summary.

No, AIC can be used to compare non-nested models as well. By reducing inferences to a single model configuration I just meant that AIC judges a model by predictions that come from a single data generating process \pi(y ; \hat{\theta}(\tilde{y})), where \hat{\theta} is some point estimator informed by the observed data \tilde{y}.

Because they judge predictions, all of these information-criterion-like objects can technically be compared to each other so long as the models are all defined on the same observational space.


Thanks Aki for your explanations. I will watch the lecture you linked to.

So, I calculated LPML manually for comparison, and it differed from the LOO results calculated by the loo package. For example (in the attached figure, the x-axes indicate model parameters varied in a sensitivity analysis done as part of this project):

What additional calculations could be the source of the gap between the LPML calculated using the elpd_loo formula and the elpd_loo calculated by the loo package?

Since the answer from @betanalpha was different for this question, just wondering whether there was a different interpretation here between you guys.

In plot (a) the y-axis is labeled "LOO", which is not enough information, and based on the scale it’s not LOO with the log score, i.e., elpd_loo from the loo package. LOO itself doesn’t define which utility or loss is used, so you should report the utility or loss in the axis legend, too.

Can you be more specific about how you calculated LPML?

Can you show the diagnostics for elpd_loo provided by the loo function in the loo package?

I see. For LOO, I used the default output. Here is the code I used to calculate LOO/WAIC. In Stan:

generated quantities{
...
// pointwise log-likelihood: one entry per observation (N is the number of
// observations); note normal_lpdf with vector arguments returns the summed log density
for (i in 1:N) {
  log_lik[i] = normal_lpdf(CO2_flux_ratios_vector[i] | CO2_flux_ratios_hat_vector[i], sigma);
}
...
}


Subsequently, in R:

fit_log_lik = extract_log_lik(stan_fit)
stan_fit_waic = waic(fit_log_lik)
stan_fit_loo = loo(fit_log_lik)


To calculate LPML, in Stan I first computed, for each draw, the inverse likelihood 1/p(y_i | \theta) (the quantity whose posterior mean gives 1/CPO_i):

generated quantities{
...
// per-draw inverse likelihood 1 / p(y_i | theta); CPO_i is the reciprocal of its posterior mean
CPOi = sqrt(2 * pi()) * sigma * exp(0.5 * (1 / sigma ^ 2) * square(CO2_flux_ratios_vector - CO2_flux_ratios_hat_vector));
...
}

And subsequently, in R:

LPML = sum(log(1 / colMeans(stan_fit_ex$CPOi)))


I had saved the Pareto k diagnostics from my runs on my university’s cluster, but unfortunately they were among the things wiped without backup in a hardware failure. I originally reported them in the manuscript, but took them out when earlier feedback from readers suggested that the additional discussion of diagnostics made the paper less accessible to people without a statistics background, and that reporting the bad diagnostics made a worse case for the approach (even though the diagnostics are a key positive feature of LOO).

I do recall that for the model with no divergent transitions in any of the runs (CON), the diagnostics were mostly good/ok, whereas for the model with divergent transitions (AWB), they were mostly bad/very bad, indicating that the model was mis-specified (I mentioned in the paper that I planned to reparameterize the AWB ODE system in subsequent work). I certainly intend to include diagnostic reporting in future publications. In retrospect, I also probably should have tried K-fold CV with K = 10.

The code took a long time to run on the cluster, and I can’t run it on my personal device right now due to performance constraints, so I won’t be able to recalculate the diagnostics easily. Because of the pandemic, the cluster is currently not functional for lack of maintenance staff, so I have to work with what I have for the reviewer responses.

I did plot a figure of effective parameter count for CON and AWB.

(a) and (c) correspond to LOO, and (b) and (d) correspond to WAIC. The point of this figure was to indicate that I used versions of LOO and WAIC that output an effective parameter count; these were plotted from the p_loo and p_waic values.
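For reference, the effective parameter counts computed by the loo package are, in the notation of the PSIS-LOO paper,

p_\mathrm{waic} = \sum_{i=1}^{n} \mathrm{Var}_{s}\left[\log p(y_i \mid \theta^{(s)})\right], \qquad p_\mathrm{loo} = \widehat{\mathrm{lpd}} - \widehat{\mathrm{elpd}}_\mathrm{loo},

where \widehat{\mathrm{lpd}} = \sum_i \log\left( \frac{1}{S} \sum_s p(y_i \mid \theta^{(s)}) \right) is the in-sample log pointwise predictive density.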

To clarify, it all depends on what exactly one is trying to check with these predictive score estimators.

As Aki notes, DIC can be precisely interpreted as a predictive score estimator for predictions arising from \pi(y \mid \mu_{\theta}), where \mu_{\theta} is the posterior mean. In other words it approximately quantifies how good your model predictions are if you only use the posterior mean.

Judging a Bayesian model by only its posterior mean ignores a tremendous amount of information. Most people, especially those arguing that using the entire posterior is better than using any single point estimator, would want to judge a Bayesian model by predictions arising from the entire posterior predictive distribution. The posterior predictive performance score is hard to estimate, but under certain conditions K-fold or leave-one-out cross validation can provide okay estimators. WAIC provides another estimation method. From this perspective DIC can also be interpreted as an estimator of the full posterior predictive performance score, albeit an even worse one than WAIC.
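As a concrete illustration of the distinction (a toy sketch with made-up numbers, not tied to any model in this thread): score a single hypothetical future observation under (i) the plug-in predictive density at the posterior mean, the kind of prediction DIC judges, and (ii) the full posterior predictive density averaged over draws, the kind of prediction elpd_loo and WAIC target.

```python
import numpy as np

rng = np.random.default_rng(1)

def norm_logpdf(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)

# stand-in posterior draws for a location parameter theta (toy values)
theta = rng.normal(0.0, 0.5, size=100_000)
sigma = 1.0      # known observation noise
y_new = 1.5      # a hypothetical future observation

# (i) plug-in predictive log density at the posterior mean (DIC-style target)
lp_plugin = norm_logpdf(y_new, theta.mean(), sigma)

# (ii) full posterior predictive log density, averaging the density over draws
lp_full = np.log(np.mean(np.exp(norm_logpdf(y_new, theta, sigma))))

print(lp_plugin, lp_full)   # the two scores differ: they judge different predictions
```

Here the full posterior predictive spreads probability mass more widely than the plug-in density, so the two scores disagree, which is exactly why an estimator tuned to one target is ill-suited to the other.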

DIC is better suited to estimating posterior-mean predictive performance scores, but that interpretation is valid only if you are actually evaluating a model based on the predictions generated from the posterior mean and nothing else. If you are making claims about the entire posterior predictive distribution, then DIC becomes a less well-suited estimator.

There is a nice unified picture of all of these predictive performance score estimators that allows us to compare all kinds of fitting approaches across different models, but it requires us to be very explicit about what kind of predictions each method is making, so that we can understand what each estimator is trying to estimate. Much of the literature is sloppy about this explication, hence all of the confusion.
