Four questions about information criteria/cross-validation and HMC in relation to a manuscript review

Hi, these questions were originally directed to @avehtari , and he asked me to post to the community at large here so that the answers could be of benefit to multiple people.

Currently, I am in the review process for a manuscript summarizing results for a project I worked on that used Stan (preprint here: https://www.biogeosciences-discuss.net/bg-2020-23/bg-2020-23.pdf). I had some questions I was looking to clarify that were related to the reviewer queries:

  1. I had mentioned and cited the demonstrated stability and reliability issues with AIC and DIC in the manuscript. However, a reviewer wanted to see AIC and DIC compared alongside the LOO and WAIC results. I will mention in my response that I would be hard-pressed to calculate AIC from Stan output, as it requires a maximum likelihood estimate, which is not meaningful in a Bayesian setting with non-uniform priors. With respect to DIC, I was going to say that including DIC for comparison is redundant in the presence of WAIC, because DIC is calculated from the posterior means of the parameters and can be thought of as an approximation of WAIC, with WAIC in turn being an approximation of LOO. Regarding DIC, is that something I can justifiably assert?

  2. I am trying to develop a better intuition for the PSIS-LOO methodology so that I am in a position to answer more detailed questions about its computation. From the 2017 Vehtari et al. paper, my understanding is that importance sampling is what lets us get at p(\theta | y_{-i}): for each posterior draw \theta we can evaluate p(y_i | \theta), and IS reweights the draws from the full posterior p(\theta | y) so that they behave as if they came from p(\theta | y_{-i}) (do correct me if I'm wrong in anything I've said so far). I just wanted a hint on the math steps that lead from Equation (7) in that 2017 paper to Equation (8), which gives the LOO lpd for the held-out point y_i (equations on page 3 here); my rough attempt at that step is sketched just after this list. And for a layman's summary of the function of IS here: is it fair to say that IS lets us estimate quantities under the leave-one-out posterior p(\theta | y_{-i}) using only draws from the full posterior, which we then use to predict the held-out point?

  3. Is the log pseudomarginal likelihood (LPML) that appears in Christensen et al. 2011, “Bayesian Ideas and Data Analysis,” the same as LOO without an overfitting/effective parameter count penalty? (A formula for LPML is also available here: https://www4.stat.ncsu.edu/~reich/st740/ModelComparison2.pdf)

  4. Here is a question unrelated to the ones above. I set adapt_delta to 0.9995 and the initial step size to 0.001 in order to mitigate divergent transitions. One ODE model I used has some parameterization issues that will be refined in future projects, and this definitely contributed to the divergences. Increasing adapt_delta and reducing the step size did not entirely get rid of the divergent transitions, but it did reduce their overall number per run. I chose those values based on some older user reports in the now-deprecated Stan Google Group. One reviewer critiqued these settings as extreme (suggesting they would potentially limit the amount of parameter space being explored) and asked for a literature citation supporting the modification of adapt_delta and step size to reduce divergent transitions. If I have enough posterior samples, would my adapt_delta and step size values still be unforgivably problematic? Also, I did not find anything after some days of Googling, but perhaps I was not using the right keywords: are there any papers that discuss adapt_delta and step size with respect to divergent transitions? I thought some of @betanalpha's writings might, but I did not initially find anything that would let me more formally justify that using a high adapt_delta and low step size does not inhibit parameter space exploration too much.
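Regarding question 2 above, here is my rough attempt at the step from Equation (7) to Equation (8); please point out where it goes wrong, if anywhere. Writing \theta^s, s = 1, ..., S, for the posterior draws, Equation (7) gives the importance ratios

r_i^s = 1 / p(y_i | \theta^s) \propto p(\theta^s | y_{-i}) / p(\theta^s | y),

since p(\theta | y_{-i}) / p(\theta | y) \propto p(y_{-i} | \theta) / p(y | \theta) = 1 / p(y_i | \theta) when the observations are conditionally independent given \theta. Plugging these ratios into the self-normalized importance sampling estimate of E_{p(\theta | y_{-i})}[ p(y_i | \theta) ], with the full posterior as the proposal, gives Equation (8):

\hat{p}(y_i | y_{-i}) = \sum_{s=1}^S r_i^s p(y_i | \theta^s) / \sum_{s=1}^S r_i^s = S / \sum_{s=1}^S 1/p(y_i | \theta^s) = ( (1/S) \sum_{s=1}^S 1/p(y_i | \theta^s) )^{-1},

because each numerator term collapses to r_i^s p(y_i | \theta^s) = 1. The LOO lpd contribution for y_i is then \log \hat{p}(y_i | y_{-i}).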

Thank you all for your help


Yup, AIC requires reducing inferences to a single model configuration, so it isn't appropriate for a Bayesian analysis.

Yup. See for example Section 3.3.3 of [1506.02273] A Unified Treatment of Predictive Model Comparison and the references therein.

The reviewer is strongly misinformed about the role of adapt_delta. I’m not exactly sure which reference would be best because none of them directly address this particular misunderstanding…

If there are no divergences then changing adapt_delta does not change the exploration of the Hamiltonian Markov chains, just how much that exploration costs per iteration. Setting adapt_delta to the default 0.8, which then tunes the step size appropriately, approximately achieves the optimal cost. Setting adapt_delta to larger values pushes the adaptation toward smaller step sizes and hence more expensive numerical Hamiltonian trajectories, but the same exploration. These results are derived in [1411.6669] Optimizing The Integrator Step Size for Hamiltonian Monte Carlo – note that all of the theorems depend on stable integrators and no divergences.

Divergences indicate not only that the Hamiltonian Markov chains are not fully exploring but also that the relationship between adapt_delta and the cost is no longer valid, so the default value of 0.8 is irrelevant. All we can do in that case is increase adapt_delta in the hope of lowering the step size from its unstable value enough that the numerical integrator becomes stable again. The relationship between divergences and exploration is discussed in Sections 5.3 and 6.2 of https://arxiv.org/abs/1701.02434 and more formally in [1601.08057] On the Geometric Ergodicity of Hamiltonian Monte Carlo.
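For what it's worth, raising adapt_delta in RStan is just a sampler control setting; a minimal sketch, with placeholder model and data names:

library(rstan)

# Placeholder model file and data list; only the control settings matter here.
fit <- stan(
  file = "model.stan",
  data = stan_data,
  chains = 4,
  iter = 2000,
  control = list(adapt_delta = 0.99,   # raise the target acceptance rate (default 0.8)
                 max_treedepth = 12)   # optionally allow longer trajectories
)

# Summarize divergences and other HMC diagnostics after sampling
check_hmc_diagnostics(fit)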

I’d be happy to speak with the editor if the reviewer continues to have issues. Good luck!


AIC and DIC should not be considered approximations of WAIC, but rather valid information criteria when the predictions are made from point estimates of the parameters; see A survey of Bayesian predictive methods for model assessment, selection and comparison.

You can also use other approaches to obtain p(\theta | y_{-i}), including MCMC (e.g. by refitting the model in Stan using y_{-i}).
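A rough sketch of what that brute-force refitting could look like in R; the model object, data names, and the log_lik_heldout generated quantity below are hypothetical placeholders:

library(rstan)
library(matrixStats)  # for logSumExp()

# Hypothetical setup: a compiled model `mod` whose data block takes N - 1 training
# points plus one held-out point, and whose generated quantities block computes
# log_lik_heldout = log p(y_heldout | theta).
elpd_i <- numeric(N)
for (i in seq_len(N)) {
  data_i <- list(N = N - 1, y = y[-i], y_heldout = y[i])
  fit_i <- sampling(mod, data = data_i, refresh = 0)
  ll_i <- extract(fit_i, "log_lik_heldout")$log_lik_heldout  # draws of log p(y_i | theta), theta ~ p(theta | y_{-i})
  elpd_i[i] <- logSumExp(ll_i) - log(length(ll_i))           # log of the Monte Carlo mean of p(y_i | theta)
}
elpd_loo_exact <- sum(elpd_i)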

Maybe this BDA Lecture 9.1 PSIS-LOO and K-fold-CV helps? If not, ask again.

It's the same as LOO with the log score, which is called elpd_loo in [1507.04544] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. LOO doesn't need a separate overfitting/effective parameter count penalty.
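In symbols, with posterior draws \theta^s, s = 1, ..., S, the usual Monte Carlo estimates coincide:

CPO_i = \hat{p}(y_i | y_{-i}) = ( (1/S) \sum_{s=1}^S 1/p(y_i | \theta^s) )^{-1},   LPML = \sum_{i=1}^n \log CPO_i,

which is the same quantity that importance sampling LOO estimates and that elpd_loo reports (apart from the smoothing of the importance weights that PSIS adds for stability).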

When I said you could have the questions in one thread, I didn't notice that this one was completely different, but since you tagged Mike, you did get the answer here, too.


Much appreciation for the clarification about divergences, @betanalpha, and for the offer to speak with the editor. Hopefully it doesn't come to that, but I am grateful.

So basically, this means AIC cannot apply to non-nested models like two entirely different ODEs, right? Is that a consequence of it being an MLE-derived calculation, and if so, why does MLE require nested models?

Awesome explanation, thanks a lot for this summary.

No, AIC can be used to compare non-nested models as well. By reducing inferences to a single model configuration I just meant that AIC judges a model by predictions that come from a single data generating process \pi(y; \hat{\theta}(\tilde{y})), where \hat{\theta} is some point estimator informed by the observed data \tilde{y}.

Because they judge predictions, technically all of these information-criterion-like objects can be compared to each other so long as the models are all defined on the same observational space.


Thanks Aki for your explanations. I will watch the lecture you linked to.

So, I calculated LPML manually for comparison, and it differed from the LOO results calculated by the loo package. For example, see the attached figure (the x-axes indicate the model parameters varied in a sensitivity analysis done as part of this project).

What additional calculations could be the source of the gap between the LPML calculated using the elpd_loo formula and the LOO value calculated by the package?

Since the answer from @betanalpha was different for this question, I'm just wondering whether there was a different interpretation between the two of you.

In plot (a) you have the y-axis labeled as LOO, which is not enough information, and based on the scale it's not LOO with the log score, i.e. elpd_loo from the loo package. LOO itself doesn't define which utility or loss is used, so you should also report the utility or loss used in the axis legend.

Can you be more specific about how you calculated LPML?

Can you show the diagnostics for elpd_loo provided by the loo function in the loo package?

We had equivalent answers with different words.

I see. For LOO, I used whatever the default output was. Here is the code I used to calculate LOO/WAIC. In Stan:

generated quantities{
...
log_lik = normal_lpdf(CO2_flux_ratios_vector | CO2_flux_ratios_hat_vector, sigma);
...
}

Subsequently, in R:

library(loo)  # extract_log_lik(), waic(), and loo() come from the loo package

fit_log_lik = extract_log_lik(stan_fit)
stan_fit_waic = waic(fit_log_lik)
stan_fit_loo = loo(fit_log_lik)

To calculate LPML, in Stan, obtaining the CPO_i:

generated quantities{
...
// CPOi is 1 / p(y_i | theta) for each observation, i.e. the inverse of the pointwise normal density
CPOi = sqrt(2 * pi()) * sigma * exp(0.5 * (1 / sigma ^ 2) * square(CO2_flux_ratios_vector - CO2_flux_ratios_hat_vector));
...
}

And subsequently, in R:

# 1 / colMeans(CPOi) is the harmonic-mean estimate of CPO_i over the posterior draws,
# so this sums log(CPO_i) over observations to give LPML
LPML = sum(log(1 / colMeans(stan_fit_ex$CPOi)))

The full code is also attached for more context: AWB_adriana_pools5.stan (8.4 KB), stan_AWB_adriana_pools5b.r (20.9 KB), stan_CON_adriana_pools4pb.r (18.3 KB), CON_adriana_pools4p.stan (6.7 KB).

I had saved the Pareto k diagnostics for my runs on my university's cluster, and unfortunately they were among the things wiped without backup in a hardware failure. I originally reported them in the manuscript, but took them out when earlier feedback from readers suggested that the additional discussion of the diagnostics made the paper less accessible for people without a statistics background, and that reporting the bad diagnostics made a worse case for the approach (even though the diagnostics are a key positive feature of LOO). I do recall that for the model with no divergent transitions in any of the runs (CON), the diagnostics were mostly good/ok, whereas for the model with divergent transitions (AWB), the diagnostics were more often bad/very bad, indicating that the model was mis-specified (I mentioned in the paper that I was going to reparameterize the AWB ODE system in subsequent papers). I certainly intend to insist on including the diagnostics in future publications. In retrospect, I also probably should have tried K-fold cross-validation with k = 10.
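For future reference, a minimal sketch of how I might set up those folds with the loo package helper (N_t being my number of observations; each fold would still require refitting the Stan model on the remaining data):

library(loo)

# Randomly assign each of the N_t observations to one of 10 folds
fold_id <- kfold_split_random(K = 10, N = N_t)
table(fold_id)  # check the fold sizes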

The runs took me a long time on the cluster, and I'm not able to repeat them on my personal device right now due to performance constraints, so I won't be able to recalculate the diagnostics easily. Because of the pandemic, the cluster is not really functional at the moment due to a lack of maintenance staff, so I just have to work with what I have for the reviewer responses.

I did plot a figure of effective parameter count for CON and AWB.


(a) and (c) correspond to LOO, and (b) and (d) correspond to WAIC. The point of this figure in this post is to indicate that I used versions of LOO and WAIC that output an effective parameter count; these were plotted from the p_loo and p_waic values.

To clarify, it all depends on what exactly one is trying to check with these predictive score estimators.

As Aki notes, DIC can be precisely interpreted as a predictive score estimator for predictions arising from \pi(y \mid \mu_{\theta}), where \mu_{\theta} is the posterior mean. In other words it approximately quantifies how good your model predictions are if you only use the posterior mean.

Judging a Bayesian model by only its posterior mean ignores a tremendous amount of information. Most people, especially those arguing that using the entire posterior is better than using any single point estimator, would want to judge a Bayesian model by predictions arising from the entire posterior predictive distribution. The posterior predictive performance score is hard to estimate, but under certain conditions K-fold or leave-one-out cross-validation can provide okay estimators. WAIC provides another estimation method. From this perspective, DIC can also be interpreted as an estimator of the full posterior predictive performance score, albeit an even worse one than WAIC.
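To make the distinction concrete, the two predictive densities being scored are

\log \pi(\tilde{y} \mid \mu_{\theta})   versus   \log \pi(\tilde{y} \mid y) = \log \int \pi(\tilde{y} \mid \theta) \, \pi(\theta \mid y) \, d\theta,

where \tilde{y} here denotes a new observation and \mu_{\theta} = E[\theta \mid y] is the posterior mean: the plug-in predictive density underlying DIC-style evaluation on the left, and the full posterior predictive density targeted by LOO and WAIC on the right.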

DIC is better suited to estimating posterior mean predictive performance scores, but that interpretation is valid only if you are actually claiming to evaluate a model based on the predictions generated from posterior means and nothing else. If you are making claims about the entire posterior predictive performance then DIC becomes a less well-suited estimator.

There is a nice unified picture of all of these predictive performance score estimators that allows us to compare all kinds of fitting approaches across different models, but it requires us to be very explicit about what kind of predictions each method is making so that we can understand what each estimator is trying to estimate. Many are sloppy about this explication, hence all of the confusion in the literature.


This was helpful. Based on your Stan code, you are not computing elpd_loo and WAIC correctly.

To compute loo and waic, instead of

real log_lik;
log_lik = normal_lpdf(CO2_flux_ratios_vector | CO2_flux_ratios_hat_vector, sigma);

you should have

vector[N_t] log_lik;
for (n in 1:N_t) {
  log_lik[n] = normal_lpdf(CO2_flux_ratios_vector[n] | CO2_flux_ratios_hat_vector[n], sigma);
}

which is then the log of the inverse of CPOi. Without the for loop, normal_lpdf returns the sum of the pointwise log_lik values. See the vignette Writing Stan programs for use with the loo package.

This

LPML = sum(log(1 / colMeans(stan_fit_ex$CPOi)))

is the same as importance sampling LOO, and after you change the log_lik computation, this

loo(fit_log_lik, is_method = "sis")

should give the same result up to numerical accuracy. The loo package does the computation in a numerically more stable way than your LPML computation. The default in the loo package is is_method = "psis", which additionally uses Pareto smoothed importance sampling for improved stability (and diagnostics).
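To make the comparison concrete, a minimal sketch, assuming log_lik is now computed pointwise so that extract_log_lik returns a draws-by-observations matrix:

library(loo)

fit_log_lik <- extract_log_lik(stan_fit)   # S x N matrix of log p(y_n | theta^s)

# Manual LPML with harmonic-mean CPO_i, equivalent to the earlier computation
lpml_manual <- sum(log(1 / colMeans(exp(-fit_log_lik))))

loo_sis  <- loo(fit_log_lik, is_method = "sis")  # plain importance sampling; should match lpml_manual
loo_psis <- loo(fit_log_lik)                     # default PSIS: more stable and gives Pareto k diagnostics

c(lpml_manual,
  loo_sis$estimates["elpd_loo", "Estimate"],
  loo_psis$estimates["elpd_loo", "Estimate"])

print(loo_psis)  # includes the Pareto k diagnostic summary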

It would be helpful if you could show the diagnostics for even just one posterior fit (after fixing the log_lik computation), as that will help confirm that the computed result is sensible.


Thank you, @avehtari. This was incredibly helpful and I've learned a lot. It is unfortunate that I did those simulations (in the summer of 2019) before having a chance to read the vignette; I'll definitely be referring to that document in the future. Much appreciation again to you and @betanalpha for your help.


@avehtari With respect to https://aalto.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=50b2e73f-af0a-4715-b627-ab0200ca7bbd, I was also wondering if you could share the link for the lecture preceding that one? Thank you.

All lectures, slides, additional notes, demos, exercises and extra material are available at