Minimising k-hat

Hi Stanimals,

I was wondering whether anything (experiments/heuristics/theory) speaks against using an optimiser to minimise \hat{k}, for example in combination with ADVI by automatically adjusting parameters like tol_rel and eta. The alternative would be to tune these parameters systematically/manually (i.e. decreasing them) until \hat{k} drops below, say, 0.7 for the first time (I think I remember @avehtari mentioning this to me).

Short experiment: Ignoring, for now, that there are other ways to tune the parameters below, I came up with the following toy example (not ADVI):

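library(loo)      # psis()
library(ggplot2)  # histogram

mu_true <- pi
sigma_true <- 1/sqrt(2)
N <- 10000

# k-hat of the importance ratios between the target N(mu_true, sigma_true)
# and the trial proposal N(mu_trial, sigma_true)
target_func <- function(x) {
  mu_trial <- x[1]
  samps <- rnorm(N, mu_trial, sigma_true)
  log_ratios <- dnorm(samps, mu_true, sigma_true, log = TRUE) -
    dnorm(samps, mu_trial, sigma_true, log = TRUE)
  psis(log_ratios = log_ratios, r_eff = NA, cores = 2)$diagnostics$pareto_k
}

# 5000 independent BFGS runs, each started at mu_trial = 2
rslts <- purrr::map_dbl(1:5000, ~optim(c(2), target_func, method = "BFGS")$par)

ggplot(data.frame(x = rslts)) +
  geom_histogram(aes(x = x), binwidth = .05, fill = NA, color = "black")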

This produces the following figure

The peak around 2 might be an artefact of my not being careful with the optimiser options (2 is also the starting value passed to optim). Nevertheless, I wanted to check whether something else could cause minimising \hat{k} to go wrong (after all, a low \hat{k} is not a sufficient condition for a good approximation).

Thank you!

target_func is stochastic due to rnorm. BFGS assumes a deterministic target function and is likely to fail for a stochastic target. There are a couple of papers on stochastic BFGS, but I haven’t seen good implementations. You could move rnorm out of target_func to get a deterministic function, but it would have some bias and would likely have a non-smooth gradient.
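As a rough sketch of the "move rnorm out" suggestion: the standard-normal draws can be generated once and then shifted and scaled inside the function, so the same inputs always give the same \hat{k}. The name target_func_fixed, the fixed seed, and the use of suppressWarnings are assumptions made for illustration, not part of the original code. The caveats above still apply: the result is conditional on one particular set of draws, and nothing here makes the function smooth in mu_trial.

```r
library(loo)

mu_true    <- pi
sigma_true <- 1 / sqrt(2)
N          <- 10000

# Draw the standard-normal base variates once, outside the target function.
# The seed is an arbitrary choice, used only to make the sketch reproducible.
set.seed(1)
z <- rnorm(N)

# Hypothetical deterministic variant of target_func: the same z is reused
# for every mu_trial, so the only thing that changes between calls is x.
target_func_fixed <- function(x) {
  mu_trial <- x[1]
  samps <- mu_trial + sigma_true * z
  log_ratios <- dnorm(samps, mu_true, sigma_true, log = TRUE) -
    dnorm(samps, mu_trial, sigma_true, log = TRUE)
  # psis() may warn when k-hat is high; silenced here so the value is just returned
  suppressWarnings(psis(log_ratios, r_eff = NA)$diagnostics$pareto_k)
}

# Two evaluations at the same point now agree exactly
k1 <- target_func_fixed(2)
k2 <- target_func_fixed(2)
identical(k1, k2)
```

Passing target_func_fixed to optim in place of target_func removes the run-to-run noise, but the optimum found then depends on the particular z that was drawn.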

We use iterative moment matching to improve k-hat in "Pushing the Limits of Importance Sampling through Iterative Moment Matching".