Memory issues when computing LOO/WAIC

Hi!

Like several other people on this forum I am running into memory issues when trying to compute LOO/WAIC for a large dataset. My data has roughly 160,000 observations. The model has three grouping levels, one with 15 categories, one with 16 categories, and one with 240 (15 * 16) categories, along with 5-30 population-level parameters. I am running the model with the usual four chains of 2,000 iterations each and discard 1,000 as warmup. I am trying to compute LOO/WAIC for model comparison on a machine with 50 GB of memory. Given that I consistently run out of memory before the calculation finishes, my question is this: is there a rule of thumb for how much memory I would need to compute these quantities? I could try to find a machine with more memory. Alternatively, is there a way to calculate the quantities on a subset of the samples? This option seems a little more hackish, but I’m running out of ideas. Any help would be greatly appreciated.

Here’s a vignette showing how to calculate loo for large data: Using Leave-one-out cross-validation for large data • loo
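As a partial answer to the rule-of-thumb part of the question: the dominant object is the dense S × N matrix of pointwise log-likelihoods stored as doubles, and loo works with several copies of it. A back-of-envelope sketch using the numbers in the question (a rough lower bound, not an exact figure):

```r
S <- 4 * 1000       # post-warmup draws: 4 chains x (2000 - 1000) iterations
N <- 160000         # observations
bytes_per_copy <- S * N * 8    # doubles are 8 bytes each
bytes_per_copy / 1024^3        # ~4.8 GiB for one copy of the log_lik matrix
# PSIS-LOO makes several working copies, so plan for a multiple of this.
```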


@ssp3nc3r already linked to the vignette (which also lists the papers showing that properly made subset LOO is justified and not hacky).

How many observations do you have per group (i.e., per unique category combination)? If that number is large, and the posterior does not change much when you leave out just one observation out of 160,000, it is possible you don’t even need LOO (or WAIC). On average you have about 531 observations per parameter, which is a lot, and about 666 observations for each of the 240 categories, which is also a lot, but it matters whether some categories have much fewer observations.
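For reference, the per-parameter counts quoted above work out as follows (taking the upper end of the 5-30 population-level parameters; these are just the thread’s figures, not computed from the actual data):

```r
n_obs <- 160000
n_group <- 15 + 16 + 240   # group-level categories at the three levels
n_pop <- 30                # upper end of the population-level parameter count
n_obs / (n_group + n_pop)  # ~531.6 observations per parameter
n_obs / 240                # ~666.7 observations per finest-level category
```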


Hi.

I’ve stumbled upon this thread due to an error previously discussed here.

I’ve looked at the vignette you’ve mentioned, but feel like my case would be difficult to adapt for loo_subsample.

I’m working with a drift-diffusion model which has coefficients varying by participant and another set of coefficients varying by item. The data supplied to the Stan model is not rectangular, but JSON-like (for example, the participant-level data have N entries, while item-level data have M entries).

Looking at the vignette, my current understanding is that a data.frame-like object is to be supplied to the log-likelihood function so that its rows can be subset. If so, in my case this would mean restructuring the data into a rectangular shape, because the simple row indexing used for subsetting would not work with my current data set. This is rather inconvenient.

I was wondering whether it is possible to do the subsetting within the generated quantities block of the Stan model instead? For example, by calculating the log-likelihoods only for every N-th entry, setting the others to NaN or something like that, and running the normal loo function afterwards?

I tried looking through the source code of the loo_subsample function, but there’s a lot going on and I couldn’t follow it confidently, so I can’t tell whether this approach is too naive.

Just for context, I’m working with a data set of some 120k observations, obtained from some 100 participants for approximately 5,000 items. I tried running the LOO calculation on a supercomputer I have access to, with 1 TB of RAM. This setup produced the error mentioned in the post linked at the beginning.

loo doesn’t support having NaNs.

If I understood correctly, loo_subsample() is complicated in your case. The loo package has a non-exported function srs_diff_est which implements the difference estimator (Eq 7 in the Leave-One-Out Cross-Validation for Bayesian Model Comparison in Large Data paper). It might be easier to use this function directly. I made an issue. Meanwhile, you can use that function, which I copy here:

#' Difference estimation using SRS-WOR sampling (Magnusson et al., 2020)
#' @noRd
#' @param y_approx Approximated values of all observations.
#' @param y The values observed.
#' @param y_idx The index of `y` in `y_approx`.
#' @return A list with estimates.
srs_diff_est <- function(y_approx, y, y_idx) {
  checkmate::assert_numeric(y_approx)
  checkmate::assert_numeric(y, max.len = length(y_approx))
  checkmate::assert_integerish(y_idx, len = length(y))

  N <- length(y_approx)
  m <- length(y)
  y_approx_m <- y_approx[y_idx]

  e_i <- y - y_approx_m
  t_pi_tilde <- sum(y_approx)
  t_pi2_tilde <- sum(y_approx^2)
  t_e <- N * mean(e_i)
  t_hat_epsilon <- N * mean(y^2 - y_approx_m^2)

  est_list <- list(m = length(y), N = N)
  # eq (7)
  est_list$y_hat <- t_pi_tilde + t_e
  # eq (8)
  est_list$v_y_hat <- N^2 * (1 - m / N) * var(e_i) / m
  # eq (9) first row second `+` should be `-`
  # Supplementary material eq (6) has this correct
  # Here the variance is for sum, while in the paper the variance is for mean
  # which explains the proportional difference of 1/N
  est_list$hat_v_y <- (t_pi2_tilde + t_hat_epsilon) - # a (has been checked)
    (1/N) * (t_e^2 - est_list$v_y_hat + 2 * t_pi_tilde * est_list$y_hat - t_pi_tilde^2) # b
  est_list
}

You would first compute the desired approximate value for all observations and put those in y_approx. Then, for a random subset, you would compute the more accurate estimates and put those in y. Finally, y_idx, which has the same length as y, tells which entries of y_approx the more accurate values correspond to.
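To make that concrete, here is a small sketch with simulated numbers (the checkmate input checks are dropped so it runs stand-alone; the data are made up, not from a real model):

```r
# Same computation as srs_diff_est above, minus the checkmate input checks
srs_diff_est_plain <- function(y_approx, y, y_idx) {
  N <- length(y_approx)
  m <- length(y)
  y_approx_m <- y_approx[y_idx]
  e_i <- y - y_approx_m
  t_pi_tilde <- sum(y_approx)
  t_e <- N * mean(e_i)
  t_hat_epsilon <- N * mean(y^2 - y_approx_m^2)
  est <- list(m = m, N = N)
  est$y_hat <- t_pi_tilde + t_e                    # eq (7): estimated total
  est$v_y_hat <- N^2 * (1 - m / N) * var(e_i) / m  # eq (8): its variance
  est$hat_v_y <- (sum(y_approx^2) + t_hat_epsilon) -
    (1 / N) * (t_e^2 - est$v_y_hat + 2 * t_pi_tilde * est$y_hat - t_pi_tilde^2)
  est
}

set.seed(1)
N <- 1000                                     # all observations
y_approx <- rnorm(N, mean = -1, sd = 0.1)     # cheap approximation for everyone
y_idx <- sort(sample.int(N, 100))             # random subset of 100 observations
y <- y_approx[y_idx] + rnorm(100, sd = 0.01)  # accurate values for the subset
est <- srs_diff_est_plain(y_approx, y, y_idx)
est$y_hat   # estimate of the sum over all N observations
```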

Hi!

Thank you for responding in such detail. I have refactored my code in the meantime so that I can work with the loo_subsample function as-is. Still, a way around the rectangular-data requirement would be convenient, since non-rectangular data may be common for hierarchical models (?).

If I may ask: you’ve said that “loo doesn’t support having NaNs.” Does this mean that the approach could work in principle, for example if the NaNs were filtered out? Or is it completely off?

Yes, you can filter the NaNs out and get a loo object of reduced size (that is, one having pointwise values only for those observations that had no NaNs).
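A minimal sketch of that filtering, assuming log_lik is the S × N draws-by-observations matrix extracted from the fit (simulated here), with all-NaN columns for the observations skipped in generated quantities:

```r
set.seed(1)
S <- 400; N <- 10
log_lik <- matrix(rnorm(S * N, mean = -1, sd = 0.2), nrow = S, ncol = N)
log_lik[, seq(2, N, by = 2)] <- NaN   # pretend every 2nd obs was skipped in Stan

keep <- colSums(is.nan(log_lik)) == 0          # columns with no NaNs
log_lik_sub <- log_lik[, keep, drop = FALSE]   # reduced S x (N/2) matrix

# Run loo on the reduced matrix (only if the loo package is available):
if (requireNamespace("loo", quietly = TRUE)) {
  r_eff <- loo::relative_eff(exp(log_lik_sub), chain_id = rep(1:4, each = 100))
  print(loo::loo(log_lik_sub, r_eff = r_eff))
}
```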
