Calculating the Pareto k diagnostic for datasets that don't fit into RAM

We have a huge dataset with 1.6 million observations, and we fit three different models, each with four chains of 750 iterations. So for each model we have a 3000 x 1.6M loglik_posterior matrix, which we saved from each run. Reloading these matrices into R takes around 50 GB of RAM, but we have access to a computational cluster, so that is no big deal.

The problem arises when we use this matrix as input to the loo package to compute LOO-IC and, more importantly, the Pareto k diagnostic values. We keep getting OOM errors no matter how much RAM we request; we can go up to 96 GB, but even that doesn't work.

We calculated WAIC using our own code in R, but we'd like to check the k-values before we trust those numbers.

As I explored the loo package, I realized it doesn't matter whether I provide the full matrix or only a smaller number of columns as input: the k-value for each observation turns out to be the same.

See below, using the example from the loo package.

> LLmat <- example_loglik_matrix()
> sub <- 10:15
> 
> l1 <- loo(LLmat)
Warning message:
Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and 
MCSE estimates will be over-optimistic. 
> 
> l2 <- loo(LLmat[,sub])
Warning message:
Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and 
MCSE estimates will be over-optimistic. 
> 
> pareto_k_values(l1)[sub]
[1] -0.11439294 -0.04461812 -0.06883883 -0.08868121  0.07186615  0.34633282
> 
> pareto_k_values(l2)
[1] -0.11439294 -0.04461812 -0.06883883 -0.08868121  0.07186615  0.34633282

To resolve the OOM problem, I plan to run a loop that feeds in smaller chunks of columns instead of all 1.6M at once, storing the k-values at each iteration.

# d is the loglik_posterior matrix, 3000 x 1663076

k_values <- rep(NA_real_, ncol(d))

# chunk start columns, plus ncol(d) + 1 as an end marker for the last chunk
loc <- c(seq(1, ncol(d), 1000), ncol(d) + 1)

for (i in 1:(length(loc) - 1)) {
  cols <- loc[i]:(loc[i + 1] - 1)
  # drop = FALSE keeps a single-column chunk as a matrix rather than a vector
  loo_ <- loo(d[, cols, drop = FALSE])
  k_values[cols] <- pareto_k_values(loo_)
  print(i)  # progress
}
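
One refinement I'm considering: the warnings above appear because I don't pass r_eff. Since relative_eff() also works column by column, I assume it can be computed per chunk too. A sketch, assuming the 3000 rows are stacked chain by chain (four chains of 750 draws each):

# chain labels for the 3000 draws, assuming rows are ordered chain by chain
chain_id <- rep(1:4, each = 750)

for (i in 1:(length(loc) - 1)) {
  cols <- loc[i]:(loc[i + 1] - 1)
  ll   <- d[, cols, drop = FALSE]
  # relative_eff() expects likelihood values, hence exp() of the log-lik
  reff <- relative_eff(exp(ll), chain_id = chain_id)
  loo_ <- loo(ll, r_eff = reff)
  k_values[cols] <- pareto_k_values(loo_)
}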

Is there any problem with this approach?

For just checking the k diagnostic, it shouldn't be a problem to do it in a loop like this. Computing the full ELPD requires summing over all the observations, but computing k alone doesn't.
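
If you also want the full ELPD from the same loop, the pointwise contributions are additive over observations, so a running sum works. A minimal sketch along those lines, reusing d and loc from above (loo objects store their per-observation results in a pointwise matrix with an elpd_loo column):

elpd_total <- 0

for (i in 1:(length(loc) - 1)) {
  cols <- loc[i]:(loc[i + 1] - 1)
  loo_ <- loo(d[, cols, drop = FALSE])
  k_values[cols] <- pareto_k_values(loo_)
  # accumulate this chunk's pointwise elpd contributions
  elpd_total <- elpd_total + sum(loo_$pointwise[, "elpd_loo"])
}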

You might also be interested in the vignette Using Leave-one-out cross-validation for large data • loo, which uses a subsampling approach to compute LOO for big data.
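
The loo_subsample() function described there sidesteps the giant matrix entirely: instead of a precomputed log-likelihood matrix, you pass a function that evaluates the log-likelihood for one observation across all draws, and elpd_loo is estimated from a random subsample of observations. A rough sketch of the shape of that API; llfun, my_data, and posterior_draws here are hypothetical stand-ins for your model:

library(loo)

# hypothetical pointwise log-likelihood: data_i is a one-row data frame,
# draws is a matrix of posterior draws; returns one value per draw
llfun <- function(data_i, draws) {
  dnorm(data_i$y, mean = draws[, "mu"], sd = draws[, "sigma"], log = TRUE)
}

loo_ss <- loo_subsample(
  llfun,
  data = my_data,          # one row per observation (hypothetical)
  draws = posterior_draws, # posterior draws matrix (hypothetical)
  observations = 1000      # size of the observation subsample
)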


Great! Thanks for confirming this.

We have already calculated the lppd and the variance component of WAIC using our own code, handling the memory issues ourselves without the loo package. It was a straightforward computation once we had the posterior log-likelihood matrix. I didn't want to try writing code to calculate the k-values, so I want to use the loo package only for computing those.
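
For anyone who lands here later, the WAIC pieces chunk the same way: lppd_i is a log-mean-exp of the likelihood over draws, and p_waic_i is the variance of the log-likelihood over draws. A minimal sketch of that kind of computation (not our exact code), using matrixStats for a numerically stable column-wise log-sum-exp:

library(matrixStats)

S <- nrow(d)  # number of posterior draws (3000)
lppd   <- numeric(ncol(d))
p_waic <- numeric(ncol(d))

for (i in 1:(length(loc) - 1)) {
  cols <- loc[i]:(loc[i + 1] - 1)
  ll   <- d[, cols, drop = FALSE]
  # lppd_i = log( (1/S) * sum_s exp(ll_si) ), computed stably per column
  lppd[cols]   <- colLogSumExps(ll) - log(S)
  # p_waic_i = variance over draws of the log-likelihood for observation i
  p_waic[cols] <- colVars(ll)
}

waic <- -2 * (sum(lppd) - sum(p_waic))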
