Extracting the log-likelihood from a large stanfit object using the loo package

loo

#1

Hi everyone,

I am a beginner with the rstan and loo packages, and I have a quick question about loo. I want to compute loo and waic for model comparison. In my Stan code, I define a variable called log_lik in the generated quantities block for the loo computation (although it is not a parameter). I made it a vector because I use long-format data, which lets me drop missing responses.

generated quantities {
  vector[N] log_lik;
  real deviance;
  for (n in 1:N)
    log_lik[n] = pcm(y[n], theta[pp[n]], to_vector(delta[ii[n]]));
  deviance = sum(-2 * log_lik);
}

In my current model (an IRT model with crossed random effects), I simulated data for 500 persons and 60 items (N = 30,000 data points in total) and fit the model with 4 chains of 1000 iterations each. The resulting stanfit object is large (482.2 Mb). Because I define the log_lik variable in the Stan code, the object is much larger than I expected.

The log_lik array in the output has dimensions 500 (post-warmup draws) x 4 (chains) x 30,000 (data points), which is huge. I tried to extract the pointwise log-likelihood values from the Stan output, but it failed with the error below (with both merge_chains = FALSE and merge_chains = TRUE).

log_lik <- loo::extract_log_lik(stanfit, "log_lik")
Error: cannot allocate vector of size 457.8 Mb
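
The size in the error message matches what one dense copy of the log_lik draws needs in double precision, so the failure is consistent with R being unable to allocate another copy of that array. A quick check of the arithmetic, assuming 8-byte doubles and the dimensions described above:

```r
# Dimensions taken from the post: 500 post-warmup draws per chain,
# 4 chains, and 30,000 observations.
n_draws_per_chain <- 500
n_chains <- 4
n_obs <- 30000

# Assumption: each log_lik value is stored as an 8-byte R double.
bytes_per_double <- 8
size_bytes <- n_draws_per_chain * n_chains * n_obs * bytes_per_double

# R reports "Mb" in units of 1024^2 bytes.
size_mb <- size_bytes / 1024^2
print(round(size_mb, 1))
```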

  1. Is there a way to handle such a large stanfit object when computing loo and waic?

  2. Is it okay to compute loo and waic from the log-likelihood values of just 1 or 2 chains rather than all 4 chains?

  3. If so, how can I extract the values for 1 or 2 chains from the Stan output and compute loo?

  4. After calculating loo and waic, I want to remove the log-likelihood draws from the output object to reduce the file size. Is there a simple way to remove them?

Thanks for any help in advance! Have a great weekend.

Best,
JinHo


#2

From http://mc-stan.org/loo/reference/loo.html

for models fit to very large datasets the loo.function method is more memory efficient and may be preferable.

function: A function f() that takes arguments data_i and draws and returns a vector containing the log-likelihood for a single observation i evaluated at each posterior draw. The function should be written such that, for each observation i in 1:N, evaluating f(data_i = data[i,, drop=FALSE], draws = draws) results in a vector of length S (size of posterior sample). The log-likelihood function can also have additional arguments but data_i and draws are required.


#3

Thank you so much for your quick answer!

I have looked at the example for the loo.function method in the current loo package documentation. However, in that example the draws argument (“fake_posterior”) is simulated with R functions rather than taken from a fitted model. Is there an example that uses Stan model output?

I am wondering how to extract all posterior draws of the relevant parameters (those needed to compute the log-likelihood) from the stanfit object. What should the draws object look like? Perhaps it has to be a list, since the parameters have different dimensions.

Many thanks!


#4

If you want to use Stan functions, you need to define them in the functions block and then use rstan::expose_stan_functions. If you call as.matrix or as.data.frame on the stanfit object, you will get an object with one row per retained draw and one column per scalar quantity, so it does not need to be a list.
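
As a rough sketch of how these pieces could fit together, the block below writes the pointwise log-likelihood in plain R instead of exposing the Stan function. The names pcm_lpmf_r and llfun are hypothetical, and the likelihood is a common partial credit parameterization that may differ from the poster's pcm. It also assumes responses are coded 0..K and that delta is an items-by-steps matrix, so its draws appear in columns named delta[i,k].

```r
# Hypothetical R version of a partial credit log-probability.
# y: response coded 0..K (assumption); theta: length-S vector of ability draws;
# delta: S x K matrix of step-parameter draws for one item.
pcm_lpmf_r <- function(y, theta, delta) {
  eta <- cbind(0, theta - delta)        # S x (K + 1), first category fixed at 0
  cum <- t(apply(eta, 1, cumsum))       # cumulative sums over categories, per draw
  # Numerically stable log-sum-exp over categories, one value per draw
  lse <- apply(cum, 1, function(v) max(v) + log(sum(exp(v - max(v)))))
  cum[, y + 1] - lse
}

# Function in the form expected by the loo function method:
# data_i is one row of the long-format data, draws is a draws-by-parameters matrix.
llfun <- function(data_i, draws, ...) {
  theta <- draws[, paste0("theta[", data_i$pp, "]")]
  # Columns delta[i,1], delta[i,2], ... for the item of this observation
  delta_cols <- grep(paste0("^delta\\[", data_i$ii, ","), colnames(draws))
  delta <- draws[, delta_cols, drop = FALSE]
  pcm_lpmf_r(data_i$y, theta, delta)
}

# Usage with a fitted model (not run here). Assumes stanfit, y, pp and ii
# exist in the workspace with the same meaning as in the Stan program:
# draws   <- as.matrix(stanfit, pars = c("theta", "delta"))
# dat     <- data.frame(y = y, pp = pp, ii = ii)
# loo_res <- loo::loo(llfun, data = dat, draws = draws)
```

Because the function is evaluated one observation at a time, only a vector of length S is needed per observation rather than the full draws-by-observations array, which is where the memory saving comes from.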


#5

Thank you for your help.
The function method works well. Although computing loo takes considerable time with this many data points, it is much more memory efficient, so I finally got the results!