Loo 2.0 with very large stanfit crashes R

loo

#1

I have a very large Stan multi-state mark recapture model (N = 200,000+). I calculate log-likelihoods for each individual by summing over all possible transitions for those individuals with missing observations. The resulting Stanfit object is very large, approximately 8 GB for 2000 post warmup iterations. I calculate log_lik in the generated quantities block, and the results of loo::extract_log_lik is a 3 GB matrix.

The standard loo methods fail on me with the error:
“Error in serialize(data, node$con) : error writing to connection”

I imagine this is because I’m running out of memory and I ran into similar problems with earlier versions of loo. Previously, I solved this by using the function method with a boringly trivial function in R:

ll_fun <- function(data_i, draws) {
   data_i*draws
}

Where data_i is actually the transposed log_lik matrix, (N x S) and draws is a S x 1 vector of 1’s. The reason I’ve taken this approach is that each individual log-likelihood relies on the relatively small data of an observed K-occasion capture history, but potentially a very large number of parameters (thousands). I have a pretty efficient algorithm to calculate the log-likelihood but it’s easier for me to simply write this as a block in the model section then it is to write a function which takes a large number of arguments. So I don’t have a custom Stan function to calculate the log-likelihoods . I mention this because in a similar thread on this topic, it was suggested that the function method of calculating loo is more memory efficient, but every example I’ve seen relies on taking the data and the draws of the parameters to calculate log_lik in R outside of Stan. I don’t want to do that as the list of matrices of parameters will be almost as large as the matrix of log_lik values, plus I already calculated log_lik and went to the trouble of storing it in memory.

This hack worked with loo prior to version 2.0.0. However, this no longer works with loo 2.0.0. It runs for a bit with my CPU fan making some noise to let me know it’s working real hard. I then get a memory warning from Windows shortly before RStudio crashes. I’m running on a machine with 8 cores and 64 GB of RAM.

So my question is how do I hack the new version of loo to operate row-wise on the matrix of log_lik extracted from stanfit when memory is an issue?


#2

This whole “calculate log_lik in generated quantities thing” is for situations where RAM is not an issue. The foo.function method is the more general way of doing it, which is used by rstanarm and brms.

You can write a function that takes data_i, creates a new empty stanfit object that is instantiated with only the i-th observation worth of data, call the constrain_parameters function for each of the S draws, and return that vector.


#3

Unfortunately, I haven’t saved all of the transformed parameters I would need. Looks like I’ll have to restructure the code and rerun. I see now with this example that the method I was using previously with loo prior to 2.0.0 will not work with the current loo. Here is the test I used:

# Simulate data and draw from posterior
N <- 50; K <- 10; S <- 100; a0 <- 3; b0 <- 2
p <- rbeta(1, a0, b0)
y <- rbinom(N, size = K, prob = p)
a <- a0 + sum(y); b <- b0 + N * K - sum(y)
fake_posterior <- as.matrix(rbeta(S, a, b))
dim(fake_posterior) # S x 1
fake_data <- data.frame(y,K)
dim(fake_data) # N x 2

llfun <- function(data_i, draws) {
  # each time called internally within loo the arguments will be equal to:
  # data_i: ith row of fake_data (fake_data[i,, drop=FALSE])
  # draws: entire fake_posterior matrix
  dbinom(data_i$y, size = data_i$K, prob = draws, log = TRUE)
}

log_lik_matrix <- sapply(1:N, function(i) {
  llfun(data_i = fake_data[i,, drop=FALSE], draws = fake_posterior)
})

ll_fun <- function(data_i, draws) {
  data_i*draws
}

loo(ll_fun, data = t(log_lik_matrix), draws = rep(1, S))

Which the last statement returns this error:

Warning: Not enough tail samples to fit the generalized Pareto distribution in some or all columns of matrix 
of log importance ratios. Skipping the following columns: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ... [90 more not printed].

Error in log_weights[, n] : incorrect number of dimensions
In addition: Warning messages:
1: Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and 
MCSE estimates will be over-optimistic. 
2: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

It’s not clear to me why this result occurs, but clearly I need to dig deeper into the loo documentation.


#4

I was able to write a function in R that calculates the log-likelihood from the base parameters. The base parameter list is a 6.3 GB R object which consist of 2500 posterior draws of several parameter vectors and matrices. This is a for a multi-state mark-recapture model with a very large number of states. Ultimately, the likelihood boils down to a giant multinomial model, but I use several transition probability matrices to store the transformed parameters because this makes for far more convenient coding. Those details shouldn’t matter except to point out that it a relatively large number of parameters required to calculate a log-likelihood.

I am able to successfully get loo_i to run using my custom function and I’ve verified it calculates the same log_lik values as I previously calculated in the generated quantities block.

However, attempting to run loo.function fails for me as surely as loo.matrix. I once again get the error code:

Error in serialize(data, node$con) : error writing to connection

Any advice? I was able to previously get loo estimates for this model with loo prior to 2.0.0.


#5

I would try it without using multiple cores in order to see what the real error message is.


#6

I was able to get it to run finally. I’m not sure if switching to a single core did the trick or if it was making sure to clear my environment of all unnecessary objects to free up RAM. With two stanfit objects of 15 GB and then extracted parameters draws of around 6.5 GB each in the environment, I was using up more than half of my RAM aleady. I removed the stanfit objects afters extracting what I needed and then tried again. It definitely gets tricky running loo with bigger objects.