Filesize error in loo

Hi All,

I fitted a behavioral data set with 150 subjects and 700 trials each using a hierarchical RL model in Stan (code). The transformed parameters block gives me a log_lik matrix (subjects x trials), which I then try to use with loo.

rl_fit <- stan(file = my_model,
               data = data_for_stan,
               iter = 2000,
               chains = 4,
               cores = 4,
               save_warmup = FALSE)

loo_1 <- loo(rl_fit,
             pars = "log_lik",
             save_psis = FALSE,
             cores = 4,
             moment_match = FALSE,
             k_threshold = 0.7)

I then get this:

Error: cannot allocate vector of size 3.8 Gb
In addition: Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

Any idea what's going on and how I can fix it?

Many thanks,
Nitzan

I'm by far not the most solid help you might find on this forum, but as your question has gone unanswered for some time: the error seems just to say that an object loo needs to create is larger than what R can allocate in memory. That is usually not a problem, so it might be a sign that something is not right.
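For what it's worth, a back-of-the-envelope calculation with the numbers from your post suggests that a single copy of the pointwise log-likelihood matrix is already a few gigabytes, so intermediate copies inside loo can easily exceed what R can allocate:

n_draws  <- 4 * 1000          # 4 chains, iter = 2000 with the default 1000 warmup
n_points <- 150 * 700         # subjects x trials
n_draws * n_points * 8 / 1e9  # ~3.4 GB for one double-precision copy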

I had a brief look at your code, and I couldn't quite figure out how the log_lik matrix coming out of it is appropriate to use with the loo function. Are you confident that it's the right quantity? Usually it's calculated in the generated quantities block, as the log probability of the observations conditional on the parameters.

Also, what are you trying to use loo for? Estimating the predictive accuracy of your model for a new subject, or something else?

Thank you so much for this. I am trying to use loo for model comparison (at this point only in simulations).
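Roughly, what I have in mind is something like this (where rl_fit_2 stands for a hypothetical fit of an alternative model to the same simulated data):

library(loo)
loo_1 <- loo(rl_fit,   pars = "log_lik", cores = 4)
loo_2 <- loo(rl_fit_2, pars = "log_lik", cores = 4)  # hypothetical second model
loo_compare(loo_1, loo_2)                            # elpd_diff and its standard error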

So on the more technical side, the loo documentation notes that the first argument can be "A log-likelihood array, matrix, or function", so I figured this should be ok... (but obviously I'm doing something wrong).

  1. I am estimating log_lik for N subjects with T trials each, and I was wondering whether it might be appropriate/helpful to aggregate over trials (and then have a vector of length N with the sum of log_lik per subject). Do you have any idea whether this should make a difference for PSIS-LOO estimation? (A rough sketch of what I mean is below, after point 2.)

  2. Yes - I can definitely calculate log_lik in the generated quantities block. But will it make a difference to the actual estimates (other than speeding up the code - which is also very useful)?
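Here is roughly what I mean by point 1. This is only a sketch: the column ordering assumes log_lik is declared as matrix[N_subj, N_trials] in the Stan code, with subjects in rows.

library(loo)

# Trial-level pointwise log-likelihood: draws x (N_subj * N_trials)
ll_trial <- extract_log_lik(rl_fit, parameter_name = "log_lik")

N_subj   <- 150
N_trials <- 700

# Stan flattens matrix[N_subj, N_trials] in column-major order, so the subject
# index cycles fastest across the extracted columns; adjust if log_lik is
# declared the other way around.
subj_of_col <- rep(seq_len(N_subj), times = N_trials)

# Sum the trial-wise log-likelihoods within each subject: one column per
# subject, i.e. the pointwise log-likelihood for leave-one-subject-out
ll_subj <- sapply(seq_len(N_subj), function(s)
  rowSums(ll_trial[, subj_of_col == s, drop = FALSE]))

# Ideally one would also pass r_eff = relative_eff(...); omitted here for brevity
loo_subject <- loo(ll_subj)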

Much appreciated.
Nitzan.

I can't really follow what your model is doing, but as long as you are getting reasonable fits to simulated data and understand it yourself, I guess all is well. And apart from the computational gain, I don't think there is a specific reason to have it in the generated quantities block. Also, a lot of (most?) models aren't set up with the log-likelihood of the data as a parameter.

Anyway, as it stands now, I think you are doing "leave-one-trial-out", which estimates the predictive utility of your model for a new trial, given all the other observations for a subject. That may or may not be what you want. Summing the log-likelihoods over trials would give you leave-one-subject-out, but there are often issues with loo for hierarchical models at the subject level.
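Either way, it is worth inspecting the Pareto k diagnostics that the warning refers to. A quick sketch, assuming loo_1 is the object returned by your loo() call:

library(loo)
pareto_k_table(loo_1)             # how many observations fall in each k range
plot(loo_1, label_points = TRUE)  # flags the observations with high k values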

If you haven't looked at it already, there is a lot of useful information in the Cross-validation FAQ.

Thank you for this. I will try to dig deeper to understand whether it makes sense to do subject-wise loo, which would probably be easier to implement...
Best,
Nitzan