Error in computing WAIC for a big dataset

loo

#1

Hi all,

I am computing WAIC for an IRT model. However, I cannot compute it when I run the model for 5000 iterations because the dataset is big: 13000 observations * 7 items.

This means the run has to store a large matrix of size 91000 * 2500. When I run it on my laptop, Windows asks me to stop because it exceeds the available RAM. When I run it on an HPC (high-performance computing) cluster, I get an error:

Error in vapply(out, "[[", 2L, FUN.VALUE = numeric(1)) :
  values must be length 1,
 but FUN(X[[1]]) result is length 0
Calls: loo -> loo.matrix -> psislw -> vapply
Execution halted

I guess the reason is the size of the matrix, because it works fine when I run with 20 iterations.

Do you have any idea how to overcome this problem? How can I compute WAIC for a big dataset on a small laptop or an HPC with limited RAM?
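
For scale, here is a back-of-the-envelope estimate of that matrix in plain Python. It assumes 8-byte doubles, a single chain, and that the 2500 columns are the post-warmup half of the 5000 iterations; none of this is measured from the actual run, and intermediate copies made during the computation could multiply the figure.

```python
# Rough size of the pointwise log-likelihood matrix (assumptions in comments).
n_obs = 13000 * 7        # 91000 log-likelihood terms, as stated in the post
n_draws = 5000 // 2      # assumption: half of the 5000 iterations are warmup
bytes_per_double = 8     # assumption: values are stored as 8-byte doubles

matrix_bytes = n_obs * n_draws * bytes_per_double
matrix_gib = matrix_bytes / 2**30

print(f"{n_obs} x {n_draws} doubles = {matrix_bytes} bytes ({matrix_gib:.2f} GiB)")
```

A single copy is under 2 GiB, so the trouble on a laptop probably comes from several copies being held at once rather than from one matrix alone.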

Thanks,
Tran.


#2

Hi Tran,

This may be related: https://github.com/stan-dev/loo/issues/35

Have you tried specifying the cores = 1 argument to the loo() function?


#3

Since the dataset is this big, it’s possible that WAIC is something you don’t want to use or don’t need. I’m not familiar with IRT models, but based on the wiki I assume there are several persons and, for each person, several observations. If you have this kind of hierarchical structure, you might be more interested in predicting the performance of a new person than in predicting how an already observed person answers a question. If you don’t have that kind of hierarchy, then it’s likely that you have a very high number of observations per parameter and WAIC is not necessarily needed. If you tell me more about your model and data, I can give further comments.

If the number of observations per parameter is high, you might not need WAIC, so could you tell me how many parameters you have?


#4

Hi @avehtari,

If I were fitting just one model, I might not need to compute WAIC. However, in my case the interest is in comparing different models. As with DIC and some other ICs, the idea is to compute WAIC and use it to select the better model. For the simulation study, I know which model is better, and I would like to show that WAIC chooses the correct one. From that, WAIC is expected to choose the better model in a real analysis as well. That is why I need WAIC.

My dataset has a hierarchical structure, with 13199 individuals and 7 variables per individual, giving a total of 92393 observations. The number of parameters is about 53505 (including the latent variables).

So how can I proceed?

Kind regards,
Tran.


#5

Thanks for the clarification. You do need to use WAIC or cross-validation for model comparison (with a much smaller number of parameters you could have done something simpler).

Did you mean to write 7 observations per individual?

Remember that WAIC corresponds to predicting a new observation for the same individuals. If you want to predict for new individuals, I would recommend using k-fold-CV with, e.g., k=10, keeping the observations from each individual together when dividing the data. k-fold-CV would also be easy to do in your case anyway. If you really want to predict new observations for these same individuals, I would recommend PSIS-LOO instead of WAIC, as it has a diagnostic (Pareto shape khat) that tells you when it fails (and in that case WAIC would fail, too). Because of the size of your data, you probably need to write the log_lik computation outside of Stan, so that you can compute log_lik values for a smaller set of data at a time and then loop over the whole dataset.
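
The idea of looping over the data in smaller sets can be sketched as follows, in plain Python for illustration only. The normal likelihood, the function names and the toy numbers are all hypothetical stand-ins (an IRT model would need its own log-likelihood function), and the sketch computes WAIC rather than PSIS-LOO because WAIC needs only per-observation sums. Only one chunk of log_lik values is held in memory at a time.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def normal_log_lik(y, mu, sigma=1.0):
    """Hypothetical pointwise log-likelihood; replace with the model's own."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2


def waic_in_chunks(y, mu_draws, chunk_size):
    """Accumulate WAIC over chunks of observations.

    Each chunk yields a small block with one row per observation and one
    column per posterior draw, which is reduced and then discarded.
    """
    elpd_waic = 0.0
    p_waic = 0.0
    for start in range(0, len(y), chunk_size):
        chunk = y[start:start + chunk_size]
        block = [[normal_log_lik(y_i, mu) for mu in mu_draws] for y_i in chunk]
        for row in block:
            lpd_i = log_mean_exp(row)   # log pointwise predictive density
            p_i = variance(row)         # sample variance of log_lik over draws
            elpd_waic += lpd_i - p_i
            p_waic += p_i
    return {"elpd_waic": elpd_waic, "p_waic": p_waic, "waic": -2.0 * elpd_waic}


# Toy data and toy posterior draws, made up purely for illustration.
y = [0.1, -0.3, 1.2, 0.7, -1.1]
mu_draws = [-0.2, 0.0, 0.1, 0.3]

result_small_chunks = waic_in_chunks(y, mu_draws, chunk_size=2)
result_one_chunk = waic_in_chunks(y, mu_draws, chunk_size=len(y))
print(result_small_chunks)
```

The chunk size changes only the peak memory use, not the result, so it can be chosen to fit the available RAM.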


#6

Thank you so much for your suggestion!

I am waiting to see whether the supercomputer (HPC) can overcome the memory problem. Otherwise I will follow your suggestion.

Just a minor question: I think that in the loo package a smaller WAIC means a better model (as with other ICs). Is that correct?

Kind regards,
Tran.