I am working on a longitudinal ordinal data where each subject is followed for about 450 days (though unbalanced) and have about 6000 subjects. I am fitting ordinal probit model using brms. the model is specified in brms as follow
data = data1,
sample_prior = "yes",
prior=prior[[g]],chains = 4,core=4, iter = 2000,
I want to follow a ground up modelling approach where I will first determine the functional forms of my covariate (liner/ smooth) and then include them in model one at a time using the functional form determined from the previous step. I would like to base my model comparison using “loo” but due to the size of the data I have, I couldn’t be able to compute loo (it runs in 64 GB machine for a week and killed by the system). I follow the discussion here (Error in computing WAIC for a big dataset) and but couldn’t be able to understand the suggestion by Prof Aki-
“Because of the size of your data, you probably need to write log_lik computation outside of Stan, so that you can compute log_lik values for a smaller set of data and then loop over the whole data.”
I feel like this is what i have to do but need more detail description about this suggestion. Any help please?
Do you have one observation per day or something else? What is your model in time?
How many covarites do you have? If you have many covariates compared to the number of observations, this kind of forward selection may overfit (Piironen and Vehtari, 2017).
I think brms does log_lik computation outside of Stan. @paul.buerkner can confirm this and comment if there is a way to make this use less memory.
For this many observations, you could also compute random sample of the loo rounds and use those to compute an estimate for full loo. This is something projpred package is already using. For brms/loo packages this requires a bit if coding, so I created also an issue.
I am afraid I might misunderstood your suggestion. Here is what I tried. I took a random sample from the data I used to fit the model (let’s name it sampdata). Then, I compute the log_lik using this sample data as follow
Looks right. Use the same sample data for each model when comparing the models. You can test this approach with some smaller data for which you can compute the full loo. If your individual elpd_i’s do not have very skewed distribution (happens in case of model misspecification) then, and the size of the sample is m, then the subsample loo multiplied by n/m should be a reasonable estimate for full loo.