I am working on longitudinal ordinal data where each subject is followed for about 450 days (though the panel is unbalanced), and I have about 6000 subjects. I am fitting an ordinal probit model using brms. The model is specified in brms as follows:

base_mode <- brm(
  formula = formula[[g]],
  data = data1,
  family = cumulative("probit"),
  sample_prior = "yes",
  prior = prior[[g]],
  chains = 4, cores = 4, iter = 2000,
  seed = seed1, autocor = NULL, inits = 0,
  control = list(adapt_delta = 0.95),
  algorithm = "sampling"
)

I want to follow a ground-up modelling approach where I first determine the functional form of each covariate (linear/smooth) and then include them in the model one at a time, using the functional form determined in the previous step. I would like to base my model comparison on "loo", but due to the size of my data I could not compute loo (it ran on a 64 GB machine for a week and was killed by the system). I followed the discussion here (Error in computing WAIC for a big dataset) but could not understand the suggestion by Prof. Aki:

"Because of the size of your data, you probably need to write log_lik computation outside of Stan, so that you can compute log_lik values for a smaller set of data and then loop over the whole data."

I feel like this is what I have to do, but I need a more detailed description of this suggestion. Any help please?

Do you have one observation per day or something else? What is your model in time?

How many covariates do you have? If you have many covariates compared to the number of observations, this kind of forward selection may overfit (Piironen and Vehtari, 2017).

I think brms does log_lik computation outside of Stan. @paul.buerkner can confirm this and comment if there is a way to make this use less memory.

For this many observations, you could also compute a random sample of the loo rounds and use those to compute an estimate for the full loo. This is something the projpred package is already using. For the brms/loo packages this requires a bit of coding, so I also created an issue.
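Looping over the whole data in chunks, as in the quoted advice, might look roughly like the sketch below. Since elpd_loo is a sum of pointwise contributions, the per-chunk estimates can be accumulated; chunk_size is an arbitrary choice, and base_mode is the fitted model object from the call above:

```r
library(loo)

n <- nrow(data1)
chunk_size <- 500  # arbitrary; pick what fits in memory
elpd_total <- 0

for (start in seq(1, n, by = chunk_size)) {
  rows <- start:min(start + chunk_size - 1, n)
  # pointwise log-likelihood for this chunk only (draws x chunk_size matrix)
  ll <- log_lik(base_mode, newdata = data1[rows, ])
  elpd_total <- elpd_total + loo(ll)$estimates["elpd_loo", "Estimate"]
  rm(ll); gc()  # release the chunk before the next iteration
}
```

Note that this only accumulates the point estimate; standard errors would need the pointwise elpd values to be stored as well.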

Do you have one observation per day or something else? What is your model in time?

Yes, I have one observation per day for each individual. Each individual has at most 450 and a minimum of 2 observations. The response is pain level, which is measured on an ordinal scale.

How many covariates do you have? If you have many covariates compared to the number of observations, this kind of forward selection may overfit (Piironen and Vehtari, 2017).

In the current analysis, I have a limited number of covariates (9 to be specific: 2 time-invariant and 7 individual-level time-varying).

I am afraid I might have misunderstood your suggestion. Here is what I tried. I took a random sample from the data I used to fit the model (let's name it sampdata). Then, I computed the log_lik using this sample data as follows:
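Roughly, assuming sampdata is a random subset of rows of data1 and base_mode is the fitted model, the computation might look like:

```r
set.seed(1)
# subsample size of 1000 is purely illustrative
sampdata <- data1[sample(nrow(data1), 1000), ]

# pointwise log-likelihood for the subsample only (draws x 1000 matrix)
ll_samp <- log_lik(base_mode, newdata = sampdata)
loo_samp <- loo(ll_samp)
```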

Looks right. Use the same sample data for each model when comparing the models. You can test this approach with some smaller data for which you can compute the full loo. If your individual elpd_i's do not have a very skewed distribution (which happens in case of model misspecification), and the size of the sample is m, then the subsample loo multiplied by n/m should be a reasonable estimate of the full loo.
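In code, the n/m scaling described above might look like the following (scale_elpd is a hypothetical helper for illustration, not part of the loo package):

```r
# hypothetical helper: scale a subsample elpd estimate to the full data size
scale_elpd <- function(loo_sub, n, m) {
  loo_sub$estimates["elpd_loo", "Estimate"] * n / m
}

# e.g. with a loo object computed on an m-row subsample of data1:
# elpd_full_est <- scale_elpd(loo_samp, n = nrow(data1), m = nrow(sampdata))
```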