I am working on longitudinal ordinal data where each subject is followed for about 450 days (though the panel is unbalanced), and I have about 6000 subjects. I am fitting an ordinal probit model using brms. The model is specified in brms as follows:

base_mode <- brm(
  formula = formula[[g]],
  data = data1,
  family = cumulative("probit"),
  sample_prior = "yes",
  prior = prior[[g]],
  chains = 4, cores = 4, iter = 2000,
  seed = seed1, autocor = NULL, inits = 0,
  control = list(adapt_delta = 0.95),
  algorithm = "sampling"
)

I want to follow a ground-up modelling approach where I first determine the functional form of each covariate (linear/smooth) and then include them in the model one at a time, using the functional form determined in the previous step. I would like to base my model comparison on "loo", but due to the size of my data I could not compute loo (it ran on a 64 GB machine for a week and was killed by the system). I followed the discussion here (Error in computing WAIC for a big dataset) but could not understand the suggestion by Prof. Aki:

"Because of the size of your data, you probably need to write log_lik computation outside of Stan, so that you can compute log_lik values for a smaller set of data and then loop over the whole data."

I feel like this is what I have to do, but I need a more detailed description of this suggestion. Any help please?

Do you have one observation per day or something else? What is your model in time?

How many covariates do you have? If you have many covariates compared to the number of observations, this kind of forward selection may overfit (Piironen and Vehtari, 2017).

I think brms does log_lik computation outside of Stan. @paul.buerkner can confirm this and comment if there is a way to make this use less memory.

For this many observations, you could also compute a random sample of the loo rounds and use those to compute an estimate for the full loo. This is something the projpred package is already using. For the brms/loo packages this requires a bit of coding, so I also created an issue.
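Looping over the whole data in chunks, as in the quoted advice, might look roughly like the sketch below. Since elpd_loo is a sum of pointwise contributions, the per-chunk estimates can be accumulated; chunk_size is an arbitrary choice, and base_mode is the fitted model object from the call above:

```r
library(loo)

n <- nrow(data1)
chunk_size <- 500  # arbitrary; pick what fits in memory
elpd_total <- 0

for (start in seq(1, n, by = chunk_size)) {
  rows <- start:min(start + chunk_size - 1, n)
  # pointwise log-likelihood for this chunk only (draws x chunk_size matrix)
  ll <- log_lik(base_mode, newdata = data1[rows, ])
  elpd_total <- elpd_total + loo(ll)$estimates["elpd_loo", "Estimate"]
  rm(ll); gc()  # release the chunk before the next iteration
}
```

Note that this only accumulates the point estimate; standard errors would need the pointwise elpd values to be stored as well.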

Do you have one observation per day or something else? What is your model in time?

Yes, I have one observation per day for each individual. Each individual has at most 450 and a minimum of 2 observations. The response is pain level, which is measured on an ordinal scale.

How many covariates do you have? If you have many covariates compared to the number of observations, this kind of forward selection may overfit (Piironen and Vehtari, 2017).

In the current analysis, I have a limited number of covariates (9 to be specific: 2 time-invariant and 7 individual-level time-varying).

I am afraid I might have misunderstood your suggestion. Here is what I tried. I took a random sample from the data I used to fit the model (let's name it sampdata). Then, I computed the log_lik using this sample data as follows:
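Roughly, assuming sampdata is a random subset of rows of data1 and base_mode is the fitted model, the computation might look like:

```r
set.seed(1)
# subsample size of 1000 is purely illustrative
sampdata <- data1[sample(nrow(data1), 1000), ]

# pointwise log-likelihood for the subsample only (draws x 1000 matrix)
ll_samp <- log_lik(base_mode, newdata = sampdata)
loo_samp <- loo(ll_samp)
```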

Looks right. Use the same sample data for each model when comparing the models. You can test this approach with some smaller data for which you can compute the full loo. If your individual elpd_i's do not have a very skewed distribution (which happens in case of model misspecification), and the size of the sample is m, then the subsample loo multiplied by n/m should be a reasonable estimate of the full loo.
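In code, the n/m scaling described above might look like the following (scale_elpd is a hypothetical helper for illustration, not part of the loo package):

```r
# hypothetical helper: scale a subsample elpd estimate to the full data size
scale_elpd <- function(loo_sub, n, m) {
  loo_sub$estimates["elpd_loo", "Estimate"] * n / m
}

# e.g. with a loo object computed on an m-row subsample of data1:
# elpd_full_est <- scale_elpd(loo_samp, n = nrow(data1), m = nrow(sampdata))
```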