Hey,
I have a dataset with continuous predictor variables and a binary predictor variable, from which I would like to predict a binary outcome (i.e., yes/no). This is implemented as a non-centred, robust, hierarchical logistic regression where the lower level is 'trials' and the higher level is 'participants'. I also include interaction terms between the continuous and categorical predictors, along with a single intercept term (one beta coefficient for each participant and predictor, drawn from group-level coefficient distributions). See below for model details.
I have two issues/questions:
- The target event for the binary predictor variable is rare (roughly 6 events out of 60 possible), and the base rate of the binary dependent variable is also low (roughly 10-20%). Additionally, the base rate of the dependent variable is related to the categorical predictor: when the categorical predictor is 'true', the binary DV is true more often (a moderate increase of ~20-30).
My concern is that the low base rate of both the binary predictor variable and the DV may bias the model towards explaining the high-rate "non-events" at the expense of the more theoretically interesting "events". Clearly, in an ideal world the design would be completely balanced (e.g., a 50% base rate), but that is not feasible here due to design constraints.
'Hacky' ways of 'solving' this include modelling only the 'interesting' trials (forgoing the possibility of modelling interactions), which would address the low base rate of the categorical predictor (only 'events' modelled) but not the low DV base rate, or post-hoc sub-sampling of the uninteresting trials; but these and other 'solutions' aren't particularly clean.
What are your thoughts on this? Is it an issue? If so, can it be solved with a different model specification?
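For what it's worth, one small tweak I've considered (just a sketch, and I'm not sure it addresses the bias concern; the 0.15 is only a placeholder for the empirical base rate) is centring the prior for the group-level intercept on the logit of the observed DV base rate rather than on 0, so the prior itself doesn't pull predictions towards a 50% event rate:

model{
  // sketch only: centre the intercept prior on the observed DV base rate
  // (0.15 is a placeholder); leave the slope priors as they were
  betaMU[1] ~ student_t(3, logit(0.15), 1);
  betaMU[2:6] ~ student_t(3, 0, 1);
  // ... rest of the model block unchanged ...
}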
- I would also like to say something about model fit and compare different versions of the model (e.g., with different numbers of predictor variables).
In addition to the above concern (i.e., bias), the most commonly advocated model fit metrics (afaik, e.g., LOO) seem conceptually problematic here. The interest is not so much in model generalisation to any arbitrary data point, but rather in generalisation to new data within participants, given each participant's beta coefficients.
If I were running an MLE-based leave-one-out analysis (which I'm more used to), I would simply fit models at the individual level, leave single trials (or more) out, and assess generalisation within participants.
Additionally, in various places here and elsewhere I've come across warnings about using LOO for hierarchical models and/or using LOO for model selection among similar models (e.g., regression models with different numbers of predictor variables).
Any thoughts / recommendations on the issue of evaluating fit? (I've included a sketch of a per-participant log likelihood after the model code below.)
data{
  int<lower=1> N;                          // number of trials (across all participants)
  array[N] int<lower=0, upper=1> DV;       // binary dependent variable
  array[N] real A;                         // continuous predictor
  array[N] real B;                         // continuous predictor
  array[N] real C;                         // binary (categorical) predictor
  array[N] int<lower=1> vectsubj;          // participant index for each trial
  array[N] real BbyC;                      // precomputed B x C interaction
  array[N] real AbyC;                      // precomputed A x C interaction
  int<lower=1> N_subj;                     // number of participants
}
parameters{
  array[N_subj] vector[6] betaM_raw;       // non-centred participant-level coefficients
  vector[6] betaMU;                        // group-level coefficient means
  vector<lower=0>[6] betaSD;               // group-level coefficient SDs
}
transformed parameters{
  array[N_subj] vector[6] betaM;           // participant-level coefficients
  for (i in 1:N_subj){
    betaM[i] = betaMU + betaSD .* betaM_raw[i];
  }
}
model{
  array[N] real mu;
  betaMU ~ student_t(3, 0, 1);
  betaSD ~ normal(0, 1);
  for (i in 1:N_subj){
    betaM_raw[i] ~ normal(0, 1);
  }
  for (n in 1:N){
    mu[n] = betaM[vectsubj[n], 1]
          + betaM[vectsubj[n], 2] * A[n]
          + betaM[vectsubj[n], 3] * B[n]
          + betaM[vectsubj[n], 4] * C[n]
          + betaM[vectsubj[n], 5] * BbyC[n]
          + betaM[vectsubj[n], 6] * AbyC[n];
  }
  DV ~ bernoulli_logit(mu);
}
generated quantities{
  array[N] real mu;
  array[N] real log_lik;                   // per-trial log likelihood for LOO
  for (n in 1:N){
    mu[n] = betaM[vectsubj[n], 1]
          + betaM[vectsubj[n], 2] * A[n]
          + betaM[vectsubj[n], 3] * B[n]
          + betaM[vectsubj[n], 4] * C[n]
          + betaM[vectsubj[n], 5] * BbyC[n]
          + betaM[vectsubj[n], 6] * AbyC[n];
    log_lik[n] = bernoulli_logit_lpmf(DV[n] | mu[n]);
  }
}
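Relating to the second question: one thing I've been experimenting with (again, just a sketch using the same variables as above) is also accumulating the log likelihood per participant in generated quantities, so the model can be evaluated at the participant level rather than per trial:

generated quantities{
  array[N] real mu;
  array[N] real log_lik;                               // per-trial log likelihood
  // per-participant log likelihood, summed over each participant's trials,
  // for leave-one-participant-out / participant-level K-fold comparisons
  array[N_subj] real log_lik_subj = rep_array(0.0, N_subj);
  for (n in 1:N){
    mu[n] = betaM[vectsubj[n], 1]
          + betaM[vectsubj[n], 2] * A[n]
          + betaM[vectsubj[n], 3] * B[n]
          + betaM[vectsubj[n], 4] * C[n]
          + betaM[vectsubj[n], 5] * BbyC[n]
          + betaM[vectsubj[n], 6] * AbyC[n];
    log_lik[n] = bernoulli_logit_lpmf(DV[n] | mu[n]);
    log_lik_subj[vectsubj[n]] += log_lik[n];
  }
}

The per-trial log_lik still supports ordinary PSIS-LOO, while log_lik_subj gives a participant-level quantity; my understanding is that the Pareto-k diagnostics often fail for such grouped terms when a participant contributes many trials, in which case K-fold cross-validation over whole participants seems to be the usual recommendation.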