K-fold CV in linear regression model with varying intercepts and very small clusters


I’m working on model comparison with K-fold CV, but I cannot figure out how to handle some varying intercepts. A little background: I’m modeling conflict with parents among depressed adolescents, with various forms of parent psychopathology as predictors. I also model the outcome variable via a 2pl IRT model. Below is the Stan code for the model without predictors, as set up for K-fold CV. As far as I can tell it’s working as it should. In simulation it recovers parameters reasonably well. I could share a simulated dataset if someone is curious.

As the unit of observation is the individual parent, I have included a varying intercept per adolescent. That means the size of the clusters is between 1 and 2. The problem I can’t figure out is whether or not to include the varying intercepts in the calculation of the pointwise log-likelihood of the held-out data (as I am currently doing). When one of these tiny clusters is held out from fitting the model, their varying intercept posterior is simply samples from the prior, after all. Wouldn’t that make the elpd unreasonably low, as the varying intercept without any data is injecting noise into the predictions?

As I’m writing this, I’m starting to think that the varying intercepts are only important for the fitting process, to handle the non-independent observations when estimating, and that checking model fit to the held-out data could be done without the varying intercepts. But I’m a clinical psychologist with pretty sparse formal matemathical/statistical training, so these are just my intuitions. While I’ve had a pretty steep learning curve lately, I don’t trust my intuition just yet.

Another thing I’ve asked myself is whether I should make the varying intercepts conditional on there being two parents observed, as well. But that doesn’t really change anything as far as the K-fold CV problem is concerned.


  int<lower=1> N;
  int n_id;
  int<lower=1> id[N];
  int<lower=0, upper=1> mother[N];
  int cbq_items;
  // index vector for cases with no missing data on CBQ
  int ind_obs_cbq[N]; 
  int nobs_cbq;
  // index vector for cases with single items missing on CBQ
  int ind_mis_cbq[N];
  int nmis_cbq;
   // index vector for cases with no response to CBQ
  int ind_no_cbq[N];
  int nno_cbq;
  // matrix of observed complete CBQ data
  int<lower=0, upper=1> cbq_obs[nobs_cbq,cbq_items]; 
  // matrix of observed CBQ data with missing items
  int<lower=-1, upper=1> cbq_mis[nmis_cbq,cbq_items];
  int kfold_n; //number of cases the model is fitted to in this fold
  int kfold_ind[kfold_n]; //index of cases used for fitting model
  int fold; //current fold
  real a;
  vector[n_id] a_var;
  real<lower=0> sigma;
  vector[cbq_items] cbqbeta;
  vector[cbq_items] cbqbetaM;
  vector<lower=0>[cbq_items] cbqalpha;
  vector[nobs_cbq] cbqtheta_obs;
  vector[nmis_cbq] cbqtheta_mis;
  vector[nno_cbq] cbqtheta_no;
  real<lower=3> nu;
  vector[N] y_raw;
  vector[N] var_a_raw = a_var[id];
  vector[kfold_n] y;
  vector[kfold_n] var_a;
  for(i in 1:N){
    if (ind_obs_cbq[i]>0) 
      y_raw[i] = cbqtheta_obs[ind_obs_cbq[i]];
    else if (ind_mis_cbq[i]>0) 
      y_raw[i] = cbqtheta_mis[ind_mis_cbq[i]];
      y_raw[i] = cbqtheta_no[ind_no_cbq[i]];}
  y = y_raw[kfold_ind];
  var_a = var_a_raw[kfold_ind];
  // priors
  a ~ normal(0,10);
  a_var ~ normal(a,3);
  sigma ~ cauchy(0,1);
  cbqtheta_obs ~ normal(0,2);
  cbqtheta_mis ~ normal(0,2); 
  cbqtheta_no ~ normal(0,2);
    // allowing for item difficulty to vary by parent gender
  cbqbeta ~ normal(0,2);
  cbqbetaM ~ normal(cbqbeta, 1);
  cbqalpha ~ gamma(2,0.5);
  nu ~ gamma(2,0.1);
  // irt model
  for(i in 1:nobs_cbq){
  		for (j in 1:cbq_items){
  			cbq_obs[i,j] ~ bernoulli_logit(cbqalpha[j]*(cbqtheta_obs[i] - cbqbeta[j] + cbqbetaM[j] * mother[i]));	
  for(i in 1:nmis_cbq){
      for(j in 1:cbq_items){
      cbq_mis[i,j] ~ bernoulli_logit(cbqalpha[j]*(cbqtheta_mis[i] - cbqbeta[j] + cbqbetaM[j] * mother[i]));
  // likelihood
 y ~ student_t(nu, a + var_a, sigma);
generated quantities{
vector[N] y;
vector[N] var_a = a_var[id];
vector[N] log_lik;
for(i in 1:N){
    if (ind_obs_cbq[i]>0) 
      y[i] = cbqtheta_obs[ind_obs_cbq[i]];
    else if (ind_mis_cbq[i]>0) 
      y[i] = cbqtheta_mis[ind_mis_cbq[i]];
      y[i] = cbqtheta_no[ind_no_cbq[i]];}
for (n in 1:N) log_lik[n] = student_t_lpdf(y[n] | nu, a + var_a[n], sigma);

Do you want to generalize your inference to other parents than those included in the data set?

Holding out one of these clusters corresponds to idea that you want to test whether your model has learned something which generalizes to parents not in the data set. That prediction can have very high uncertainty, but the model contains something useful if that prediction is better than prior prediction without conditioning on the data from other parents.

With just one or two observations per group each observation is likely to be very highly influential for corresponding group specific parameter, unless the prior for group specific terms is tight. Now it seems this term has quite wide prior a_var ~ normal(a,3); with fixed scale. If n_id is not very small, then it would be better to have a_var ~ normal(a,a_var_sigma); and e.g. real <lower=0> a_var_sigma and a_var_sigma ~ normal(0,1).

Thanks a lot! That cleared it up a lot for me. I guess I had overthought it - thinking clearly and practically about what the model is actually doing is the necessary remedy. I do want to generalize inference to other parents. And as n_id is not very large at 59, I guess it’s not very small either, so I will follow your suggestion of estimating a_var_sigma.

Thanks again, and god jul! (Happy holidays in Norwegian)