Lppd and k-folds CV with multivariate normal and trial-specific parameters

I am trying to do some comparison between two models, one more basic and one a little more complicated. The more basic model is below. But I believe my confusion applies to both models.

Neither of these models seems to work well with existing code for computing model comparison metrics. The loo R package, for example, just gives me warnings when I try to compute loo or waic. I suspect this is because of some unsatisfied assumptions.

So as an alternative, I’m trying to set up my own k-folds cross validation. I’m trying to model a set of 21 randomized trials in education. The modest amount of data makes it practical for me to re-fit my model k times. For k, I’m using T, the number of trials. Some trials had more than one treatment arm, so rather than leaving out each individual treatment estimate, I’m leaving out entire trials. This choice is based on my reading of Chapter 7 of BDA3, in which Gelman and co. advise, for cross validation, divvying up the data into “disjoint, ideally conditionally independent, pieces” (p. 176).
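For concreteness, here’s roughly how I’m building the folds. This is just a sketch: idx_trial is a hypothetical length-N vector mapping each treatment estimate to its trial (it isn’t in the data block below), and each re-fit would use the corresponding subset of y and the matching block of Sigma.

# Leave-one-trial-out folds: every fold drops all findings from one trial.
# idx_trial (hypothetical) maps each of the N treatment estimates to one
# of the T = 21 trials, labeled 1:T.
T_trials <- length(unique(idx_trial))
folds <- lapply(seq_len(T_trials), function(t) {
  list(
    train = which(idx_trial != t),  # findings used to re-fit the model
    test  = which(idx_trial == t)   # findings held out in this fold
  )
})
# Each re-fit then gets y[folds[[t]]$train] and
# Sigma[folds[[t]]$train, folds[[t]]$train, drop = FALSE] as data.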

But as I go to write up my own code to compute a bias-corrected k-folds CV metric, as outlined on p. 175 of BDA3, I realize there are several decisions I’m not sure how to navigate.

Both of my models, including the one below, rely on a multivariate normal likelihood function. To compute lppd, as outlined on p. 169 of BDA3, I was thinking all I have to do is use the log_lik generated quantity I have in my model. This log_lik is of course based on the multivariate normal and makes use of \theta, a vector of true mean parameters that generate the observed vector y of treatment estimates. To get lppd, then, with the vector y treated as just a single observation, all I would need to do, it seems, is exponentiate log_lik draw by draw, take the mean, and then take the log.
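In code, I’m picturing something like this (a sketch assuming fit is the rstan fit of the model below; logSumExp is just a numerically stable way of taking the log of a mean of exponentials):

# lppd with the whole vector y treated as a single observation:
# the log of the average, over posterior draws, of exp(log_lik).
log_lik <- as.vector(rstan::extract(fit, pars = "log_lik")$log_lik)
S <- length(log_lik)
lppd <- matrixStats::logSumExp(log_lik) - log(S)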

But if that’s the right way to compute lppd in this context, it seems in tension with the rest of the terms needed for a bias-adjusted k-folds CV metric. In particular, to get lppd_kfolds-cv, I need to sum up the log mean posterior predictive densities for each trial t, conditioned on a posterior calculated without trial t. Here, I think the right posterior quantity on which to condition these predictive densities is what I call study[j]: the posterior prediction for the effect of a new trial, with a single treatment arm, when all you know before running the trial is the type of study/intervention it will be, and nothing else. (In my Stan model, I use j to index the study/intervention types.)
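Concretely, for each fold I’m picturing something like the sketch below. The names are hypothetical: fit_minus_t is the model re-fit without trial t, y_t and Sigma_t are the held-out trial’s estimates and its block of the covariance matrix, and j_t is its intervention type. Rather than drawing a fresh epsilon per posterior draw the way study[j] does, this integrates the new findings’ epsilons out analytically by adding sigma_epsilon[j]^2 to the diagonal, which I believe is equivalent here.

library(mvtnorm)

draws <- rstan::extract(fit_minus_t)  # posterior draws from the fit without trial t
S <- length(draws$alpha)
n_t <- length(y_t)                    # number of findings in the held-out trial

lpd_s <- numeric(S)
for (s in 1:S) {
  # mean and covariance of y_t for a "new trial of type j_t", at draw s
  mu_s  <- rep(draws$alpha[s] + draws$delta[s, j_t], n_t)
  cov_s <- Sigma_t + diag(draws$sigma_epsilon[s, j_t]^2, n_t)
  lpd_s[s] <- dmvnorm(y_t, mean = mu_s, sigma = cov_s, log = TRUE)
}

# trial t's contribution to lppd_kfolds-cv: log mean predictive density
lpd_t <- matrixStats::logSumExp(lpd_s) - log(S)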

Then of course we need lppd_bar, which involves, among other things, using the posterior computed without trial t to make posterior predictions for each trial, including t. But what posterior quantities do I condition on to make these posterior predictions? Just study[j]? This feels funny because, for all but the treatment estimate(s) from the left-out trial, we have a specific \theta_{ij} that likely does a better predictive job than study[j] alone. Maybe the answer depends on what I’m trying to infer. I am trying to generalize to a larger population, so maybe study[j] is the only relevant piece, not the \theta_{ij}.

To tie all this together, would it make sense to compute lppd treating each trial as a separate observation, computing a posterior predictive density for each trial’s estimates conditioned only on study[j] (with the posterior for study[j] conditioned on every available trial)? And then, for lppd_bar, likewise keep conditioning only on study[j] and sum across the trials?
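If that reading is right, the remaining bookkeeping for the bias adjustment on p. 175 would look roughly like the sketch below. Here lpd_minus is a hypothetical T x T matrix in which lpd_minus[t_out, t] is the log mean predictive density for trial t, computed as in the previous sketch, under the posterior fit without trial t_out; lppd_full is the same per-trial quantity summed over all trials but computed from the full-data fit (i.e., the per-trial lppd proposed just above, not the single-observation version).

T_trials <- nrow(lpd_minus)

# lppd_kfolds-cv: each trial scored under the fit that left it out
lppd_kfold <- sum(diag(lpd_minus))

# lppd_bar: average over folds of the summed log predictive densities
# for all T trials, including the one that was left out of that fit
lppd_bar <- mean(rowSums(lpd_minus))

# bias correction and adjusted k-folds CV estimate
b <- lppd_full - lppd_bar
lppd_ckfold <- lppd_kfold + b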

Or maybe this whole k-folds exercise is at odds with my models? Maybe that’s the real tension? I’m using a multivariate normal likelihood in the first place, with y as one big vector and \Sigma one big covariance matrix, because there likely is some covariance across trials. A few school districts ran more than one trial used in the data. So maybe I should be leaving out entire school districts, one at a time? I’m not sure. Hoping someone can point me towards the light.

data {
	int<lower=1> N; // number of findings (treatment estimates)
	int<lower=1> J; // number of studies (intervention types; trials nested within one of these)
	vector[N] y; // estimated treatment effects
	cov_matrix[N] Sigma; // covariance matrix of ys
	int<lower=1,upper=J> idx_study[N]; // studyid
	real<lower=0> alpha_sd; // prior SD for alpha
	real<lower=0> sigma_delta_sd; // prior SD for study effects
	real<lower=0> sigma_epsilon_sd_sd; // prior SD for heterogeneity in finding-effect variances
}

parameters {
	real alpha; // average adjusted effect size
	real<lower=0> sigma_epsilon_sd; // SD of half normal distribution of sigma_epsilons
	real<lower=0> sigma_delta; // SD of study deviances
	vector[N] epsilon_nc; // non-centered epsilons
	vector[J] delta_nc; // non-centered deltas
	vector<lower=0>[J] sigma_epsilon_nc; // non-centered sigma epsilons
}

transformed parameters {
	vector[N] theta; // recomposed mean of individual finding
	vector[N] epsilon; // finding-level deviation from alpha + delta[study]
	vector[J] delta; // study-level deviation from alpha
	vector<lower=0>[J] sigma_epsilon; // SD of finding deviances, estimated separately for each study

	for (j in 1:J) {
		sigma_epsilon[j] = sigma_epsilon_nc[j] * sigma_epsilon_sd;
		delta[j] = delta_nc[j] * sigma_delta;
	}
	for (i in 1:N) {
		epsilon[i] = epsilon_nc[i] * sigma_epsilon[idx_study[i]];
	}
	theta = alpha + epsilon + delta[idx_study];
}

model {
	alpha ~ normal(0, alpha_sd);

	epsilon_nc ~ normal(0, 1);
	sigma_epsilon_nc ~ normal(0, 1);
	sigma_epsilon_sd ~ normal(0, sigma_epsilon_sd_sd); // alas, not a great name for another variance term

	delta_nc ~ normal(0, 1);
	sigma_delta ~ normal(0, sigma_delta_sd);

	y ~ multi_normal(theta, Sigma);

}

generated quantities {
	vector[N] y_hat; // posterior predictive check
	real study[J]; // posterior predictive effect for a new trial of each intervention type
	real log_lik; // joint log likelihood of the full y vector

	y_hat = multi_normal_rng(theta, Sigma);
	for (j in 1:J) {
	  study[j] = alpha + delta[j] + normal_rng(0, sigma_epsilon[j]); // posterior for a new district trying intervention of type j
	}
	log_lik = multi_normal_lpdf(y | theta, Sigma);
	
}

For more up-to-date recommendations, see CV-FAQ

This is fine, but see also more recent discussion in CV-FAQ: 8 hierarchical models

Computing that bias correction is probably not worth the trouble. See CV-FAQ: 26 commenting on bias correction

See how to do the K-fold CV in the loo vignette Holdout validation and K-fold cross-validation of Stan programs with the loo package.


These links have all been super helpful, thank you for sharing.
