Informed prior for regression with new variables

I have two datasets: one with ~3500 individuals and 20 covariates, and another with ~1200 individuals and the same 20 covariates plus some additional ones. Right now our idea is to develop a prediction model using the first dataset, and then evaluate its performance on the second dataset. We’d then like to build a second model to see whether any of the additional variables in the second dataset (which are expensive to measure) are worth investigating in more detail in future datasets. Is there a principled way in Stan/brms to incorporate the information from model 1 on the 20 covariates into model 2? I’m also open to other ideas.

I’m not sure what you’re trying to do here. You can certainly fit a model to one data set and then evaluate it on a held-out data set. Part III of the user’s guide goes over how to do held-out evaluation. But it sounds like what you want to do is compare multiple models based on different subsets of predictors. For that, you need to look into model comparison, which is too involved to describe in an email. It’s described in Gelman et al.'s BDA3 and also in the papers around loo written by Jonah Gabry, Aki Vehtari, et al. This is an area where opinion in the field differs a lot. We generally prefer posterior predictive evals to prior predictive evals, so when comparing model behavior we tend to use cross-validation, which is based on posterior predictive evals, rather than Bayes factors, which are based on prior predictive evals. Bayes factors are super sensitive to priors even when the posterior for the data at hand isn’t.
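To make the point about prior sensitivity concrete, here is a minimal sketch in a toy setting that is not from this thread: a normal mean with known unit noise sd, comparing "mu = 0" against "mu ~ normal(0, tau)". The sample size `n`, the sample mean `ybar`, and the values of `tau` are made-up numbers chosen only for illustration. Because `ybar` is a sufficient statistic here, the Bayes factor can be computed from its marginal density under each model.

```python
import math

def log_normal_pdf(x, var):
    """Log density of a zero-mean normal with variance `var`, evaluated at x."""
    return -0.5 * math.log(2 * math.pi * var) - x * x / (2 * var)

# Hypothetical data summary: n observations with unit noise sd and sample mean ybar.
n, ybar = 100, 0.2

results = []
for tau in (1.0, 10.0, 100.0):
    # Posterior mean of mu under the prior mu ~ normal(0, tau).
    post_mean = ybar * tau**2 / (tau**2 + 1 / n)
    # Bayes factor for "mu ~ normal(0, tau)" against "mu = 0", from the
    # marginal (prior predictive) density of ybar under each model.
    log_bf = log_normal_pdf(ybar, tau**2 + 1 / n) - log_normal_pdf(ybar, 1 / n)
    results.append((tau, post_mean, math.exp(log_bf)))
    print(f"tau={tau:6.1f}  posterior mean={post_mean:.4f}  "
          f"Bayes factor={math.exp(log_bf):.4f}")
```

In this toy case the posterior mean barely moves as the prior widens, while the Bayes factor shrinks by roughly a factor of ten each time `tau` grows tenfold.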

@Bob_Carpenter thanks so much for your response. I don’t think my original post was very clear. In essence, I have a dataset of 3,000 subjects with just standard covariates X1 (say age, BMI, blood pressure) and an outcome. I then have another 1,000 subjects with those same standard covariates, plus some additional covariates X2 that are expensive to measure (say gene expression), as well as the outcome. My idea was to first fit a model regressing y on X1 with the first 3,000 subjects, then use the posterior distributions of the coefficients from this model as the prior for a model regressing y on X1 + X2 using the 1,000 subjects with X2 measured.
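The two-stage idea can be sketched as follows. This is not Stan or brms code. It is a pure-Python toy using a conjugate normal linear regression with known noise sd, so that the stage-1 posterior is exactly normal and can be passed on as the stage-2 prior. Everything in it is an assumption made for illustration: the simulated data, the covariate names `age`, `bmi`, `gene`, the true coefficients, the weak prior precision of 0.01, and the absence of an intercept. It also assumes X2 is independent of X1; if they are correlated, the X1 coefficients mean something different once X2 enters the model, and the stage-1 posterior is no longer an obviously appropriate prior.

```python
import random

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def update(prior_prec, prior_mean, X, y, sigma):
    """Conjugate normal update with known noise sd `sigma`.
    Returns the posterior precision matrix and posterior mean."""
    k = len(prior_mean)
    prec = [[prior_prec[i][j] + sum(r[i] * r[j] for r in X) / sigma**2
             for j in range(k)] for i in range(k)]
    rhs = [sum(prior_prec[i][j] * prior_mean[j] for j in range(k))
           + sum(r[i] * yi for r, yi in zip(X, y)) / sigma**2
           for i in range(k)]
    return prec, solve(prec, rhs)

# Simulated (hypothetical) data: standardized covariates, no intercept.
random.seed(1)
beta_age, beta_bmi, beta_gene, sigma = 1.0, -0.5, 0.3, 1.0

def simulate(n):
    rows = []
    for _ in range(n):
        age, bmi, gene = (random.gauss(0, 1) for _ in range(3))
        y = beta_age * age + beta_bmi * bmi + beta_gene * gene + random.gauss(0, sigma)
        rows.append((age, bmi, gene, y))
    return rows

# Stage 1: 3,000 subjects, X1 only. The unmeasured gene effect ends up in the
# residual, so the residual sd is larger than sigma.
d1 = simulate(3000)
X1 = [[a, b] for a, b, _, _ in d1]
y1 = [r[3] for r in d1]
sigma1 = (sigma**2 + beta_gene**2) ** 0.5
weak = [[0.01, 0.0], [0.0, 0.01]]
prec1, mean1 = update(weak, [0.0, 0.0], X1, y1, sigma1)

# Stage 2: 1,000 subjects, X1 + X2. The stage-1 posterior is the prior for the
# X1 coefficients; the new coefficient gets a weak independent prior.
d2 = simulate(1000)
X12 = [[a, b, g] for a, b, g, _ in d2]
y2 = [r[3] for r in d2]
prior_prec = [prec1[0] + [0.0], prec1[1] + [0.0], [0.0, 0.0, 0.01]]
prior_mean = mean1 + [0.0]
prec2, mean2 = update(prior_prec, prior_mean, X12, y2, sigma)

print("stage 1 posterior mean (age, bmi):", [round(m, 3) for m in mean1])
print("stage 2 posterior mean (age, bmi, gene):", [round(m, 3) for m in mean2])
```

With a real Stan or brms fit the posterior is not available in closed form, so one option is to summarize the stage-1 draws by their mean and covariance (or per-coefficient mean and sd) and pass those as a normal prior in the second model. That is an approximation, and it drops whatever the first posterior carries beyond those summaries.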