Is there a way to conditionally define parts of the data block, so that an optional “test set” could be supplied and predictions for it generated automatically from a model fit to the “training set”?
Granted, I am fully aware that from a Bayesian perspective it’s probably better to use all of your data to fit a model, but often the first question I get when presenting results is “did you validate this on a test set?”. So, out of curiosity, I was wondering whether it would be feasible to allow my Stan models to take an optional “test set”, while still being able to fit the same model to all of my data, without having to maintain two separate versions of whatever regression model I happen to be fitting.
The code below does not work, since the data block doesn’t seem to allow conditional statements, but I think it demonstrates what I have in mind. Are there any workarounds or “hacks” that might make such a model feasible?
data {
  int N; //the number of observations
  int P; //the number of columns in the model matrix
  real y[N]; //the response
  matrix[N,P] X; //the model matrix
  int test_set; //switch to indicate whether a test set is provided
  if (test_set)
    int Ntest;
    vector[Ntest] yTest;
    matrix[Ntest,P] Xtest;
}
parameters {
  real Intercept;
  vector[P] beta; //the regression parameters
  real<lower=0> sigma;
}
model{
  Intercept ~ normal(0, 1);
  beta ~ student_t(3, 0, 2);
  sigma ~ student_t(3, 0, 10);
  y ~ normal(Intercept + X*beta, sigma);
}
generated quantities{
  vector[N] ySim;
  for(i in 1:N){
    ySim[i] = normal_rng(X[i] * beta + Intercept, sigma);
  }
  if (test_set)
    vector[Ntest] yTest;
    for(i in 1:Ntest){
      yTest[i] = normal_rng(Xtest[i] * beta + Intercept, sigma);
    }
}
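For comparison, here is a minimal sketch of one possible workaround, not necessarily the recommended one. It relies on Stan accepting zero-size containers: instead of a `test_set` switch, the test set is always declared, and `Ntest = 0` with a zero-row `Xtest` is passed when there is no test set. The name `yPred` is introduced here for illustration, so that the predictions do not clash with an observed `yTest`. The observed test response is left out of the data block, since generating predictions does not need it. `y` is declared as a `vector` rather than a real array, which is a change from the code above.

```stan
data {
  int<lower=0> N;          // number of training observations
  int<lower=0> P;          // number of columns in the model matrix
  vector[N] y;             // training response
  matrix[N, P] X;          // training model matrix
  int<lower=0> Ntest;      // number of test observations; pass 0 for no test set
  matrix[Ntest, P] Xtest;  // test model matrix; has zero rows when Ntest == 0
}
parameters {
  real Intercept;
  vector[P] beta;          // the regression parameters
  real<lower=0> sigma;
}
model {
  Intercept ~ normal(0, 1);
  beta ~ student_t(3, 0, 2);
  sigma ~ student_t(3, 0, 10);
  y ~ normal(Intercept + X * beta, sigma);
}
generated quantities {
  vector[N] ySim;          // posterior predictive draws for the training rows
  vector[Ntest] yPred;     // predictive draws for the test rows; empty when Ntest == 0
  for (i in 1:N) {
    ySim[i] = normal_rng(X[i] * beta + Intercept, sigma);
  }
  for (i in 1:Ntest) {     // 1:0 is an empty range, so this loop is skipped
    yPred[i] = normal_rng(Xtest[i] * beta + Intercept, sigma);
  }
}
```

With this layout the same model file serves both cases: when fitting to all of the data, the caller supplies `Ntest = 0` and a matrix with 0 rows and `P` columns for `Xtest`, and `yPred` simply comes back empty.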