LOO-CV for different models and different data

Hi there! I’m using LOO-CV. I know it can be used to compare two different models that fit the same data set, but can it be used to compare two models that fit different data sets?

My Stan model for multivariate linear regression is:

data {
  int<lower=1> K;
  int<lower=0> N;
  matrix[N, K] x;  // data matrix
  vector[N] y;
  int<lower=0> Ntest;
  matrix[Ntest, K] xtest;
  vector[Ntest] ytest;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  vector[N] mu = x * beta;
  beta ~ normal(0., 10.);
  sigma ~ cauchy(0., 10.);
  y ~ normal(mu, sigma);
}
generated quantities {
  // log likelihood of the held-out test points; note that for LOO-CV
  // you would instead evaluate normal_lpdf(y[i] | ...) for the N
  // training points
  vector[Ntest] logLikelihood;
  vector[Ntest] mu = xtest * beta;
  for (i in 1:Ntest) {
    logLikelihood[i] = normal_lpdf(ytest[i] | mu[i], sigma);
  }
}
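As a side note, the logLikelihood values above are per-draw log densities of the test points, so comparing models on a common test set amounts to computing the log pointwise predictive density. A minimal Python sketch, assuming the draws have already been extracted into a NumPy array of shape (draws, Ntest) — the array here is synthetic, just to show the shapes and the stable log-mean-exp computation:

```python
import numpy as np

# Hypothetical log-likelihood matrix: S posterior draws x Ntest test points.
# In practice this would come from the fitted Stan model's "logLikelihood"
# generated quantity (e.g. via your Stan interface's draw-extraction method).
rng = np.random.default_rng(0)
log_lik = rng.normal(loc=-1.0, scale=0.3, size=(4000, 25))

# Pointwise log predictive density: log of the posterior-mean density,
# computed stably as logsumexp over draws minus log(S).
S = log_lik.shape[0]
lpd_pointwise = np.logaddexp.reduce(log_lik, axis=0) - np.log(S)

# Summed over test points: the quantity to compare across models that
# predict the *same* held-out y values.
elpd_test = lpd_pointwise.sum()
print(elpd_test)
```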

I have different variables X_1, X_2, X_3, X_4, … and I want to predict the value Z using combinations of two variables X_i, X_j. I obtain better predictions for the values of Z using calibrations of the type X_i = f(X_j, Z) = a + bX_j + cX_j^2 + dZ + eX_j Z and then isolating the Z value, instead of using calibrations like Z = g(X_i, X_j).

The problem is then that, since I’m using different dependent variables in the different calibrations (different X_i), it makes no sense to compare them using LOO-CV. But would it be OK to compare different calibrations of the type Z = g(X_i, X_j), where each calibration uses different independent variables but all of them have the same dependent variable Z? For example, comparing the models Z = g(X_1, X_2), Z = g(X_3, X_4), Z = g(X_1, X_3), etc.

Looking forward to your advice.

I don’t understand “calibrations of the type” and “then isolating the Z value”. Can you elaborate?

First you wrote “I want to predict the value Z using combinations of two variables X_i, X_j”. If the predicted Z is always the same, then you can use LOO-CV. But I don’t know what you mean by calibrations, so if you clarify that, I can provide better advice.

I’m interested in estimating the value of Z given two variables X_i, X_j, for example X_1 and X_2. The model is much faster, and I obtain better predictions for Z, if I use the model X_1 = a + bX_2 + cX_2^2 + dZ + eX_2 Z to estimate the parameters a, b, c, d, e and then predict Z as Z = (X_1 - a - bX_2 - cX_2^2)/(d + eX_2), instead of using a model Z = f(X_1, X_2), even when I put the expression Z = (X_1 - a - bX_2 - cX_2^2)/(d + eX_2) directly in the model block in Stan. In other words, I predict Z better if I use X_1 as y in the data block, with x a matrix containing X_2, X_2^2, Z, …, instead of using Z directly as y in the data block.
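To make the two-step procedure above concrete, here is a small Python sketch with ordinary least squares standing in for the Bayesian fit in Stan (the simulated data and true parameter values are made up purely for illustration):

```python
import numpy as np

# Simulate data from the assumed calibration form with noise.
rng = np.random.default_rng(1)
n = 200
x2 = rng.uniform(0.0, 2.0, n)
z = rng.uniform(0.0, 5.0, n)
a, b, c, d, e = 0.5, 1.0, -0.2, 2.0, 0.3
x1 = a + b * x2 + c * x2**2 + d * z + e * x2 * z + rng.normal(0.0, 0.05, n)

# Step 1: regress X_1 on (1, X_2, X_2^2, Z, X_2*Z), treating X_1 as y.
design = np.column_stack([np.ones(n), x2, x2**2, z, x2 * z])
ah, bh, ch, dh, eh = np.linalg.lstsq(design, x1, rcond=None)[0]

# Step 2: invert the fitted equation to predict Z from (X_1, X_2).
z_hat = (x1 - ah - bh * x2 - ch * x2**2) / (dh + eh * x2)

mae = np.abs(z_hat - z).mean()
print(mae)  # small here, since the data were simulated from the model form
```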

The problem is then that I obtain different models, for example (following the form above):

X_1 = a + bX_2 + cX_2^2 + dZ + eX_2 Z
X_3 = a + bX_4 + cX_4^2 + dZ + eX_4 Z

where the parameters a, b, c, d, e are different in each model. Since the dependent variable is different in each model (X_1, X_3, …), I cannot compare them directly using LOO-CV.

Keeping in mind that I only use these models to estimate the parameters a, b, c, d, e, and that I then estimate the values of Z like

Z = (X_i - a - bX_j - cX_j^2)/(d + eX_j),

is there any way I can compare the different predicted values of Z from these models with LOO-CV? Is there any other, better criterion to compare them?

Thank you for your attention and time.

This seems to be more complicated than what can be answered briefly.

What do you mean by faster model? How do you evaluate the predictions?

By “estimate” do you mean sample from the posterior, and use those posterior draws to predict?

There is no context for the equations you are using, and thus it’s difficult to comment on the sensibility of the approach. Also, as you are using a normal distribution for the residual, there is no guarantee that it would be a good model in both directions. There is no information on the scale of the Xs and Z, and thus it’s not clear whether the priors are sensible in both directions.

I don’t think this is a correct Bayesian way, but you can use cross-validation (including LOO-CV) for an ad hoc approach, too. You just have to have the cross-validation around both steps, which is easiest if you do K-fold-CV and just repeat the whole process for each fold.
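A sketch of what K-fold-CV wrapped around both steps could look like, again with ordinary least squares standing in for the Stan fit and with synthetic data; the point is only that the calibration is re-fit on each training fold and Z is predicted for the held-out fold:

```python
import numpy as np

# Synthetic data, simulated from the calibration form for illustration.
rng = np.random.default_rng(2)
n, k_folds = 200, 10
x2 = rng.uniform(0.0, 2.0, n)
z = rng.uniform(0.0, 5.0, n)
x1 = 0.5 + x2 - 0.2 * x2**2 + 2.0 * z + 0.3 * x2 * z + rng.normal(0.0, 0.05, n)

folds = np.arange(n) % k_folds
errors = []
for f in range(k_folds):
    train, test = folds != f, folds == f
    # Step 1 on the training fold only: fit X_1 = a+bX_2+cX_2^2+dZ+eX_2*Z.
    design = np.column_stack([np.ones(train.sum()), x2[train], x2[train]**2,
                              z[train], x2[train] * z[train]])
    a, b, c, d, e = np.linalg.lstsq(design, x1[train], rcond=None)[0]
    # Step 2 on the held-out fold: invert the fitted equation to predict Z.
    z_hat = (x1[test] - a - b * x2[test] - c * x2[test]**2) / (d + e * x2[test])
    errors.append(np.mean((z_hat - z[test])**2))

rmse = np.sqrt(np.mean(errors))
print(rmse)  # cross-validated RMSE for Z
```

Squared error is used here as an ad hoc utility; with the full Bayesian fit you could instead score the held-out Z values with their log predictive density per fold.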

Technically the above answers your question, but if you want, you can tell us more about your modeling problem and we can try to come up with better models. One way to proceed would be: 1) make posterior predictive checking plots for all the models you have, and check whether the normal assumption is sensible; 2) compare priors and posteriors for potential prior-data conflicts; 3) explain why this specific equation is used.