Use derived quantity from one model as response variable in another

Hello! I am wondering about the best way to use a derived quantity from one model as a response variable in a second model (while maintaining the error associated with each value). Specifically, I have a simple linear regression as my first model, and then I take a new dataset and calculate the difference between observed values of y and predicted values of y based on the model parameters. I then want to use these differences as the response variable in a new model: a simple ANOVA-style model where I look at delta y ~ treatment. Is it possible to do this as a joint model in Stan? How would I code it? My current code is below; I added the code for the second model in a second model statement at the bottom. Thank you for any advice!


data {
 int < lower = 1 > N; // Sample size for control regression
 int < lower = 1 > n_treatment; // number of treatments
 int < lower = 1 > n_PlateID; // number of plates
 vector[N] y; // Response
 vector[N] x; // Predictor
 vector[N] x2; // Covariate
 int < lower = 1 > n_resid; // Sample size for residual model
 int<lower = 1, upper = n_treatment> Treatment[n_resid]; // Treatment variable
 vector[n_resid] PlateID; // PlateID for random effect of plate
 vector[n_resid] AllPR; // All final PR
 vector[n_resid] AllInitial; // All initial PR
 vector[n_resid] AllpHmin; // All pHmin
 

}

parameters {
 real alpha; // Intercept
 real beta; // Slope (regression coefficients for x)
 real beta2; // Slope for x2
 real < lower = 0 > sigma; // Error SD
 real < lower = 0 > sigma_resid; // Error SD for residual model
 real beta_resid; // beta for second model
 real alpha_plate; // intercept for second model

}

model {

// conditional mean
vector[N] mu;

// linear combination
 mu = alpha + x * beta + x2*beta2;

// likelihood function
 y ~ normal(mu, sigma);
 }

generated quantities {
vector[N] y_rep;
vector[n_resid] PR_expected;
vector[n_resid] PR_Resid;
vector[n_resid] y_resid;


 for (i in 1:N) {
 // generate predictions
 real y_hat = alpha + x[i] * beta + x2[i] *beta2; // Posterior predictions

 // generate replication values
 y_rep[i] = normal_rng(y_hat, sigma);
} // The posterior predictive distribution

// calculate predicted PR based on pH from plate data
for (i in 1: n_resid){
 PR_expected[i] = alpha + AllpHmin[i] * beta + AllInitial[i]* beta2; // calculate the expected value
 PR_Resid[i] = AllPR[i] - PR_expected[i]; // calculate the residuals
 
}

}

// This is the second model statement where PR_Resid from above is the new y value
model {
// conditional mean
 vector[n_resid] mu_resid;

 // generate model for residuals ~ treatment
 for (i in 1:n_resid) {
  mu_resid[i] = alpha_plate + beta_resid*Treatment[i];
 }

// likelihood
  PR_Resid ~ normal(mu_resid, sigma_resid);

}

Given the current version of your code, you could compute PR_Resid, PR_expected, and mu_resid in the transformed parameters block and include the PR_Resid ~ normal(mu_resid, sigma_resid); statement in the model block. This would model everything jointly.

I am a little unsure about the precise meaning of:

Jointly modelling the quantities will change the posterior for the parameters associated with the first model, so the posterior for the residuals will change compared to fitting the submodels separately (as they should).

I agree with @hhau that this is a way of including uncertainty in parameters in a new model using the results from a previous run. I’d just note that you would be approximating some parameter using summaries of the posterior distribution and mimicking the uncertainty in it (and even if you made a very close approximation, or used a custom empirical distribution that matched the posterior, that would still be the case). So I probably disagree with the statement that “this would model everything jointly”, but in practice it doesn’t make a difference, because that is still the way to go.

On a different note, if you are including the uncertainty in some parameter to do further inference using the same data, you should be aware of the risk of double dipping – using the same data more than once doesn’t necessarily imply double dipping, but it may bias the results in one direction or another if, for instance, both models behave similarly with respect to the data.

Thank you both! I tried moving those parameters into the transformed parameters block as shown below, but now I get the following error.


data {
 int < lower = 1 > N; // Sample size for control regression
 int < lower = 1 > n_treatment; // number of treatments
 int < lower = 1 > n_PlateID; // number of plates
 vector[N] y; // Response
 vector[N] x; // Predictor
 vector[N] x2; // Covariate
 int < lower = 1 > n_resid; // Sample size for residual model
 vector[n_resid] PlateID; // PlateID for random effect of plate
 vector[n_resid] AllPR; // All final PR
 vector[n_resid] AllInitial; // All initial PR
 vector[n_resid] AllpHmin; // All pHmin
 vector[n_resid] Treatment; // Treatment
 

}

parameters {
 real alpha; // Intercept
 real beta; // Slope (regression coefficients for x)
 real beta2; // Slope for x2
 real < lower = 0 > sigma; // Error SD
 real < lower = 0 > sigma_resid; // Error SD for residual model
 real beta_resid;
 real alpha_plate;

}

transformed parameters {
vector[n_resid] PR_expected;
vector[n_resid] PR_Resid;


// calculate predicted PR based on mixing from plate data
for (i in 1: n_resid){

 PR_expected[i] = alpha + AllpHmin[i] * beta + AllInitial[i]* beta2; // calculate the expected value
 PR_Resid[i] = AllPR[i] - PR_expected[i]; // calculate the residuals

 
}



}

model {

// conditional mean
vector[N] mu;
vector[n_resid] mu_resid;

// linear combination
 mu = alpha + x * beta + x2*beta2;

// likelihood function
 y ~ normal(mu, sigma);
 

 // generate model for residuals ~ treatment with plateID as random
 
 mu_resid = alpha_plate + beta_resid*Treatment;
 

 PR_Resid ~ normal(mu_resid, sigma_resid);

 }

generated quantities {
vector[N] y_rep;
vector[n_resid] y_resid;


 for (i in 1:N) {
 // generate predictions
 real y_hat = alpha + x[i] * beta + x2[i] *beta2; // Posterior predictions

 // generate replication values
 y_rep[i] = normal_rng(y_hat, sigma);
} // The posterior predictive distribution


 
}

DIAGNOSTIC(S) FROM PARSER:
Info:
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    PR_Resid ~ normal(...)

PARSER FAILED TO PARSE INPUT COMPLETELY
STOPPED AT LINE 95: 
}


Info:
Left-hand side of sampling statement (~) may contain a non-linear transform of a parameter or local variable.
If it does, you need to include a target += statement with the log absolute determinant of the Jacobian of the transform.
Left-hand-side of sampling statement:
    PR_Resid ~ normal(...)

Error in stanc(file = file, model_code = model_code, model_name = model_name,  : 
  failed to parse Stan model 'stan_modelJamie' due to the above error.

I didn’t catch it the first time, but you have two model blocks in the first version and only one in the second, yet you seem to be trying to run both analyses with the same .stan model – am I getting that right?
Using the posteriors as inputs by approximating them with some prior distribution would have to be done in two separate programs; in between, you would extract the results from the first run and compute whatever quantities you would then pass to the second (via the data block).
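
For instance, here is a minimal sketch of the second-stage program under that scheme. The names resid_mean and resid_sd are illustrative – they would be the per-observation posterior mean and sd of PR_Resid computed from the first fit. Passing both lets you treat each residual as a noisy measurement of a latent “true” residual, so per-observation uncertainty is carried forward rather than lost:

data {
  int<lower=1> n_resid; // sample size for residual model
  int<lower=1> n_treatment; // number of treatments
  array[n_resid] int<lower=1, upper=n_treatment> Treatment; // Treatment variable
  vector[n_resid] resid_mean; // posterior mean of each PR_Resid from the first fit (illustrative name)
  vector<lower=0>[n_resid] resid_sd; // posterior sd of each PR_Resid from the first fit (illustrative name)
}

parameters {
  vector[n_resid] resid_true; // latent "true" residuals
  vector[n_treatment] beta_resid; // treatment effects
  real<lower=0> sigma_resid; // error SD
}

model {
  // measurement-error step: each summary is a noisy observation of its latent residual
  resid_mean ~ normal(resid_true, resid_sd);

  // ANOVA-style model on the latent residuals
  resid_true ~ normal(beta_resid[Treatment], sigma_resid);
}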

Yes, that is correct. I was trying to avoid that suggestion because if I, say, compute the mean of the posterior from the first model run and use it as data in a second model run, then I lose the error associated with each data point, because it is then just a single value. Or am I thinking about this all wrong? Thank you so much for taking the time to help.

Hello @njsilbiger! Perhaps this is something like what you want:

data {
  int < lower = 1 > N; // Sample size for control regression
  int < lower = 1 > n_treatment; // number of treatments
  int < lower = 1 > n_PlateID; // number of plates
  vector[N] y; // Response
  vector[N] x; // Predictor
  vector[N] x2; // Covariate
  int < lower = 1 > n_resid; // Sample size for residual model
  array[n_resid] int<lower = 1, upper=n_treatment> Treatment; // Treatment variable
  vector[n_resid] PlateID; // PlateID for random effect of plate
  vector[n_resid] AllPR; // All final PR
  vector[n_resid] AllInitial; // All initial PR
  vector[n_resid] AllpHmin; // All pHmin


}

parameters {
  real alpha; // Intercept
  real beta; // Slope (regression coefficients for x)
  real beta2; // Slope for x2
  real < lower = 0 > sigma; // Error SD
  real < lower = 0 > sigma_resid; // Error SD for residual model
  vector[n_treatment] beta_resid; // beta for second model
  real alpha_plate; // intercept for second model

}

model {

  // conditional mean
  vector[N] mu;

  // linear combination
  mu = alpha + x * beta + x2*beta2;

  // likelihood function
  y ~ normal(mu, sigma);

  // and another likelihood, this time for AllPR

  vector[n_resid] mu_pr;

  mu_pr = alpha + AllpHmin * beta + AllInitial* beta2 + alpha_plate + beta_resid[Treatment];

  AllPR ~ normal(mu_pr, sigma_resid);

}

generated quantities {
  vector[N] y_rep;
  vector[n_resid] PR_expected;
  vector[n_resid] PR_Resid;
  vector[n_resid] y_resid;


  for (i in 1:N) {
    // generate predictions
    real y_hat = alpha + x[i] * beta + x2[i] *beta2; // Posterior predictions

    // generate replication values
    y_rep[i] = normal_rng(y_hat, sigma);
  } // The posterior predictive distribution


}

Some things I tweaked:

  • You can have a model with two likelihoods at the same time! So here I’m modelling AllPR and y at the very same time. This means the parameters will mutually inform each other – values of, say, beta2 will respond to both y and AllPR. But I’m guessing the sample sizes differ between the two models? If so, the one with the bigger sample size (and/or smaller variance) will have more influence.
  • I vectorized the expression for AllPR, since everything is a vector of size n_resid.
  • I think you wanted beta_resid to be a vector, with one effect for every Treatment, so I made it a vector and used Treatment to index it.

I also had a question about alpha_plate. In theory residuals should be centered on 0 – do you need to have alpha_plate at all?

Anyway hope something there is useful!

Hi Andrew, thank you so much for this! The model you suggested is almost what I am looking for. The biggest difference is that my response variable for the second model is actually what I call “PR_Resid” in the example code (essentially, AllPR (measured values from a new dataset) minus the expected PR value at a given pH) – essentially the residual of the model. Then, in the second model, I am testing whether the residuals differ by treatment. I believe what you have above is testing whether the expected value differs by treatment.

Thank you for the update on beta_resid. That is exactly what I am looking for there! As for your question about alpha_plate, I just have that in as a placeholder for now because I am going to make it a random effect of site once I get past this first hurdle. Thank you again for your help!

There are two different possibilities being discussed here:

  1. Using the difference between the prediction and the data (i.e. “error”) within each sample
  2. Using uncertainty in the parameters from one model posterior (i.e. using all samples) as prior information for the second

#1 is what your original code implies, and what @aammd suggested. The potential issue here is that, if both models are similar, then by using this difference (and therefore implementing both models in one Stan program) you are requiring both models to fit simultaneously, and therefore constraining parameters to values that are potentially quite different from what would best fit each model individually.
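
For concreteness, here is a minimal sketch of #1 with PR_Resid itself as the response, keeping the data and parameters declarations from @aammd’s program above (with beta_resid as a vector, dropping alpha_plate per the point about residuals being centred on zero, and moving the PR_expected / PR_Resid declarations out of generated quantities). Note that since PR_Resid is just the data AllPR shifted by a parameter-dependent mean, this is the same likelihood as @aammd’s AllPR ~ normal(mu_pr, sigma_resid) formulation, and the parser’s Jacobian Info message can be safely ignored here (the transform is a linear shift):

transformed parameters {
  vector[n_resid] PR_expected = alpha + AllpHmin * beta + AllInitial * beta2; // expected PR
  vector[n_resid] PR_Resid = AllPR - PR_expected; // residuals of the control regression
}

model {
  // first likelihood: control regression
  y ~ normal(alpha + x * beta + x2 * beta2, sigma);

  // second likelihood: residuals ~ treatment
  // (equivalent to AllPR ~ normal(PR_expected + beta_resid[Treatment], sigma_resid),
  // so no Jacobian adjustment is needed)
  PR_Resid ~ normal(beta_resid[Treatment], sigma_resid);
}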

With #2, if you compute just the mean of the posterior you will ignore uncertainty, but not if you compute a mean and variance that can be used as a prior; that’s what the PR_Resid ~ normal(mu_resid, sigma_resid) // doesn't have to be normal is for. Since you’d be summarizing the posterior and approximating it with another distribution, you may lose the difference between prediction and data in each sample, although you don’t have to lose uncertainty completely, since you could compute the range of the difference between prediction and data (e.g. $y_i - \mu$, $y_i - [\mu + \sigma]$, $y_i - [\mu - \sigma]$).
However, option #2 is probably better suited to taking into account the uncertainty in the parameters, not the prediction error. The potential issue here is biasing the results by using the data more than once in a way that “forces” them further in the same direction as the first model.
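
As a minimal sketch of #2, you could summarize the first run’s posterior for alpha, beta, and beta2 (e.g. means and sds, computed after the first fit) and pass those summaries in as data to a second program that uses them as priors. The *_mean / *_sd names are illustrative, and independent normals ignore posterior correlations between the parameters, so a multivariate normal approximation may be preferable in practice:

data {
  int<lower=1> n_resid; // sample size for residual model
  int<lower=1> n_treatment; // number of treatments
  array[n_resid] int<lower=1, upper=n_treatment> Treatment; // Treatment variable
  vector[n_resid] AllPR; // All final PR
  vector[n_resid] AllInitial; // All initial PR
  vector[n_resid] AllpHmin; // All pHmin
  real alpha_mean; real<lower=0> alpha_sd; // posterior summaries of alpha (illustrative names)
  real beta_mean; real<lower=0> beta_sd; // posterior summaries of beta
  real beta2_mean; real<lower=0> beta2_sd; // posterior summaries of beta2
}

parameters {
  real alpha;
  real beta;
  real beta2;
  vector[n_treatment] beta_resid; // treatment effects
  real<lower=0> sigma_resid; // error SD
}

model {
  // approximate the first-stage posterior with priors (doesn't have to be normal)
  alpha ~ normal(alpha_mean, alpha_sd);
  beta ~ normal(beta_mean, beta_sd);
  beta2 ~ normal(beta2_mean, beta2_sd);

  // residuals ~ treatment, propagating the first-stage parameter uncertainty
  vector[n_resid] PR_Resid = AllPR - (alpha + AllpHmin * beta + AllInitial * beta2);
  PR_Resid ~ normal(beta_resid[Treatment], sigma_resid);
}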

Going back to your original task…

What is better depends on what you are trying to achieve, and what the limitations of each option are.