Can I propagate uncertainty by running in the model block a submodel for X that informs the main model for Y?

Is it possible to run two models in the block model and use the result of one as data to the other?

I aim to propagate uncertainty on a predictor x in a linear model (y=a+bx). I know how to do it using a mean and sd (see for example the Stan manual). However, I was wondering if I could use all the x data, model the mean of x for each group of interest (individuals in the following example) within Stan, and use this mean in the model predicting an outcome y. I wrote some Stan code that compiles, but I run into errors when fitting it to data. I think I have issues with indexing… unless this is just not possible.

Here is my inglorious Stan code:

data {
  int<lower=0> N;                       // number of observations
  vector[N] y;                          // vector of observations
  int<lower=0> N_ind;                   // number of individuals
  array[N] int<lower=1> individual;     // index of individuals
  int<lower=0> N_x;                     // number of predictor observations
  vector[N_x] x;                        // predictor observations (multiple obs per individual)
  array[N_x] int<lower=1> individual_x; // index of individuals for x (index has to match with individual above)
}

parameters {
  vector[N_ind] x_bar;
  real<lower=0> sigma_x;
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model {
  // model for mean x per individual
  x_bar ~ normal(0, 1);                     // prior
  sigma_x ~ exponential(1);                 // prior
  x ~ normal(x_bar[individual_x], sigma_x); // likelihood 1
  // model for y
  alpha ~ normal(0, 1);                                        // prior
  beta ~ normal(0, 1);                                         // prior
  sigma ~ exponential(1);                                      // prior
  for (i in 1:N) {
    y[i] ~ normal(alpha + beta * x_bar[individual[i]], sigma); // likelihood 2
  }
}

Yes, and indeed this is the ideal approach.

Can you describe your data and the process by which it was collected in a bit more detail?

Well, the code I show above was an attempt at a toy example/theoretical question. But reading that you asked for details on the data and collection process, I now see the kind of rabbit hole I might be about to jump into!

For example, as I work with ecological/environmental data, a typical dataset I could be working on would involve something measured once at multiple locations along a climatic gradient, and building a model to predict this thing with climate variables (for example mean annual temperature, MAT). As I expect the quantity of the measured thing to be the result of the climate of the past 30 years, but I did not measure the thing every year for 30 years, I could take the mean of the 30 MAT values from the years preceding sampling and use that as a proxy for the average temperature at the location… However, I know climate varies from year to year, and I would like to account for this variation as an uncertainty around the mean of MAT. The model I attempted above (intercept only for x) would be the simplest way I could think of to do that. However, it ignores that MAT is actually a time series in this example, so things could quickly become a bit more complex!

This generalization might help: if you can make observations of x multiple times under conditions where the true value for x shouldn’t be different, but the measurements of x are in fact different due to unmodeled phenomena, then including a measurement error model of x and using the parameter reflecting the true x as the regressor for y will be optimal.
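To illustrate the intuition (a minimal sketch with hypothetical values, not your actual data): repeated measurements of a fixed true x scatter around it, and a measurement-error model recovers both a point estimate and the uncertainty to propagate into the regression on y.

```python
import numpy as np

rng = np.random.default_rng(1)

true_x = 2.0      # latent true value (hypothetical)
sigma_meas = 0.5  # measurement noise (hypothetical)
n_rep = 30        # repeated measurements under identical conditions

# Measurements differ only because of unmodeled noise around the true x
x_obs = rng.normal(true_x, sigma_meas, size=n_rep)

# The sample mean estimates the true x; its standard error is the
# uncertainty a measurement-error model would propagate into y = f(x)
x_hat = x_obs.mean()
se = x_obs.std(ddof=1) / np.sqrt(n_rep)
```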

If your x measurements are such that each is expected to vary in the true value for x, then you’re out of luck and have to stick with measured x as your regressor for y.

I kind of see what you mean, but the climate variable example bugs me. I expect each yearly MAT to be different, but the x I want to use is the mean of 30 MAT. Since it’s a mean, it has associated uncertainty which may be of interest in the model y=f(x)… I think about locations with wide ranges of MAT vs others with narrower ranges.

I should correct my generalization: whenever a collection of x measurements can be modeled with a GLM that includes an intercept, you can then use that intercept to predict any y that is associated with that collection. This includes:

x_obs[,collection_i] ~ normal( intercept , noise );

x_obs[,collection_i] ~ normal( intercept + beta*x_preds[,collection_i], noise );

and many other configurations of distribution (so, not just normal) and x-predictive model (many betas, or Gaussian process, etc).
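For the intercept-only normal case, the maximum-likelihood value of the intercept is just the per-collection mean of x_obs; a Bayesian fit would additionally carry posterior uncertainty around it. A quick numpy sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)

n_collections, n_obs = 4, 30
# True per-collection intercepts (hypothetical)
intercepts = rng.normal(0.0, 1.0, size=n_collections)
# Columns are collections, mirroring x_obs[, collection_i]
x_obs = rng.normal(intercepts, 0.5, size=(n_obs, n_collections))

# MLE of intercept in x_obs[, collection_i] ~ normal(intercept, noise)
# is the column mean of that collection's measurements
intercept_hat = x_obs.mean(axis=0)
```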

Thank you for your insights @mike-lawrence. Do you, or anyone else, have any clue about what may be wrong in my Stan code in the initial post?

I figured out the issue in my Stan code. I was trying to vectorize the submodel for x (“likelihood 1”), but it doesn’t work since x has a different length than y…

This line:

x ~ normal(x_bar[individual_x], sigma_x); // likelihood 1

should be:

for (j in 1:N_x) {
  x[j] ~ normal(x_bar[individual_x[j]], sigma_x); // likelihood 1
}
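For anyone wanting to try this end to end, here is one way to simulate toy data matching the data block above (Python, with arbitrary parameter values), which is handy for checking the indexing:

```python
import numpy as np

rng = np.random.default_rng(42)

N_ind = 5         # number of individuals (hypothetical)
obs_per_ind = 10  # x measurements per individual (hypothetical)

# Submodel for x: each individual has a latent mean x_bar
x_bar_true = rng.normal(0.0, 1.0, size=N_ind)
individual_x = np.repeat(np.arange(1, N_ind + 1), obs_per_ind)  # 1-based, as Stan expects
x = rng.normal(x_bar_true[individual_x - 1], 0.3)

# Main model for y: one observation per individual here
alpha, beta, sigma = 0.5, 1.2, 0.2
individual = np.arange(1, N_ind + 1)
y = rng.normal(alpha + beta * x_bar_true[individual - 1], sigma)

stan_data = dict(
    N=len(y), y=y, N_ind=N_ind, individual=individual,
    N_x=len(x), x=x, individual_x=individual_x,
)
```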