Estimating slope and intercept (linear regression) for data x vs. y with uncertainties in both x and y

Hi,

I’m very new to Stan. I have a data set of x vs. y with estimated errors (sigma_x and sigma_y), and I want to estimate a slope and intercept (with a probability distribution) that best explains my data. I wrote the following model, and it generates slope and intercept distributions that make sense to me, but I need some trained eyes to have a look if possible. In particular, I just want to make sure there isn’t anything improper about the way I tried to incorporate sigma_x into the model by generating the parameter x_new and then sampling y based on x_new. Also, I give alpha and beta normal priors in the model block, but the posterior estimates naturally center around different values (0.48 and 0.001 in this case). Am I doing something unnecessary here?

I use the mean of y_pred to calculate a best-estimate line through the data, and the std of y_pred to generate confidence bands around it. I also generated y_pred2, which takes into account the variance in x. I use y_pred2 to generate a cloud of all realizations behind the data, which in my mind provides a better estimate of the overall uncertainty of the linear regression.

I would like to know if there is anything non-sensical about what I’m doing.

Thank you.

```stan
data {
int<lower=0> N; //extract parameter N from data
vector[N] x; //extract x variable from data
vector[N] sigma_x; //extract uncertainty in x from data
vector[N] y; //extract y variable from data
vector[N] sigma_y; //extract uncertainty in y from data
}
parameters {
vector[N] x_new;
real beta; //declare beta0 as continuous unbounded probability distribution
real alpha; //declare beta1 as continuous unbounded probability distribution
}
model {
beta ~ normal(0, 100); //prior on beta0
alpha ~ normal(1, 1); //prior on beta1
x_new ~ normal(x, sigma_x); //account for error in x
y ~ normal(beta + alpha * x_new, sigma_y); //likelihood
}
generated quantities{
vector[N] y_pred; //y_pred vector
vector[N] y_pred2; //y_pred2 vector
y_pred = beta + alpha * x; //posterior for “true y” given x
y_pred2 = beta + alpha*x_new;
}
```

Hello!

I am a beginner with errors-in-variables models. I would just point out that there are at least two kinds of model here (they have names, but I cannot find their designations anymore).

There is an R blogger post fitting a Bayesian error model of the form:

x_{obs} \sim N(x_{true}, \sigma_x)

The x error part of the model you are fitting is formulated as follows:

x_{true} \sim N(x_{obs}, \sigma_x)

Unfortunately, I do not know the strengths and weaknesses of these two formulations. I find the former more intuitive, however: the relationship between the true and observed predictors is Gaussian noise, the observed values being random realisations of an underlying distribution defined by a true mean and an error term.
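In Stan terms (borrowing the variable names from the model above, where x_new plays the role of x_{true}), the two formulations would roughly be:

// blog-post formulation: the observed x is a noisy draw around the latent true value
x ~ normal(x_new, sigma_x);

// formulation in the model above: the latent true value is drawn around the observed x
x_new ~ normal(x, sigma_x);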

I have to think a little bit more about the differences between y_pred and y_pred2. Given that your parameters are estimated conditionally on x_new, I am not sure about the interpretation of y_pred. I do agree that the variation around y_pred2 represents the spectrum of realizations that could be observed in the world, given your measurement error.

About your priors, the scale of the prior on beta seems pretty big (sd = 100); just verify it is appropriate for your data.

EDIT : typo

The first thing @ldeschamps posted is the right one I think (x_{obs} \sim N(x_{true}, \sigma_x)).

You’ll want to use:

x ~ normal(x_new, sigma_x);
y ~ normal(beta + alpha * x_new, sigma_y);

In your model.

Have a look at 14.1 (Bayesian Measurement Error Model) in the 2.17.0 manual for where this comes from. Good luck!
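Putting that together with the data and parameters blocks you already have, the model block would look something like this (just a sketch, keeping your original priors):

model {
  beta ~ normal(0, 100);                      // prior on the intercept
  alpha ~ normal(1, 1);                       // prior on the slope
  x ~ normal(x_new, sigma_x);                 // observed x is a noisy measurement of the latent x_new
  y ~ normal(beta + alpha * x_new, sigma_y);  // likelihood
}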

x_new is a parameter, and is sampled by the MCMC chain. y is data, and is not sampled in this model. Might seem pedantic, but it’s easy to mess things up so best to get the words right.

Confidence intervals are different things, or so they tell me; “posterior intervals” is a nice way of saying it. Since I already started being pedantic, I figured I’d point this out :D. I get what you mean though.

You’ll want to take all the noises into account in your posterior predictives. So that means:

generated quantities {
  vector[N] x_pred;
  vector[N] y_pred;
  for(n in 1:N) {
    x_pred[n] = normal_rng(x_new[n], sigma_x); // Might as well look at x_predictions as well
    y_pred[n] = normal_rng(beta + alpha * x_new[n], sigma_y); // Use x_new here -- these are samples of your estimates of the "true" x
  }
}

You can, of course, look at bits and pieces of this but best to generate the full thing so you don’t underestimate your posterior intervals.

I have a similar model with just a slight twist. I assume x and y are observed with known uncertainties that are different for each observation. The latent y is linear in the latent x + an additive error with unknown standard deviation. Here’s the full Stan model

/**
 * Simple regression with measurement error in x and y
*/

data {
  int<lower=0> N;
  vector[N] x;
  vector<lower=0>[N] sd_x;
  vector[N] y;
  vector<lower=0>[N] sd_y;
} 

// standardize data
transformed data {
  real mean_x = mean(x);
  real sd_xt = sd(x);
  real mean_y = mean(y);
  real sd_yt = sd(y);
  
  vector[N] xhat = (x - mean_x)/sd_xt;
  vector<lower=0>[N] sd_xhat = sd_x/sd_xt;
  
  vector[N] yhat = (y - mean_y)/sd_yt;
  vector<lower=0>[N] sd_yhat = sd_y/sd_yt;
  
}


parameters {
  vector[N] x_lat;
  vector[N] y_lat;
  real beta0;
  real beta1;
  real<lower=0> sigma;
}
transformed parameters {
  vector[N] mu_yhat = beta0 + beta1 * x_lat;
}
model {
  beta0 ~ normal(0., 2.);
  beta1 ~ normal(0., 5.);
  sigma ~ normal(0., 2.);
  
  xhat ~ normal(x_lat, sd_xhat);
  y_lat ~ normal(mu_yhat, sigma);
  yhat ~ normal(y_lat, sd_yhat);
  
} 

generated quantities {
  vector[N] xhat_new;
  vector[N] yhat_new;
  vector[N] y_lat_new;
  vector[N] x_new;
  vector[N] y_new;
  vector[N] mu_x;
  vector[N] mu_y;
  real b0;
  real b1;
  
  // back-transform the coefficients to the original data scale (see the note after this program)
  b0 = mean_y + sd_yt*beta0 - beta1*sd_yt*mean_x/sd_xt;
  b1 = beta1*sd_yt/sd_xt;
  
  mu_x = x_lat*sd_xt + mean_x;
  mu_y = mu_yhat*sd_yt + mean_y;
  
  for (n in 1:N) {
    xhat_new[n] = normal_rng(x_lat[n], sd_xhat[n]);
    y_lat_new[n] = normal_rng(beta0 + beta1 * x_lat[n], sigma);
    yhat_new[n] = normal_rng(y_lat_new[n], sd_yhat[n]);
    x_new[n] = sd_xt * xhat_new[n] + mean_x;
    y_new[n] = sd_yt * yhat_new[n] + mean_y;
  }
}
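For reference, the b0 and b1 expressions in generated quantities just undo the standardization. Since yhat = (y - mean_y)/sd_yt and xhat = (x - mean_x)/sd_xt, the fitted line on the standardized scale, mu_yhat = beta0 + beta1 * x_lat, maps back to

y = mean_y + sd_yt*beta0 + beta1*(sd_yt/sd_xt)*(x - mean_x)

on the original scale, so b1 = beta1*sd_yt/sd_xt and b0 = mean_y + sd_yt*beta0 - b1*mean_x, which is what the code computes.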

Here’s some R code to create a reproducible example:

testme <- function(N=200, b0=10, b1=1, sigma=0.1, seed1=1234, seed2=23455) {
  require(rstan)
  require(ggplot2)
  set.seed(seed1)
  x_lat <- rnorm(N)
  sd_x <- runif(N, max=0.2)
  sd_y <- runif(N, max=0.2)
  x_obs <- x_lat+rnorm(N, sd=sd_x)
  y_lat <- b0 + b1*x_lat + rnorm(N, sd=sigma)
  y_obs <- y_lat + rnorm(N, sd=sd_y)
  stan_dat <- list(N=N, x=x_obs, sd_x=sd_x, y=y_obs, sd_y=sd_y)
  df1 <- data.frame(x_lat=x_lat, x=x_obs, sd_x=sd_x, y_lat=y_lat, y=y_obs, sd_y=sd_y)
  sfit <- stan(file="ls_me.stan", data=stan_dat, seed=seed2, chains=4)
  print(sfit, pars=c("b0", "b1", "sigma"), digits=3)
  g1 <- ggplot(df1)+geom_point(aes(x=x, y=y))+geom_errorbar(aes(x=x,ymin=y-sd_y,ymax=y+sd_y)) +
                      geom_errorbarh(aes(y=y, xmin=x-sd_x, xmax=x+sd_x))
  post <- extract(sfit)
  df2 <- data.frame(x_new=as.numeric(post$x_new), y_new=as.numeric(post$y_new))
  g1 <- g1 + stat_ellipse(aes(x=x_new, y=y_new), data=df2, geom="path", type="norm", linetype=2, color="blue")
  plot(g1)
  list(sfit=sfit, stan_dat=stan_dat)
}

which should produce this graph:

Hi, thanks for the response.

I think bbbales2 is correct that the first version you wrote is the right one. Using the first version means the x observation is a random sample from the true x, whereas the second version assumes the observation x represents the population mean, which we do not necessarily know. This is my current understanding. I should note that I tried both and observed only a very small change in the posterior.

I’m still a little confused about y_pred (see my reply below to bbbales).

Thanks for the responses. I think I understand why x_obs ~ N(x_true, sigma_x) is the correct one. I’m still a little confused about y_pred. I see the need to incorporate the uncertainty in x into the posterior. However, the biggest difference I saw between the code you suggested and what I was using stems from the fact that your version has normal_rng sampling inside a for loop. When I write the code as y_pred = beta + alpha * x_new (element-wise calculation, no for loop), the posterior is dramatically different (much narrower). I’m now confused about how the sampling works in the two different versions. Is there no sampling in the version that I wrote? Where do the 10000 iterations (what I’m using) come from then?

PS. I very much appreciate the naming corrections by the way.

Fair question

beta + alpha * x_new[n]

In generated quantities will give you predictions of the posterior mean of y. But in your model you say

y ~ normal(beta + alpha * x_new, sigma_y);

Which means if you want to generate data that looks like your measurements y, you need to have the normal_rng there.

The posterior on the mean of y and y itself will have different variances. They’re different things. Maybe doing some code experiments with normal(0, 1) is the easiest way to see what’s happening.
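For example, here’s a minimal R sketch of that kind of experiment (the numbers are made up, purely to show the two spreads):

# made-up numbers, purely to illustrate the difference in spread
mu_draws <- rnorm(10000, mean = 0.5, sd = 0.05)       # stand-in for posterior draws of beta + alpha * x_new at one point
y_draws  <- rnorm(10000, mean = mu_draws, sd = 0.25)  # add the measurement noise sigma_y on top of each draw
sd(mu_draws)  # narrow: uncertainty about the mean of y only
sd(y_draws)   # wider: mean uncertainty plus the observation noise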

I think I understand what you mean, but I don’t think that is what I want in this application. I’m interested in the posterior for the linear model, or to rephrase it, a posterior on the predictive capability of the mean slope and intercept. The code you wrote, if I understand it correctly, provides a posterior for individual data pairs, meaning that if we were to measure another [x_obs, y_obs], it would fall in the posterior your code generates. My code uses one estimate of alpha and beta to calculate y[1:N] = alpha*x[1:N] + beta. In yours, doesn’t the normal_rng in the for loop result in a different alpha and beta for each point?

The reason I think you don’t want this is that as you add lots and lots of data, your posteriors will probably get very tight on these parameters, but that won’t reflect the actual uncertainty you have about your model.

The normal_rng doesn’t affect the alphas and betas. The alphas and betas are just as different between each generated quantities call in either version of the code.

This is probably a case where an experiment is in order.

Maybe do something like this: generate data for a linear regression, and in one case use 50 data points and in the other use 200 (use the same values for a, b, and sigma). Make sure sigma is pretty large.

Maybe a = 1, b = 0, sigma = 0.25?

x = seq(0, 1, length = 50)
y = a * x + b + rnorm(length(x)) * sigma;

In both cases fit this in Stan and look at the posterior plots for

for(n in 1:N) {
  mu_hat[n] = a * x[n] + b;
  y_hat[n] = normal_rng(a * x[n] + b, sigma);
}

And compare between the 50 data points posterior and the 200 datapoints posterior. That should show the difference.

The thing you’ll see is that your posterior intervals on mu_hat will continue to shrink as you add more data, while y_hat should always kind of represent your data (to the extent the model can), whether there’s a lot or a little.
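For completeness, a minimal Stan program for that experiment might look like this (just a sketch; the priors are placeholders):

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real a;
  real b;
  real<lower=0> sigma;
}
model {
  a ~ normal(0, 1);     // placeholder priors
  b ~ normal(0, 1);
  sigma ~ normal(0, 1);
  y ~ normal(a * x + b, sigma);
}
generated quantities {
  vector[N] mu_hat;
  vector[N] y_hat;
  for (n in 1:N) {
    mu_hat[n] = a * x[n] + b;                   // posterior draw of the mean of y at x[n]
    y_hat[n] = normal_rng(a * x[n] + b, sigma); // posterior predictive draw
  }
}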

I’m trying to understand the discussion, but I still cannot make a model that works on a simple example.
This code would work if I removed the x modeling.
Here is the model I thought would work, but it does not. Can anyone explain what’s wrong?

data {
    int<lower=0> N; // number of points
    vector[N] x;      // x values
    vector[N] y;      // y values
    vector[N] sy;     // y uncertainties
    vector[N] sx;     // x uncertainties
}

parameters {
    real alpha_perp;
    real<lower=-pi()/2, upper=pi()/2> theta;
}

transformed parameters {
    real alpha;
    real beta;
    vector[N] ymodel;
    vector[N] xmodel;
    
    alpha = alpha_perp / cos(theta);
    beta = sin(theta);
    // for (j in 1:N)
    //     ymodel[j] = alpha + beta * xmodel[j];
}

model {
    x ~ normal(xmodel, sx);
    y ~ normal(beta + alpha * xmodel, sy);
   
}

In your transformed parameters block, you have

//     ymodel[j] = alpha + beta * xmodel[j];

but in your model block, the positions of beta and alpha are reversed:

y ~ normal(beta + alpha * xmodel, sy);

Also, xmodel isn’t defined with respect to other parameters, so it belongs in the parameters block, not the transformed parameters block.


xmodel and ymodel should be in the parameters block.

Good point, though since ymodel isn’t used in the model block, one can get rid of it altogether.

Thanks (there was some residue from earlier attempts, hence the ymodel that was not used). I think moving xmodel to the parameters block was the solution. The following runs but does not give a correct result.

data {
    int<lower=0> N; // number of points
    vector[N] x;      // x values
    vector[N] y;      // y values
    vector[N] sy;     // y uncertainties
    vector[N] sx;     // x uncertainties
}

parameters {
    real alpha_perp;
    real<lower=-pi()/2, upper=pi()/2> theta;
    vector[N] xmodel;
}

transformed parameters {
    real alpha;
    real beta;
    
    alpha = alpha_perp / cos(theta);
    beta = sin(theta);
}

model {
    xmodel ~ normal(x, sx);
    y ~ normal(alpha + beta * xmodel, sy);
}

The following gives an approximately correct result, but the prior is different by construction:

data {
    int<lower=0> N; // number of points
    vector[N] x;      // x values
    vector[N] y;      // y values
    vector[N] sy;     // y uncertainties
    vector[N] sx;     // x uncertainties
}

parameters {
    real alpha;
    real beta;
    vector[N] xmodel;
}

transformed parameters {
}

model {
    alpha ~ normal(0, 100);
    beta ~ normal(0, 100);
    xmodel ~ normal(x, sx);
    y ~ normal(alpha + beta * xmodel, sy);
}

Thanks, I guess this last issue is on my side.

@Michael_Peck I’m building a model exactly like yours (regression with uncertainty in both x and y).

We have the posterior distribution for alpha and beta, which makes it easier to predict y given a new x_lat. However, I’m a bit stuck on how to predict y given a new x.

I’ve been writing an online diary (aka blog) for a while now and wrote a post about that model not long ago: link. There’s a reproducible fake data example there and a real data set as well.

The specific application I had in mind involved astronomical data where it’s pretty common to have nominally known uncertainties in all measurements.

3 Likes