Linear regression with data-dependent error variances

Given some measurements with iid errors and some covariates, a linear regression model can be written as

y_n = \alpha + \beta x_n + \epsilon_n \ \ \ \mbox{where} \ \ \ \epsilon_n \sim \mathsf{normal}(0,\sigma)

which can be fitted in Stan as described in the Stan User's Guide.

My question is how you would formulate a model where

\epsilon_n \sim \mathsf{normal}(0,\sigma_n)

because each y_n is measured with its own error? I think this is a common problem, but I am not really sure how to approach it. Grateful for any insight!


A separate sigma term for each observation will cause the model to become very unstable. In the standard model with a single sigma, the model still captures the fact that sometimes you’ll observe small residuals and sometimes you’ll observe larger residuals. Adding a unique sigma for each observation will render all of the sigmas under-informed, which in turn will cause the parameters associated with the mean to become under-informed.

There are cases where it might make sense to model different magnitudes of measurement noise for different observations, but these cases usually involve either very strong prior information or other variables that inform the model.


Known standard deviations \sigma_n create no problems at all. Just include them as data, for example

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  vector<lower=0>[N] sigma;
}

Then in the parameters and model blocks you'd have something like

parameters {
  real a;
  real b;
}
model {
  target += normal_lpdf(y | a + b * x, sigma);
}

with, of course, appropriate priors on a and b.
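For reference, here are the pieces above assembled into one complete program; the normal priors on a and b are illustrative placeholders, to be replaced by priors appropriate to the scale of your data:

```stan
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  vector<lower=0>[N] sigma;  // known measurement error scale of each y[n]
}
parameters {
  real a;
  real b;
}
model {
  // illustrative weakly informative priors; adjust to your scale
  a ~ normal(0, 10);
  b ~ normal(0, 10);
  // each observation gets its own known error scale
  target += normal_lpdf(y | a + b * x, sigma);
}
```

Because normal_lpdf is vectorized over all three arguments, a single statement handles the per-observation sigma[n] values.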

This is a pretty common situation with astronomical data, which usually traces back to counts of photon arrivals. Of course if you don’t know the true uncertainties or at least have a model for how they’re generated this doesn’t work.


I see, thanks a lot!

Ah, good catch. Yes, if there is external information about the distinct measurement noise magnitude expected for each observation, then the model will be most accurate if it is used. Passing the noise scales as data implies complete certainty about their values, which might not be entirely warranted; an alternative is to treat them as parameters with priors centered on their expected values and tapering off relatively quickly on either side.
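A minimal sketch of that alternative, where the data name sigma_guess and the lognormal prior scale 0.2 are illustrative assumptions rather than recommendations:

```stan
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  vector<lower=0>[N] sigma_guess;  // external estimates of each error scale
}
parameters {
  real a;
  real b;
  vector<lower=0>[N] sigma;        // true error scales, now uncertain
}
model {
  // prior centered on the external estimate, tapering quickly on either side;
  // a smaller scale than 0.2 expresses stronger confidence in sigma_guess
  sigma ~ lognormal(log(sigma_guess), 0.2);
  a ~ normal(0, 10);               // illustrative priors
  b ~ normal(0, 10);
  target += normal_lpdf(y | a + b * x, sigma);
}
```

As noted above, these per-observation sigmas are only weakly informed by the data themselves, so how tight you make this prior largely determines how stable the model is.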


It depends what you mean by “error”, and what kind of information you’re trying to model. Here’s something else you may want to consider.

If you think that there is some quantity y (say a number of photons) that you’ve observed, but you’re aware that your observations z are imperfect, then you’d want to specify a probability distribution for that unknown quantity conditional on your observations and on the information you have about measurement quality. Or, to put it slightly differently, you need to specify a probability distribution for the errors of observation (\delta_i = z_i - y_i).

I think there is confusion caused by using the term "error" for different concepts: "error distribution" has its origins in measurement errors (errors of observation), but it quickly came to refer to "errors" about the regression line as well, i.e., variation in y caused by factors not included in your covariates X even when all variables are precisely observed. (At least the history is something like that.) It helps to disentangle these two concepts.

Many applications require something like this, which incorporates both types of “error”:

data {
  int n;
  vector[n] z; // observations or estimates of y
  vector[n] se; // standard errors of the observations
  vector[n] x;
}

parameters {
  vector[n] y;
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model {
 // data model
  z ~ normal(y, se);   // this says, my observations are imperfect
 // process model
  y ~ normal(alpha + x * beta, sigma);   // this says, something in addition to $x$ determines $y$
 // parameter models 
    // put priors on your parameters, including y
}

An important part is your background information about y. You have to think about what values would be reasonable, and then penalize values that are unreasonable. This might look like y \sim \mathsf{normal}(\mu_y, \sigma_y), which places low probability on values more than a couple of standard deviations from the inferred mean \mu_y (resulting in 'shrinkage' towards \mu_y); that may or may not be reasonable for your application area. You could instead use y \sim \mathsf{student\_t}(\nu_y, \mu_y, \sigma_y), or whatever fits. Observations with small standard errors won't budge much, but if you have relatively high uncertainty about an observation and it's something of an outlier, then your prior will matter quite a bit.
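One way the parameter-model comment above might be filled in is sketched below. The hyperparameters mu_y and sigma_y and all the hyperprior scales are hypothetical choices for illustration; whether to add this extra background-information term on y, on top of the process model, depends on your application as discussed in this thread:

```stan
data {
  int<lower=1> n;
  vector[n] z;   // observations or estimates of y
  vector[n] se;  // standard errors of the observations
  vector[n] x;
}
parameters {
  vector[n] y;
  real alpha;
  real beta;
  real<lower=0> sigma;
  real mu_y;               // hypothetical hyperparameters for the
  real<lower=0> sigma_y;   // background prior on y
}
model {
  // data model
  z ~ normal(y, se);
  // process model
  y ~ normal(alpha + x * beta, sigma);
  // parameter models: background information shrinking y towards mu_y,
  // plus illustrative weakly informative priors on the rest
  y ~ normal(mu_y, sigma_y);
  mu_y ~ normal(0, 10);
  sigma_y ~ normal(0, 5);
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma ~ normal(0, 5);
}
```

Note that the second sampling statement on y multiplies an additional density factor into the target, which is one reading of "priors on your parameters, including y"; in many applications the process model alone serves as the prior on y and the extra term is unnecessary.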

The terminology (data model, process model, parameter model) comes from Berliner 1996 I think and is used throughout Cressie and Wikle 2011. Others have different terminology. This way you can make inferences about your process model conditional on your uncertainty of observation and prior information about y.

Berliner, L.M. Hierarchical Bayesian time-series models. In Maximum Entropy and Bayesian Methods; Hanson, K.M.; Silver, R.N., Eds.; Springer Netherlands, 1996.

Cressie, N.; Wikle, C.K. Statistics for Spatio-Temporal Data; Wiley, 2011.

Richardson, S.; Gilks, W.R. A Bayesian approach to measurement error problems in epidemiology using conditional independent models. American Journal of Epidemiology 1993, 138, 430–442.


Thanks for the detailed answer! If I understand correctly, in my case it's really about measurement errors rather than missing covariates. Thanks anyway for the references!