Linear regression with data-dependent error variances

Given some measurements with iid errors and some covariates, a linear regression model can be written as

y_n = \alpha + \beta x_n + \epsilon_n \ \ \ \mbox{where} \ \ \ \epsilon_n \sim \mathsf{normal}(0,\sigma)

which can be fitted in Stan as described in the Stan User's Guide.

My question is how you would formulate a model where

\epsilon_n \sim \mathsf{normal}(0,\sigma_n)

because the y_n are each measured with their own error? I think this is a common problem, but I am not really sure how to approach it… Grateful for any insight!


A separate sigma term for each observation will cause the model to become very unstable. In the standard model with a single sigma, the model still captures the fact that sometimes you’ll observe small residuals and sometimes you’ll observe larger residuals. Adding a unique sigma for each observation will render all of the sigmas under-informed, which in turn will cause the parameters associated with the mean to become under-informed.

There are cases where it might make sense to model different magnitudes of measurement noise for different observations, but these cases usually involve either very strong prior information or other variables that inform the model.


Known variances \sigma_n create no problems at all. Just include them as data, for example

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  vector<lower=0.>[N] sigma;
}

Then in the parameters and model blocks you'd have something like

parameters {
  real a;
  real b;
}
model {
  target += normal_lpdf(y | a + b*x, sigma);
}

with, of course, appropriate priors on a and b.
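For example, weakly informative priors might look like this (the scales here are purely illustrative, not a recommendation):

model {
  a ~ normal(0, 10);   // illustrative weakly informative prior
  b ~ normal(0, 5);    // illustrative weakly informative prior
  target += normal_lpdf(y | a + b*x, sigma);
}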

This is a pretty common situation with astronomical data, which usually traces back to counts of photon arrivals. Of course, if you don't know the true uncertainties, or at least have a model for how they're generated, this doesn't work.


I see, thanks a lot!

Ah, good catch. Yes, if there is external information on the distinct measurement noise magnitude expected for each observation, then the model will be most accurate if it is used. Passing the magnitudes as data implies complete certainty about their values, which might not be entirely warranted; an alternative is to treat them as parameters and use priors that are centered on their expected values and taper off relatively quickly to either side.
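A minimal sketch of that alternative (the name sigma_meas and the 10% prior scale are purely illustrative):

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
  vector<lower=0>[N] sigma_meas; // reported noise estimates
}
parameters {
  real a;
  real b;
  vector<lower=0>[N] sigma;      // true noise scales, now unknown
}
model {
  // prior centered on the reported estimates, tapering off quickly;
  // the 10% relative scale is an arbitrary illustration
  sigma ~ normal(sigma_meas, 0.1 * sigma_meas);
  y ~ normal(a + b * x, sigma);
}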


It depends what you mean by “error”, and what kind of information you’re trying to model. Here’s something else you may want to consider.

If you think that there is some quantity y (say a number of photons) that you’ve observed, but you’re aware that your observations z are imperfect, then you’d want to specify a probability distribution for that unknown quantity conditional on your observations and on the information you have about measurement quality. Or, to put it slightly differently, you need to specify a probability distribution for the errors of observation (\delta_i = z_i - y_i).

I think there is confusion because the term “error” is used for different concepts; “error distribution” has its origins in measurement error (errors of observation), but quickly came to refer to “errors” about the regression line as well, i.e., variation in y that is caused by factors not included in your covariates X, even when all variables are precisely observed. (At least the history is something like that.) I think it helps to disentangle these two concepts.

Many applications require something like this, which incorporates both types of “error”:

data {
  int<lower=1> n;
  vector[n] z;  // observations or estimates of y
  vector[n] se; // standard errors of the observations
  vector[n] x;
}

parameters {
  vector[n] y;
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model {
  // data model
  z ~ normal(y, se);  // this says: my observations are imperfect
  // process model
  y ~ normal(alpha + x * beta, sigma);  // this says: something in addition to x determines y
  // parameter models
  // put priors on your parameters, including y
}

An important part is your background information about y. You just have to think about what values would be reasonable, and then penalize values that are unreasonable. This might look like y \sim \mathsf{normal}(\mu_y, \sigma_y), which places low probability on values a couple of standard deviations away from the inferred mean \mu_y (resulting in ‘shrinkage’ towards \mu_y); this may or may not be reasonable for your application area. You could use y \sim \mathsf{student\_t}(\nu_y, \mu_y, \sigma_y), or whatever… Observations with small standard errors won't budge much, but if you have relatively high uncertainty about an observation and it's something like an outlier, then your prior will matter quite a bit.
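To make that concrete, here is one way to fill in the parameter-model lines of the block above. The prior scales are purely illustrative; note that the process-model line already acts as the prior on y, shrinking it toward the regression line:

model {
  // data model
  z ~ normal(y, se);
  // process model (doubles as the prior on y)
  y ~ normal(alpha + x * beta, sigma);
  // parameter models: weakly informative priors with illustrative scales
  alpha ~ normal(0, 10);
  beta ~ normal(0, 5);
  sigma ~ normal(0, 5);  // half-normal via the <lower=0> constraint
}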

The terminology (data model, process model, parameter model) comes from Berliner 1996 I think and is used throughout Cressie and Wikle 2011. Others have different terminology. This way you can make inferences about your process model conditional on your uncertainty of observation and prior information about y.

Berliner, L.M. Hierarchical Bayesian time-series models. In Maximum Entropy and Bayesian Methods; Hanson, K.M.; Silver, R.N., Eds.; Springer Netherlands, 1996.

Cressie, N.; Wikle, C.K. Statistics for Spatio-Temporal Data; Wiley, 2011.

Richardson, S.; Gilks, W.R. A Bayesian approach to measurement error problems in epidemiology using conditional independent models. American Journal of Epidemiology 1993, 138, 430–442.


Thanks for the detailed answer! If I understand correctly, in my case it's really about measurement errors rather than missing covariates. Thanks anyway for the references!

I have a similar problem; let me give an example.

I have incomes for individuals, and it's expected that the variation in incomes (from the mean) is higher for older individuals compared to younger ones. Of course, there may be a covariate that would explain this difference in variation if I knew about it, but at this point I don't have information about such a covariate.

In this setting, how do we deal with this in a Bayesian model under two situations:

  1. I know a few covariates on which the variation depends, but I don't know how.
  2. I don't even know which features the variation depends on.

If this were to go the semi-parametric route, how would that be formulated?

Looking forward to some ideas here.

For situation 1, you can set up a regression for your scales.
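As a minimal sketch (the variable names and single covariate are illustrative), you could model the log of the scale as a linear function of a covariate, with the log link keeping the scale positive:

data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real a;  // intercept for the mean
  real b;  // slope for the mean
  real c;  // intercept for the log scale
  real d;  // slope for the log scale
}
model {
  y ~ normal(a + b * x, exp(c + d * x));
}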

For situation 2, you can set up a mixture prior, but you'll at least need to guess at the number of components. Another alternative would be to use a very highly dispersed prior on the per-observation scales, like a Cauchy.
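A sketch of the dispersed-prior alternative, assuming the same data block as above (the half-Cauchy scale of 1 is illustrative):

parameters {
  real a;
  real b;
  vector<lower=0>[N] sigma;  // one scale per observation
}
model {
  sigma ~ cauchy(0, 1);  // half-Cauchy via the <lower=0> constraint
  y ~ normal(a + b * x, sigma);
}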

As @mike-lawrence pointed out, this can get hard to fit, even if you have some predictors.

P.S. Regarding vector<lower=0.>[N] sigma; from earlier in the thread: you don't need the period there. The only place Stan cares about whether you have a period is integer division: 1 / 3 evaluates to the integer 0, whereas 1.0 / 3 evaluates to the double-precision representation of one third.