Why exactly is alpha + beta * x ~ normal(y, sigma) wrong? Is it even wrong?

Long time lurker, first time poster. All the normal disclaimers - I looked around to see if I could find something similar and didn’t, so apologies if I missed another post, or famous blog entry, or even something in the Stan manual.

But folks, this is vexing me.

I was working with someone new to Stan on a model and I saw them write something like this in the model block:

alpha + beta * x ~ normal(y, sigma);

And of course, that stood out to me as obviously “wrong.” But here’s the thing:
I ran it both ways on some simulated data, and it … works?
stan_mdl1.csv (3.7 KB)

Model 1: The Right Way
Here’s the model from the guide.

  data {
    int N;
    vector[N] y;
    vector[N] x;
  }
  parameters {
    real alpha;
    real beta;
    real<lower = 0> sigma;
  }
  model {
    y ~ normal(alpha + beta * x, sigma);
  }

Model 2: The Wrong Way?

  data {
    int N;
    vector[N] y;
    vector[N] x;
  }
  parameters {
    real alpha;
    real beta;
    real<lower=0> sigma;
  }
  model {
    alpha + beta * x ~ normal(y, sigma);
  }

Now Model 2 kicks the warning about a possible non-linear transform on the left-hand side of the sampling statement. But I know this transform isn’t non-linear (and I could provide a Jacobian adjustment if it were!).

And the punch line is that the estimates from both models are nearly identical.

So, here are my questions:
1a.) Is there a good reason to prefer Model 1 over Model 2?

1b.) Would we say that Model 2 is ‘using data (y) in the prior (for alpha and beta)’, and that’s what’s ultimately wrong about it? The usual complaint about doing that is that we’re ‘using the data twice’, and I don’t think that’s happening here?

2.) Is this a trick of a simplified setting (linear, normal, etc.)?

Some related reading for the interested:
Yes, you can include prior information on quantities of interest, not just on parameters in your model

I’m not 100% sure this is exactly the same situation, but you could say that alpha + beta * x is, in fact, a quantity of interest!

Thanks in advance for your thoughts


This is an accident of symmetric distributions: the two statements wind up being the same thing.
Normal distribution model:

y_i \sim \mathcal{N}(\alpha + \beta x_i, \sigma)
p(y_i \mid x_i, \alpha, \beta, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}\frac{(y_i - (\alpha + \beta x_i))^2}{\sigma^2}\right)

What if we switched the roles of y and \hat \mu ?

\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}\frac{((\alpha + \beta x_i) - y_i)^2}{\sigma^2}\right)

Or to make it clearer:

\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}\frac{\delta_i^2}{\sigma^2}\right)
\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\tfrac{1}{2}\frac{(-\delta_i)^2}{\sigma^2}\right)
\delta_i = y_i - \hat\mu_i = y_i - (\alpha + \beta x_i)

As a result, they wind up being the same, in this particular case for this particular distribution, because swapping the order of y_i and \hat\mu_i in the normal density function doesn’t matter; it evaluates to the same value.
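You can check the symmetry directly in Stan itself; here’s a small sketch with made-up numbers (not part of either model) that evaluates normal_lpdf both ways:

  generated quantities {
    // made-up numbers purely to illustrate the symmetry
    real lp_forward = normal_lpdf(2.5 | 1.0, 0.7);   // observation first, location second
    real lp_swapped = normal_lpdf(1.0 | 2.5, 0.7);   // roles of observation and location swapped
    // lp_forward and lp_swapped come out identical
  }

Running it (e.g. with the fixed-parameter sampler) prints the same value for both.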


Perhaps worth noting that we use this identity in Gaussian measurement error models. Suppose our measurement yields a Gaussian posterior distribution for y; i.e. y \sim \mathcal{N}(\mu, \sigma). Note that y is unobserved. Thus, in the measurement error model we treat \mu as data and flip the Gaussian:
\mu \sim \mathcal{N}(y, \sigma)
y \sim ...,
where ... is the rest of the model.
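A minimal Stan sketch of that setup, keeping the same symbols (theta and tau below are just placeholders standing in for the rest of the model):

  data {
    int N;
    vector[N] mu;                // measured values, treated as data
    vector<lower=0>[N] sigma;    // known measurement standard deviations
  }
  parameters {
    vector[N] y;                 // unobserved true values
    real theta;                  // placeholder: rest of the model
    real<lower=0> tau;
  }
  model {
    mu ~ normal(y, sigma);       // the flipped Gaussian
    y ~ normal(theta, tau);      // ... the rest of the model
  }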


Thanks to @Stephen_Martin for an excellent answer! And to @jsocolar for the interesting extension.

So, this is a ‘trick’ of symmetric distributions.

I checked it out with a Gamma GLM to confirm.

Model 1: The Right Way - works as expected

  data {
    int N;
    vector[N] y;
    vector[N] x;
  }
  parameters {
    real alpha;
    real beta;
    real<lower = 0> shape;
  }
  model {
    y ~ gamma(shape, shape ./ (alpha + beta * x));
  }

Model 2: The Wrong Way - Divergent transitions, fails to mix, incorrect estimates

  data {
    int N;
    vector[N] y;
    vector[N] x;
  }
  parameters {
    real alpha;
    real beta;
    real<lower = 0> shape;
  }
  model {
    alpha + beta * x ~ gamma(shape, shape ./ y);
  }

stan_mdl2.csv (3.6 KB)
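For contrast with the normal case, a quick check with made-up numbers (just a sketch) shows the gamma density is not symmetric under the same swap:

  generated quantities {
    real shape = 2.0;
    // "right way": observation first, rate = shape / mean
    real lp_right = gamma_lpdf(4.0 | shape, shape / 6.0);
    // "wrong way": mean first, rate = shape / observation
    real lp_wrong = gamma_lpdf(6.0 | shape, shape / 4.0);
    // lp_right and lp_wrong differ, which is why Model 2 goes wrong here
  }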


Also, when in doubt, recall that
a ~ dist(b, c);
is the same as
target += dist_lpdf(a | b, c);
for any continuous distribution “dist”, up to an additive constant that doesn’t depend on the parameters (the ~ form drops normalizing constants it doesn’t need).

The point is that you should always be able to figure out what the code is doing by thinking of these ~ statements as increments to the target log density.
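Applying that to the original question, the model block of Model 2 could equally be written as

  model {
    target += normal_lpdf(alpha + beta * x | y, sigma);
  }

which makes it explicit that the term being added is the same normal term in (y - (alpha + beta * x)) and sigma that Model 1 adds, up to a constant that doesn’t depend on the parameters.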
