Linear model with generated quantities - ypred issues?

Hello all,

I am still slowly moving along learning PyStan and shifting our models over. I am trying to understand how the generated quantities work. I have a simple linear model of ocean fish catch plotted against ocean net primary production (NPP). Previous work shows that this can be modeled roughly with a simple linear model. I also want to see how well the model does. I added a generated quantities section to the model.

I log transformed the data. Which thinking about this might be my issue with the model.

When I run fit.plot([‘y_pred’]) that values look weird to me. They range 4 to 6 while the original data goes from -8 to 2.

The juypter notebook and raw data are here

The model I used was this:

regress_code = “”"

data {
int<lower=0> N;
int N2;
vector[N] y;
vector[N] x;
}

parameters {
real alpha;
real beta;
real<lower=0> sigma;
}

model {
alpha ~ normal(0,10);
beta ~ cauchy(0,5);
sigma ~ cauchy(0,5);
for (n in 1:N)
y[n] ~ normal(alpha + beta * x[n], sigma);
}

generated quantities {
real y_pred;

y_pred = normal_rng(alpha + beta * x[N2], sigma);

}

“”"

I picked cauchy for beta since in the worse case the slope could be zero but likely it’s some positive number.

Did I set my priors up correctly and ypred? Or should y_pred not be normal_rng? Should I log transform before the model or specify that in the model?

Thank you!
ara

1 Like

I may have made some headway after diving back into the Stan manual and some old Stan forum posts.

looks like you’re using only one value of the covariate x by setting x[N2] outside a loop.

i’d do something like this:

1 Like

Ok, let me try this and see if I can figure out how all this is working.

Thanks for taking the time!

Ara

@HeltonMaciel Thanks! I think I have it working. If I understand correctly I was only generating one y or a given x. With the for loop I get all the y’s for a given x.

I also went back to just a simple linear model and then built up the model to a generated quantity model.

Ara

Do the y_pred not look like the original y even if you simulate the original y according to the model?

In the raw data y goes from 0.000305 to 9.3 (mean ten year fish catch, tons per km2).

y_pred (mean) only covers part of this range 0.39 to 2.63. The max y_pred covers the upper range but the min y_pred goes negative. I found that weird since you can’t have a negative tonnage.

This leaves me wondering if I didn’t set up the model or priors correctly. Or the original data is just not that great.

My notebook for this is on github here.

Raw data is here.

That’s why I was suggesting simulating data from the model, then fitting it. Then the y_pred should look right. It’s the only way to separate model misspecification from implementation issues.

1 Like

Oh, sorry. I misunderstood. Let me try that.

Thanks!
ara

Just to make sure I am doing this right. I hit up the Stan manual 2.16 and on page 160 under Posterior Predictive Checks.

I set up in my model:

generated quantities {
vector[N] y_tilde;
for (n in 1:N)
y_tilde[n] = normal_rng(alpha + beta * x[n], sigma);
}

EDIT: Answered my own question about the tilde.

So I ended up with:

Random data

x_tilde = random.sample(range(50, 350), 100)
N_tilde = len(x_tilde)

np.mean(x_tilde, axis=0)
192.27000000000001

Dictionary for Stan

Make a dictionary containing all data to be passed to STAN
regress_sim_dat = {'x': x,
 'y': y,
 'N': N,
 'x_tilde':x_tilde,
 'N_tilde':N_tilde}

The model, priors, and generated data.

 regress_sim = """
 data {
   int<lower=0> N;
   vector[N] y;
   vector[N] x;
   int<lower=0> N_tilde;
   vector[N_tilde] x_tilde;
 }
 parameters {
   real alpha;
   real beta;
   real<lower=0> sigma;
 }
 model {
   alpha ~ normal(0,10);    
   beta ~ cauchy(0,5);
   sigma ~ cauchy(0,5);
 
   for (n in 1:N)
     y[n] ~ normal(alpha + beta * x[n], sigma);
 }
 generated quantities {
    vector[N_tilde] y_tilde;
       
    for (n in 1:N_tilde)
      y_tilde[n] = normal_rng(alpha + beta * x_tilde[n], sigma);
 }
 """

EDIT: Cleaned up the code blocks with ```

Yes, that looks right. To set off code, you want triple back ticks on their own line before and after:

```
your code
```

Ok, cool. I think I understand all this. The y_tilde range with the model above gives a y_tilde mean between ~0.4 to ~2.8 . There are still negative numbers in the y_tilde traceplot.