Overcoming Simpson's paradox in a multivariate linear model with within- and between-site gradients and repeated measurements

Hi @jsocolar, I am finally fully testing this relatively simple version as a step towards building an even more complex model, and I was wondering whether my calculations of y_rep and log_lik are correct.

data {
  int<lower=0> N;                            // number of observations
  int<lower=0> N_site;                       // number of sites
  array[N] int<lower=1, upper=N_site> site;  // site index for each observation
  vector[N] y;                               // response
  vector[N] x1;                              // between-site gradient
  vector[N] x2;                              // within-site gradient
}

parameters {
  real alpha1;                   // intercept of the site-level regression
  real beta1;                    // between-site slope (effect of x1 on site means)
  real<lower=0> sigma_alphaS;    // sd of site intercepts around that regression
  real beta2;                    // within-site slope (effect of x2)
  vector[N_site] alphaS;         // site-level intercepts
  real<lower=0> sigma;           // residual (observation-level) sd
}

model {
  alpha1 ~ std_normal();
  beta1 ~ std_normal();
  sigma_alphaS ~ exponential(1);
  beta2 ~ std_normal();
  // site-level intercepts centered on the between-site regression on x1
  alphaS[site] ~ normal(alpha1 + beta1 * x1, sigma_alphaS);
  sigma ~ exponential(1);
  // observation-level likelihood with the within-site gradient x2
  y ~ normal(alphaS[site] + beta2 * x2, sigma);
}

generated quantities {
  
  // posterior predictive distribution for replications y_rep of the original data set y given model parameters
  array[N] real y_rep = normal_rng(alphaS[site] + beta2 * x2, sigma);
  
  // pointwise log-likelihood
  vector[N] log_lik;
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y[i] | alphaS[site[i]] + beta2 * x2[i], sigma);
  }
  
}

Should log_lik instead be something like the following, i.e. marginalizing over the site-level intercepts rather than conditioning on them? I got this idea reading this post.

  // marginal likelihood: alphaS integrated out, combining both variance components
  for (i in 1:N) {
    log_lik[i] = normal_lpdf(y[i] | alpha1 + beta1 * x1[i] + beta2 * x2[i], sqrt(sigma^2 + sigma_alphaS^2));
  }
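
And if the marginal form is the right one, I guess y_rep should then be drawn from the same marginal distribution, so that the replications match the likelihood used for log_lik. Here is a minimal sketch of what I have in mind (the names y_rep_marg, log_lik_marg, mu_i and sd_marg are just placeholders I made up):

generated quantities {
  array[N] real y_rep_marg;
  vector[N] log_lik_marg;
  for (i in 1:N) {
    // population-level mean: alphaS is integrated out, so only alpha1 + beta1 * x1 remains
    real mu_i = alpha1 + beta1 * x1[i] + beta2 * x2[i];
    // combined sd of the normal-normal model after integrating out alphaS
    real sd_marg = sqrt(square(sigma) + square(sigma_alphaS));
    y_rep_marg[i] = normal_rng(mu_i, sd_marg);
    log_lik_marg[i] = normal_lpdf(y[i] | mu_i, sd_marg);
  }
}

Or should y_rep stay conditional on the fitted alphaS while only log_lik is marginalized?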