Bayesian multi-level (partial pooling) model -- multiple likelihood statements

From a tutorial I came across, the author has two likelihood statements in the model block; I’m curious whether this is (1) incorrect, (2) works but is not the cleanest/best way to specify a model, or (3) the right way to do things in Stan.

The Stan code:

data {
  int n; // number of observations
  int n_county; // number of counties
  vector[n] log_radon;
  vector[n] vfloor;
  array[n] int<lower = 1, upper = n_county> county; // 1-based county index per observation (array syntax, Stan 2.26+)
}

parameters {
  vector[n_county] alpha; // vector of county intercepts
  real beta; // slope parameter
  real<lower = 0> sigma_a; // standard deviation of county intercepts
  real<lower = 0> sigma_y; // residual standard deviation
  real mu_a; // mean of counties
}

model {
  // conditional mean
  vector[n] mu;

  // linear combination
  mu = alpha[county] + beta * vfloor;

  // priors
  beta ~ normal(0, 1);

  // hyper-priors
  mu_a ~ normal(0, 1);
  sigma_a ~ cauchy(0, 2.5);
  sigma_y ~ cauchy(0, 2.5);

  // level-2 likelihood
  alpha ~ normal(mu_a, sigma_a);

  // level-1 likelihood
  log_radon ~ normal(mu, sigma_y);
}

generated quantities {
  vector[n] log_lik; // calculate log-likelihood
  vector[n] y_rep; // replications from posterior predictive distribution

  for (i in 1:n) {
    // linear predictor (predicted log radon) for observation i
    real log_radon_hat = alpha[county[i]] + beta * vfloor[i];

    // calculate log-likelihood
    log_lik[i] = normal_lpdf(log_radon[i] | log_radon_hat, sigma_y);
    // normal_lpdf is the log of the normal probability density function

    // generate replication values
    y_rep[i] = normal_rng(log_radon_hat, sigma_y);
    // normal_rng generates random numbers from a normal distribution
  }
}

I’m also unclear on why the author is defining log-likelihood statements in the generated quantities block. I had believed this block is primarily for deterministic operations on the inferred variables and isn’t, itself, “part of the model”. Is this correct?

Hello @Jacob_Moore. I looked at the tutorial you are working from.
I’ll try to address your questions from my perspective.
If I’m not mistaken, the simplified model in the Stan code provided is:

y \sim N(\mu_y \; , \; \sigma_{y})
\mu_y = \alpha_i + \beta \cdot x
\alpha_i \sim N(\mu_{\alpha} \; , \; \sigma_{\alpha})

In my view, the model block does not declare two likelihoods. It declares a single model, the one shown above. Here y is the dependent variable (normally distributed), \mu_y is the linear predictor, x is the predictor, and \sigma_y is the residual standard deviation of y (Stan’s normal distribution is parameterized by a standard deviation, not a variance); \alpha_i and \beta are the model parameters, and i indexes the counties. The \alpha_i parameter is the special part: it is itself drawn from a normal distribution with mean \mu_{\alpha} and standard deviation \sigma_{\alpha}. This means the individual-level intercept varies between counties: each county gets its own intercept, drawn from that common normal distribution.
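One way to see this is to write out the joint posterior that the model block implies. The following is only a sketch, using notation that is not in the original post: j indexes observations, c[j] is the county of observation j (the `county` array), J is the number of counties (`n_county`), and x_j is `vfloor`.

```latex
\begin{aligned}
p(\alpha, \beta, \mu_\alpha, \sigma_\alpha, \sigma_y \mid y)
  \;\propto\;& \underbrace{\prod_{j=1}^{n} N\!\left(y_j \mid \alpha_{c[j]} + \beta x_j,\; \sigma_y\right)}_{\text{likelihood (involves the data)}} \\
  &\times \underbrace{\prod_{i=1}^{J} N\!\left(\alpha_i \mid \mu_\alpha,\; \sigma_\alpha\right)}_{\text{hierarchical prior on the intercepts}} \\
  &\times \underbrace{p(\beta)\, p(\mu_\alpha)\, p(\sigma_\alpha)\, p(\sigma_y)}_{\text{priors and hyper-priors}}
\end{aligned}
```

Only the first factor involves the observed y, which is why it is usually the one called the likelihood. The factor for \alpha_i acts as a prior whose own parameters (\mu_{\alpha}, \sigma_{\alpha}) are estimated along with everything else.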

From a tutorial I came across, the author has two likelihood statements in the model block; I’m curious whether this is (1) incorrect, (2) works but is not the cleanest/best way to specify a model, or (3) the right way to do things in Stan.

To answer your questions: (1) I don’t believe it is incorrect. (2) For a model this simple, I think it is a reasonable way to write the Stan program. (3) Whether there is a better way to write this model in Stan is something the Stan team could answer.
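As far as I understand Stan’s semantics, every `~` statement simply adds a term to the log of the joint density, so Stan itself does not distinguish “likelihood” statements from “prior” statements. The sketch below rewrites the same program with explicit `target +=` increments to make that visible. It is not taken from the tutorial; it assumes the same data and parameters as the code in the question, uses the newer `array[...]` declaration syntax (Stan 2.26+), and omits the generated quantities block. The two forms define the same posterior; the `~` form merely drops additive constants.

```stan
data {
  int n;                                             // number of observations
  int n_county;                                      // number of counties
  vector[n] log_radon;
  vector[n] vfloor;
  array[n] int<lower = 1, upper = n_county> county;  // 1-based county index
}

parameters {
  vector[n_county] alpha;   // county intercepts
  real beta;                // slope
  real<lower = 0> sigma_a;  // sd of county intercepts
  real<lower = 0> sigma_y;  // residual sd
  real mu_a;                // mean of county intercepts
}

model {
  // linear predictor for each observation
  vector[n] mu = alpha[county] + beta * vfloor;

  // priors and hyper-priors
  target += normal_lpdf(beta | 0, 1);
  target += normal_lpdf(mu_a | 0, 1);
  target += cauchy_lpdf(sigma_a | 0, 2.5);
  target += cauchy_lpdf(sigma_y | 0, 2.5);

  // hierarchical prior on the intercepts ("level-2 likelihood" in the tutorial)
  target += normal_lpdf(alpha | mu_a, sigma_a);

  // likelihood of the observed data ("level-1 likelihood" in the tutorial)
  target += normal_lpdf(log_radon | mu, sigma_y);
}
```

Written this way, the order of the statements does not matter, and having two (or more) sampling statements in one model block is just a sum of log-density terms.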

I’m also unclear on why the author is defining log-likelihood statements in the generated quantities block. I had believed this block is primarily for deterministic operations on the inferred variables and isn’t, itself, “part of the model”. Is this correct?

Defining the log-likelihood in the generated quantities block is common practice in Stan. It is useful in several ways, mainly for model comparison and diagnostics with the loo package. You can read here that:

“The generated quantities program block is rather different than the other blocks. Nothing in the generated quantities block affects the sampled parameter values. The block is executed only after a sample has been generated.”

I really like Bruno Nicenboim, Daniel Schad, and Shravan Vasishth’s book (An Introduction to Bayesian Data Analysis for Cognitive Science). The example in chapter 5.2 will help you a lot. I highly recommend reading this chapter in your studies.

Hope this helps.

This is pretty standard.

The generated quantities block is being used to report the log-likelihood of each observation, evaluated at the parameter values of that draw.

And y_rep gives you draws from the posterior predictive distribution: 27.5 Posterior prediction for regressions | Stan User’s Guide
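Written out per posterior draw, the two generated quantities amount to the following. This is a sketch in notation introduced here for illustration: the superscript (s) marks the s-th draw, c[i] is the county of observation i, and x_i is `vfloor[i]`.

```latex
\begin{aligned}
\texttt{log\_lik}_i^{(s)} &= \log N\!\left(y_i \mid \alpha^{(s)}_{c[i]} + \beta^{(s)} x_i,\; \sigma_y^{(s)}\right), \\
y^{\mathrm{rep},(s)}_i &\sim N\!\left(\alpha^{(s)}_{c[i]} + \beta^{(s)} x_i,\; \sigma_y^{(s)}\right).
\end{aligned}
```

Collected over all draws, log_lik forms a draws-by-observations matrix, which is the kind of input the loo package works from. Neither quantity feeds back into the sampled parameter values.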

Here’s another version of this tutorial, doing pretty much the same thing: https://mc-stan.org/users/documentation/case-studies/radon.html