Bayesian multi-level (partial pooling) model -- multiple likelihood statements

From a tutorial a came across, the author has two likelihood statements in the model block; I’m curious if this is (1) incorrect, (2) works but not the cleanest/best way to specify a model or (3) the right way to do things in Stan.

The Stan code:

data {
  int n; // number of observations
  int n_county; // number of counties
  vector[n] log_radon;
  vector[n] vfloor;
  int<lower = 0, upper = n_county> county[n];  
}

parameters {
  vector[n_county] alpha; // vector of county intercepts
  real beta; // slope parameter
  real<lower = 0> sigma_a; // variance of counties
  real<lower = 0> sigma_y; // model residual variance
  real mu_a; // mean of counties
}

model {
  // conditional mean
  vector[n] mu;

  // linear combination
  mu = alpha[county] + beta * vfloor;

  // priors
  beta ~ normal(0, 1);

  // hyper-priors
  mu_a ~ normal(0, 1);
  sigma_a ~ cauchy(0, 2.5);
  sigma_y ~ cauchy(0, 2.5);

  // level-2 likelihood
  alpha ~ normal(mu_a, sigma_a);

  // level-1 likelihood
  log_radon ~ normal(mu, sigma_y);
}

generated quantities {
  vector[n] log_lik; // calculate log-likelihood
  vector[n] y_rep; // replications from posterior predictive distribution

  for (i in 1:n) {
    // generate mpg predicted value
    real log_radon_hat = alpha[county[i]] + beta * vfloor[i];

    // calculate log-likelihood
    log_lik[i] = normal_lpdf(log_radon[i] | log_radon_hat, sigma_y);
    // normal_lpdf is the log of the normal probability density function

    // generate replication values
    y_rep[i] = normal_rng(log_radon_hat, sigma_y);
    // normal_rng generates random numbers from a normal distribution
  }
}

I’m also unclear on why the author is defining log likelihood statements in the generating quantities portion. I had believed this section is primarily deterministic operations on the inferred variables and isn’t, itself, “part of the model”. Is this correct?

Hello @Jacob_Moore. I saw the tutorial you have as a basis.
I will try to explain your doubts from my perspective. ok.
If I’m not mistaken, the simplified model that is in the Stan code provided is:

y \sim N(\mu_y \; , \; \sigma_{y})
\mu_y = \alpha_i + \beta \cdot x
\alpha_i \sim N(\mu_{\alpha} \; , \; \sigma_{\alpha})

In my view, two likelihoods are not declared in the model block. Only one model is declared, which is the one shown above. Where y is the dependent variable with normal distribution, \mu is the linear model, x the predictor, \sigma_y the variance of y; \alpha_i and \beta the model parameters; i is the index for the countries in the model. A particularity is the \alpha_i parameter. It belongs to another normal probability distribution in which it has a mean \mu_{\alpha} and variance \sigma_{\alpha}. This implies that there is some between-countries variability in the individual-level intercept. This simply means that each country will provide a different intercept from a normal probability distribution.

From a tutorial a came across, the author has two likelihood statements in the model block; I’m curious if this is (1) incorrect, (2) works but not the cleanest/best way to specify a model or (3) the right way to do things in Stan.

Answering your question: (1) I believe it is not incorrect. (2) Because it is a simple model, I believe it is a reasonable way to write the stan model. (3) I believe the Stan team can answer this question if there is a better way to write this model in stan.

I’m also unclear on why the author is defining log likelihood statements in the generating quantities portion. I had believed this section is primarily deterministic operations on the inferred variables and isn’t, itself, “part of the model”. Is this correct?

Defining the log-likelihood in the generating quantities block is a common practice in the Stan language. It is very useful in many ways, mainly for model diagnostics with the loo package. You can read here that:

“The generated quantities program block is rather different than the other blocks. Nothing in the generated quantities block affects the sampled parameter values. The block is executed only after a sample has been generated.”

I really like Bruno Nicenboim, Daniel Schad, and Shravan Vasishth’s book (An Introduction to Bayesian Data Analysis for Cognitive Science). The example in chapter 5.2 will help you a lot. I highly recommend reading this chapter in your studies.

Hope this helps.

2 Likes

this is pretty standard.

the generated quantities block is being used to report the log likelihood, based on the estimated parameters for that draw.

and y_rep is giving you the posterior predictive distribution: 27.5 Posterior prediction for regressions | Stan User’s Guide

here’s another version of this tutorial - doing pretty much the same thing - https://mc-stan.org/users/documentation/case-studies/radon.html