Help understanding code and generating estimates

In my move toward Bayesian models, I began with the fantastic brms package. I would like to better understand and control what I'm doing, so I have started trying to work in "pure" Stan. To learn, I have been generating Stan code with brms and then trying to tweak things from there.

With a recent hierarchical Bernoulli model, I've hit some roadblocks and outlined a couple of questions in this thread.

Before tackling what I think amounts to combining parameters with indexes, it seems I need a better understanding of how the Stan model is coded.

The model is:

data {
  int<lower=1> N;  // number of observations
  int Y[N];  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  vector[N_1] z_1[M_1];  // standardized group-level effects
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  vector[N_2] z_2[M_2];  // standardized group-level effects
}
transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects
  vector[N_2] r_2_1;  // actual group-level effects
  r_1_1 = (sd_1[1] * (z_1[1]));
  r_2_1 = (sd_2[1] * (z_2[1]));
}
model {
  // initialize linear predictor term
  vector[N] mu = Intercept + Xc * b;
  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_2_1[J_2[n]] * Z_2_1[n];
  }
  // priors including all constants
  target += normal_lpdf(b | 0, 1);
  target += normal_lpdf(Intercept | 0, 1);
  target += cauchy_lpdf(sd_1[1] | 0, 1)
    - 1 * cauchy_lccdf(0 | 0, 1);
  target += std_normal_lpdf(z_1[1]);
  target += cauchy_lpdf(sd_2[1] | 0, 1)
    - 1 * cauchy_lccdf(0 | 0, 1);
  target += std_normal_lpdf(z_2[1]);
  // likelihood including all constants
  if (!prior_only) {
    target += bernoulli_logit_lpmf(Y | mu);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // additionally draw samples from priors
  real prior_b = normal_rng(0, 1);
  real prior_Intercept = normal_rng(0, 1);
  real prior_sd_1_1 = cauchy_rng(0, 1);
  real prior_sd_2_1 = cauchy_rng(0, 1);
  // use rejection sampling for truncated priors
  while (prior_sd_1_1 < 0) {
    prior_sd_1_1 = cauchy_rng(0, 1);
  }
  while (prior_sd_2_1 < 0) {
    prior_sd_2_1 = cauchy_rng(0, 1);
  }
}

I’m confused by the parameters block and I imagine it’s a basic concept I’m lacking.

sd_1, sd_2, z_1, and z_2 are declared, but I don't see where they are calculated. Is sd_ a shortcut/shorthand for calculating a standard deviation? Is z_ a shortcut for scaling and centering?

Yeah, this is a pretty common conceptual hurdle for a lot of folks, myself included when I was first learning Bayes and Stan. Any variable declared in the parameters block is never actually "calculated" per se in a model. The sampler guesses values for each parameter, and there's a complicated process by which it keeps guessing and eventually yields a posterior distribution from the record of guesses it makes. Each time a guess is made for all the parameters, those values are used in the transformed parameters and model blocks according to whatever code you have there. So for sd_1 and sd_2, whatever guesses are made for these parameters are used to define the parameters r_1_1 and r_2_1, and sd_1 and sd_2 are also used in the model block to increment the special target variable.

Mathematically, incrementing an accumulator variable with the log-density of a parameter is how the idea of a prior is actually operationalized* (Bayes' theorem multiplies the prior by the likelihood, so on the log scale it's a sum). For a given set of guesses for all the parameters, Stan churns through all the computations in the transformed parameters and model blocks, yielding a final value for target, which it then uses to shape the kinds of guesses it makes next. (There's tons of super cool stuff that actually governs that process, including auto-differentiation, Hamiltonian dynamics, etc., that I personally only have a vague understanding of.)

*-> bonus points for recognizing that with this formulation, there’s no mathematical difference between expressing a prior and a likelihood; both are simply incrementing an accumulator by a log-density. Arguably a distinction can be made in that the log-densities we typically call the “likelihood” part of the model are conditional on data while the log-densities we typically call the “prior” are not, but I’ve heard Michael Betancourt talk about things as if even this distinction can be argued as not generalizable or at least not useful.
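If it helps, the bookkeeping Stan does for a single guess can be sketched in a few lines of Python. Everything here is made up for illustration: the toy data, the single group-level term, and a plain normal stand-in for the truncated cauchy prior on the sd. The point is only that priors and likelihood are the same kind of operation, increments to one running log-density:

```python
import math

# Hypothetical toy data: 4 binary observations in 2 groups (not from the thread)
Y = [1, 0, 1, 1]
J = [1, 1, 2, 2]  # grouping indicator per observation, 1-based like Stan's J_1

def normal_lpdf(x, mu, sigma):
    # log density of a normal distribution at x
    return -math.log(sigma) - 0.5 * math.log(2 * math.pi) - 0.5 * ((x - mu) / sigma) ** 2

def bernoulli_logit_lpmf(y, mu):
    # log P(y) when P(y = 1) = inverse_logit(mu)
    return -math.log1p(math.exp(-mu)) if y == 1 else -math.log1p(math.exp(mu))

def log_target(intercept, sd_1, z_1):
    """One guess for all the parameters -> one value of target."""
    target = 0.0
    # priors: increment target by each parameter's log density
    target += normal_lpdf(intercept, 0, 1)
    target += normal_lpdf(sd_1, 0, 1)  # stand-in for the truncated cauchy prior
    target += sum(normal_lpdf(z, 0, 1) for z in z_1)
    # transformed parameters: un-standardize the group-level effects
    r_1 = [sd_1 * z for z in z_1]
    # likelihood: also just log-density increments, conditional on data
    for y, j in zip(Y, J):
        target += bernoulli_logit_lpmf(y, intercept + r_1[j - 1])
    return target

print(log_target(intercept=0.0, sd_1=0.5, z_1=[0.3, -0.2]))
```

Stan evaluates something like log_target for each guess (plus its gradient, via auto-differentiation) and uses the result to decide where to guess next.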


That helps a lot!

And reminds me of the relevant sections of the books I’ve been reading.

I think some of my confusion comes from the terminology used on the code comments, but I also realize auto-generated code and comments have to be general.

It looks like z_1 and z_2 are guesses at the group-level effects. In this particular case, I think z_1 is my random effect of condition, which has z_2 (Group) nested within it. The term "standardized" threw me off because I'm used to "standardized" meaning something has been scaled, and using "z" for these parameters had me thinking along the lines of a z transformation. But it looks more like a guess at the coefficients (in this case there's only one, the intercept) for each level.

You had it correct! z_1 is indeed a standardized coefficient, so it’s multiplied by sd_1 to un-standardize it and yield r_1_1, the unstandardized coefficient, which is necessary for expressing the likelihood part of the model.

Ahhh, now I see it: the priors for z_1 and z_2 use target += std_normal_lpdf.

Actually, the priors for the standardized coefficients could be anything; those particular priors just happen to imply the priors r_1_1 ~ normal(0, sd_1), which is a decent weakly-informative default.
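That implication is easy to check numerically. Here's a quick Monte Carlo sketch; the value of sd_1 and the number of draws are arbitrary, chosen only for illustration. Scaling std_normal draws of z by sd_1, exactly as the transformed parameters block does, yields draws that behave like normal(0, sd_1):

```python
import random
import statistics

random.seed(1)

sd_1 = 0.7       # a fixed group-level sd, purely for illustration
n_draws = 100_000

# draws of the standardized effect, z_1 ~ std_normal()
z = [random.gauss(0, 1) for _ in range(n_draws)]

# un-standardize the same way the model does: r_1_1 = sd_1 * z_1
r = [sd_1 * zi for zi in z]

# the implied prior on r is normal(0, sd_1): mean near 0, sd near sd_1
print(statistics.mean(r), statistics.stdev(r))
```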

That leads to my next silly question: what makes the coefficients standardized?

Think about it this way: if you had the unstandardized coefficient and wanted the standardized coefficient, you would divide by the standard deviation. So a coefficient that, when multiplied by the standard deviation, yields the unstandardized coefficient is a standardized coefficient.

How do they get standardized (divided by sd) in the first place? Is that just a consequence of how the sampling works?

I guess with population-level parameters there is no group variance/sd, whereas here the model samples the standardized group-level parameters directly and builds the actual group-level effects by multiplying them by the estimated sd.