# Help understanding code and generating estimates

In my move toward Bayesian models, I began with the fantastic `brms` package. I would like to better understand and control what I'm doing, so I have started trying to work in "pure" Stan. To learn, I have been generating Stan code with brms and then trying to tweak things from there.

With a recent hierarchical Bernoulli model, I've hit some roadblocks and have outlined a couple of questions in this thread.

Before trying to tackle what I think is combining parameters with indexes, it seems I need a better understanding of how the Stan model is coded.

The model is:

```stan
data {
  int<lower=1> N;  // number of observations
  int Y[N];  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  // data for group-level effects of ID 1
  int<lower=1> N_1;  // number of grouping levels
  int<lower=1> M_1;  // number of coefficients per level
  int<lower=1> J_1[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_1_1;
  // data for group-level effects of ID 2
  int<lower=1> N_2;  // number of grouping levels
  int<lower=1> M_2;  // number of coefficients per level
  int<lower=1> J_2[N];  // grouping indicator per observation
  // group-level predictor values
  vector[N] Z_2_1;
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  vector<lower=0>[M_1] sd_1;  // group-level standard deviations
  vector[N_1] z_1[M_1];  // standardized group-level effects
  vector<lower=0>[M_2] sd_2;  // group-level standard deviations
  vector[N_2] z_2[M_2];  // standardized group-level effects
}
transformed parameters {
  vector[N_1] r_1_1;  // actual group-level effects
  vector[N_2] r_2_1;  // actual group-level effects
  r_1_1 = (sd_1[1] * (z_1[1]));
  r_2_1 = (sd_2[1] * (z_2[1]));
}
model {
  // initialize linear predictor term
  vector[N] mu = Intercept + Xc * b;
  for (n in 1:N) {
    // add more terms to the linear predictor
    mu[n] += r_1_1[J_1[n]] * Z_1_1[n] + r_2_1[J_2[n]] * Z_2_1[n];
  }
  // priors including all constants
  target += normal_lpdf(b | 0, 1);
  target += normal_lpdf(Intercept | 0, 1);
  target += cauchy_lpdf(sd_1[1] | 0, 1)
    - 1 * cauchy_lccdf(0 | 0, 1);
  target += std_normal_lpdf(z_1[1]);
  target += cauchy_lpdf(sd_2[1] | 0, 1)
    - 1 * cauchy_lccdf(0 | 0, 1);
  target += std_normal_lpdf(z_2[1]);
  // likelihood including all constants
  if (!prior_only) {
    target += bernoulli_logit_lpmf(Y | mu);
  }
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // additionally draw samples from priors
  real prior_b = normal_rng(0, 1);
  real prior_Intercept = normal_rng(0, 1);
  real prior_sd_1_1 = cauchy_rng(0, 1);
  real prior_sd_2_1 = cauchy_rng(0, 1);
  // use rejection sampling for truncated priors
  while (prior_sd_1_1 < 0) {
    prior_sd_1_1 = cauchy_rng(0, 1);
  }
  while (prior_sd_2_1 < 0) {
    prior_sd_2_1 = cauchy_rng(0, 1);
  }
}
```

I’m confused by the parameters block and I imagine it’s a basic concept I’m lacking.

`sd_1`, `sd_2`, `z_1`, and `z_2` are defined, but I don't see where they are calculated. Is `sd_` a shortcut/shorthand for calculating a standard deviation? Is `z_` a shortcut for scaling and centering?

Yeah, this is a pretty common conceptual hurdle for a lot of folks, myself included when I was first learning Bayes and Stan. Any variable declared in the `parameters` block is never actually "calculated" per se in a model. The sampler guesses values for each parameter, and there's a complicated process by which it keeps guessing and eventually yields a posterior distribution from the record of guesses it makes. Each time a guess is made for all the parameters, those values are used in the `transformed parameters` and `model` blocks according to whatever code you have there. So for `sd_1` and `sd_2`, whatever guesses are made for these parameters are used to define the transformed parameters `r_1_1` and `r_2_1`, and `sd_1` and `sd_2` are also used in the `model` block to increment the special `target` variable.

Mathematically, incrementing an accumulator variable with the log-density of a parameter is how the idea of a prior is actually operationalized* (Bayes' theorem makes the posterior proportional to the product of the prior and the likelihood, so on the log scale it's a sum). For a given set of guesses for all the parameters, Stan churns through all the computations in the `transformed parameters` and `model` blocks, yielding a final value for `target`, which it then uses to shape the kinds of guesses it makes next. (There's tons of super cool stuff that actually governs that process, including automatic differentiation, Hamiltonian dynamics, etc., that I personally only have a vague understanding of.)
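To make that concrete, here's a minimal Python sketch (not Stan, and everything in it is made up for illustration: a toy intercept-plus-slope Bernoulli model with three observations) of what one evaluation of the `model` block amounts to for a single fixed guess of the parameters:

```python
import math

def normal_lpdf(x, mu, sigma):
    # log density of Normal(mu, sigma) evaluated at x
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2

def bernoulli_logit_lpmf(y, eta):
    # log P(Y = y) for a Bernoulli with logit-scale parameter eta
    return -math.log1p(math.exp(-eta)) if y == 1 else -math.log1p(math.exp(eta))

# one "guess" for the parameters, plus some toy data
b, intercept = 0.3, -0.2
y = [1, 0, 1]
x = [0.5, -1.0, 2.0]

target = 0.0
target += normal_lpdf(b, 0, 1)          # prior on b
target += normal_lpdf(intercept, 0, 1)  # prior on Intercept
for yi, xi in zip(y, x):                # likelihood, one term per observation
    target += bernoulli_logit_lpmf(yi, intercept + b * xi)

# target now holds the log of (prior * likelihood), up to a constant,
# for this particular guess of b and intercept
```

The real sampler repeats this kind of evaluation many times (with gradients), but each individual evaluation is just a running log-density sum like this.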

*-> bonus points for recognizing that, with this formulation, there's no mathematical difference between expressing a prior and a likelihood; both simply increment an accumulator by a log-density. Arguably a distinction can be made in that the log-densities we typically call the "likelihood" part of the model are conditional on data while those we typically call the "prior" are not, but I've heard Michael Betancourt talk as if even this distinction is arguably not generalizable, or at least not useful.


That helps a lot!

And reminds me of the relevant sections of the books I’ve been reading.

I think some of my confusion comes from the terminology used in the code comments, but I also realize auto-generated code and comments have to be general.

It looks like `z_1` and `z_2` are guesses at the group-level effects. In this particular case, I think `z_1` is my random effect of condition, which has `z_2` (Group) nested within it. The term "standardized" threw me off because I'm used to "standardized" meaning something has been scaled, and using "z" for these parameters had me thinking along the lines of a z-transformation. But it looks more like a guess of the coefficient (in this case there's only one, the intercept) for each level.

You had it correct! `z_1` is indeed a standardized coefficient, so it’s multiplied by `sd_1` to un-standardize it and yield `r_1_1`, the unstandardized coefficient, which is necessary for expressing the likelihood part of the model.

Ahhh, now I see it: the priors for `z_1` and `z_2` have `target += std_normal_lpdf(...)`.

Actually, the priors for the standardized coefficients could be anything; those particular priors just happen to imply the prior `r_1_1 ~ normal(0, sd_1)`, which is a decent weakly-informative default.
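If it helps, you can convince yourself of that implied prior with a quick simulation. This is Python rather than Stan, and `sd_1` is fixed at an arbitrary 2.5 here instead of being sampled:

```python
import random
import statistics

random.seed(1)

sd_1 = 2.5  # pretend this is the current draw of sd_1
z_1 = [random.gauss(0, 1) for _ in range(100_000)]  # z_1 ~ std_normal()
r_1_1 = [sd_1 * z for z in z_1]                     # r_1_1 = sd_1 * z_1

# the implied distribution of r_1_1 is normal(0, sd_1):
print(statistics.mean(r_1_1), statistics.stdev(r_1_1))  # close to 0 and 2.5
```

Scaling a standard normal by a constant just rescales its standard deviation, which is exactly what the `r_1_1 = sd_1[1] * z_1[1]` line in the `transformed parameters` block exploits.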

That leads to my next silly question: what makes the coefficients standardized?

Think about it this way. If you had the unstandardized coefficient and wanted the standardized coefficient, you would divide by the standard deviation. So a coefficient that, when multiplied by the standard deviation, yields the unstandardized coefficient is a standardized coefficient.
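In code form (Python, with hypothetical numbers), standardizing and un-standardizing are just inverses of each other:

```python
sd_1 = 0.8    # group-level standard deviation (hypothetical value)
r_1_1 = 1.52  # unstandardized group-level effect (hypothetical value)

z_1 = r_1_1 / sd_1  # standardize: divide by the sd
assert abs(sd_1 * z_1 - r_1_1) < 1e-12  # multiplying back recovers r_1_1
```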

How do they get standardized (divided by sd) in the first place? Is that just a consequence of how the sampling works?

I guess with population-level parameters there is no group variance/sd involved, so the model estimates each group-level parameter by multiplying the standardized parameter by the estimated group-level sd.