Three different ways of trying to fix a parameter value in Stan

Suppose I had a linear model: y_i \sim N(\alpha + \beta x_i, \sigma_y). I want to fix the intercept \alpha to 5, so that E(y \mid x=0) = 5 while \sigma_y remains unknown. Below I show 3 different ways one might think about doing this.

  • Approach 1 enters 5 as a value directly in the likelihood (model block).
  • Approach 2 declares a transformed parameter, assigns it the value 5 in the transformed parameters block, and then includes that parameter in the likelihood.
  • Approach 3 treats the value as transformed data and adds it to the likelihood.

Are the 3 approaches below equivalent (and correct in not requiring Jacobians)? Is one of them preferred to the others computationally?

Approach 1: Included directly in the likelihood (model block)

data{
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

parameters{
  real beta;
  real<lower=0> sigma;
}

model{
  y ~ normal(5 + beta*x, sigma); // hard-coded, where alpha = 5
  sigma ~ cauchy(0, 1);
  beta ~ normal(0, 1);
}

Approach 2: Using transformed parameters block

data{
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

parameters{
  real beta;
  real<lower=0> sigma;
}

transformed parameters{
  real alpha = 5; // declare and assign fixed value
}

model{
  y ~ normal(alpha + beta*x, sigma); // includes alpha
  sigma ~ cauchy(0, 1);
  beta ~ normal(0, 1);
}

Approach 3: Using transformed data block

data{
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

transformed data{
  real alpha = 5; // declare and assign fixed value
}

parameters{
  real beta;
  real<lower=0> sigma;
}

model{
  y ~ normal(alpha + beta*x, sigma); // includes alpha
  sigma ~ cauchy(0, 1);
  beta ~ normal(0, 1);
}

How brms does it:

Based on a test of how brms implements its “constant” prior to fix coefficients, it looks like it uses Approach 2. For example, the code below is the underlying Stan code generated for a brms model y \sim N(\alpha + \beta_1 x + \beta_2 z + \beta_3 w, \sigma) with “constant” priors on the coefficients:

functions {
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
}
parameters {
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
}
transformed parameters {
  vector[Kc] b;  // population-level effects
  real lprior = 0;  // prior contributions to the log posterior
  b[1] = 0;
  b[2] = 1;
  b[3] = 2;
  lprior += student_t_lpdf(Intercept | 3, 0, 2.5);
  lprior += student_t_lpdf(sigma | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
}
model {
  // likelihood including constants
  if (!prior_only) {
    target += normal_id_glm_lpdf(Y | Xc, Intercept, b, sigma);
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
}

Here’s the R code to reproduce this:

library(brms)

# Example data
N <- 1000
df <- data.frame(
  y = rnorm(N),
  x = sample(1:10, size = N, replace = T),
  z = sample(1:2, size = N, replace = T),
  w = runif(N)
)

# User-specified constants as priors
ex_priors <- c( prior(constant(0), class = "b", coef = "x"),
  prior(constant(1), class = "b", coef = "z"),
  prior(constant(2), class = "b", coef = "w") )

# Fit model (ignore warnings, just an example)
mod <- brms::brm(
  formula = y ~ x + z + w,
  data = df, iter = 10, chains = 1,
  prior = ex_priors)

# Check priors
prior_summary(mod)

# Inspect Stan code
stancode(mod)

While all three approaches imply an identical likelihood, Approaches 1 and 3 are the most computationally efficient. Anything constructed in the transformed parameters block is recomputed at every iteration (and saved in the output), whereas transformed data is computed only once when the model is instantiated. None of the approaches needs a Jacobian adjustment, since alpha is simply a constant: no distribution is being placed on a nonlinear transform of a parameter.

Note that brms is coded to allow for maximum flexibility in constructing models, rather than maximum performance, so it’s not always the best guide for maximally optimised model construction.
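
If you want the fixed intercept to show up in the posterior output without paying the transformed-parameters cost, one option is to keep it in transformed data and echo it from generated quantities, which is evaluated only once per saved draw. Here is a minimal sketch of the same toy model as above (the name alpha_out is just illustrative):

data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}

transformed data {
  real alpha = 5;  // fixed value, computed once
}

parameters {
  real beta;
  real<lower=0> sigma;
}

model {
  y ~ normal(alpha + beta * x, sigma);
  sigma ~ cauchy(0, 1);
  beta ~ normal(0, 1);
}

generated quantities {
  real alpha_out = alpha;  // record the constant alongside the draws
}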
