What does the log lower bound transform (Stan manual section 10.2) mean for lower bounded parameters?

Hello,

I am trying to understand the implicit transformation that Stan applies when a lower bound is declared on a parameter. Section 10.2 of the reference manual (the lower bounded scalar section) says that when a variable X is declared with lower bound a, it is transformed as Y = log(X - a). But what does this implicit transformation mean, and how does it not create NaN errors, since log(x) is undefined for x ≤ 0? I’m trying to work out how I could implement this explicitly in another language to understand it better.

Thanks

The transform Y = log(X - a) is only used for initialization and data input, where there is already a check in place that X > a.

The transform used during sampling is the inverse transform X = exp(Y) + a where Y has unbounded support.
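
Written out as ordinary functions, the two directions look like this (a sketch with made-up names; this is not Stan's internal code):

functions {
  // unconstraining ("free") transform: constrained x > a  ->  unconstrained y
  real my_lb_free(real x, real a) {
    return log(x - a);
  }
  // constraining (inverse) transform: unconstrained y  ->  constrained x > a
  real my_lb_constrain(real y, real a) {
    return exp(y) + a;
  }
}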

So if I understand this right: if I define a parameter, for example real<lower=0> sigma, when it is passed to functions in my model, is it actually exp(sigma) that is passed?

It depends on the context, but usually, yes.

Consider the code

parameters {
   real<lower=0> sigma;
}

If you are using the initialization features of the model, you can pass in values on the constrained scale, which are then unconstrained using the log-based transform. This also includes error checking: for example, if you pass init=inits.json containing

{
    "sigma": -1
}

This will fail. The error message mentions lb_free, which is what we call the lower-bound unconstraining ("free") transform.

Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: lb_free: Lower bounded variable is -1, but must be greater than or equal to 0.000000 (found before start of program)

Outside of initialization, the inverse transform is used. This is because our algorithms like HMC are defined over the unconstrained real numbers. So the algorithm will generate a new value for the parameter on (-\infty, \infty) and then exponentiate it (and add the lower bound) to get the constrained value.

Hmmm I think I am confused on this definition of lower bound. Maybe if I show you an example in my code it will make more sense:

// generated with brms 2.17.0
functions {
  
  /* hurdle lognormal log-PDF of a single response
   * logit parameterization of the hurdle part
   * Args:
   *   y: the response value
   *   mu: mean parameter of the lognormal distribution
   *   sigma: sd parameter of the lognormal distribution
   *   hu: linear predictor for the hurdle part
   * Returns:
   *   a scalar to be added to the log posterior
   */
  real hurdle_lognormal_logit_lpdf(real y, real mu, real sigma, real hu) {
    if (y == 0) {
      return bernoulli_logit_lpmf(1 | hu);
    } else {
      return bernoulli_logit_lpmf(0 | hu) +
             lognormal_lpdf(y | mu, sigma);
    }
  }
}
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable (MBH)
  int<lower=1> K;  // number of population-level effects
  matrix[N, K] X;  // population-level design matrix
  int<lower=1> K_hu;  // number of population-level effects
  matrix[N, K_hu] X_hu;  // population-level design matrix
  int prior_only;  // should the likelihood be ignored?
  vector[N] errors;
}
transformed data {
  int Kc = K - 1;
  matrix[N, Kc] Xc;  // centered version of X without an intercept
  vector[Kc] means_X;  // column means of X before centering
  int Kc_hu = K_hu - 1;
  matrix[N, Kc_hu] Xc_hu;  // centered version of X_hu without an intercept
  vector[Kc_hu] means_X_hu;  // column means of X_hu before centering
  for (i in 2:K) {
    means_X[i - 1] = mean(X[, i]);
    Xc[, i - 1] = X[, i] - means_X[i - 1];
  }
  for (i in 2:K_hu) {
    means_X_hu[i - 1] = mean(X_hu[, i]);
    Xc_hu[, i - 1] = X_hu[, i] - means_X_hu[i - 1];
  }
}
parameters {
  vector[Kc] b;  // population-level effects
  real Intercept;  // temporary intercept for centered predictors
  real<lower=0> sigma;  // dispersion parameter
  vector[Kc_hu] b_hu;  // population-level effects
  real Intercept_hu;  // temporary intercept for centered predictors
}
transformed parameters {
  real lprior = 0;  // prior contributions to the log posterior
  lprior += student_t_lpdf(Intercept | 3, 2, 2.5);
  lprior += student_t_lpdf(sigma | 3, 0, 2.5)
    - 1 * student_t_lccdf(0 | 3, 0, 2.5);
  lprior += logistic_lpdf(Intercept_hu | 0, 1);
}
model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] mu = Intercept + Xc * b;
    // initialize linear predictor term
    vector[N] hu = Intercept_hu + Xc_hu * b_hu;
    for (n in 1:N) {
      target += hurdle_lognormal_logit_lpdf(Y[n] | mu[n], sigma, hu[n]); // modify sigma for errors[n]
    }
  }
  // priors including constants
  target += lprior;
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept - dot_product(means_X, b);
  // actual population-level intercept
  real b_hu_Intercept = Intercept_hu - dot_product(means_X_hu, b_hu);
}

For instance, in this code I have a lower bound = 0 on the parameter sigma. Here are my questions specific to this example to better understand:

  1. Can it sample a negative value for sigma and then exponentiate it? Or it will only sample sigma >= 0?

  2. When passing sigma to the hurdle_lognormal_logit_lpdf function, will it pass the literal value of sigma, or rather exp(sigma)?

  3. When looking at the summary of the model after sampling, is the mean value given for sigma literally sigma or exp(sigma)/log(sigma)?

Sorry for all the questions, I’m trying to implement it in Python to better understand all the implicit details going on in Stan (I’m fairly new to Stan), but I can’t seem to obtain the same results for the fitted parameters and I think this is the issue.

The samplers will always sample over the unconstrained space \mathbb{R}^N. So, for example, the sampler is free to propose -11 for the unconstrained value of sigma.

Inside your model, the values are always on the constrained scale, so in this case sigma = exp(sigma_value_from_algorithm). Inside your functions, etc., you can assume sigma will always be positive.

Reported parameter values are on the constrained scale, e.g. they will always be positive in this case. The exception to this is HMC diagnostic output, which will be on the unconstrained scale (e.g. it will be log(sigma)), but if you aren’t running with diagnostic outputs enabled you do not need to worry about this.
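
To make that concrete, here is a rough sketch of what the declared real<lower=0> sigma amounts to behind the scenes, written out by hand (sigma_unc is a made-up name for the unconstrained value; this is not the literal generated code):

parameters {
  real sigma_unc;                // the value the sampler actually moves around
}
transformed parameters {
  real sigma = exp(sigma_unc);   // constraining transform: always positive
}
model {
  // change-of-variables correction: log |d sigma / d sigma_unc| = sigma_unc
  target += sigma_unc;
  // ... priors and likelihood written in terms of sigma, as in your model
}

If you reimplement this in Python, both steps (the exp and the Jacobian term) are needed to match what Stan is doing.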


This remains confusing to me! Let’s say I want to sample from a standard normal distribution constrained to be positive. According to what I read above, this should mean Y = log(X), where X is standard normal, and we want to sample from Y? That seems to reverse the transformed and untransformed variables as written in the reference manual. What would this look like in a Stan program with no <lower=0> constraints?

Not quite. When it’s passed to functions in your model, you’ll just be getting sigma as defined.

The sampler will sample on the unconstrained space (i.e., it will sample \log \sigma). But it then does the constraining transform so that the Stan program can deal with \sigma as defined. Here’s a very simple example.

parameters {
  real<lower=0> y;
}
model {
  print("y = ", y, ";  log(y) = ", log(y));
  y ~ lognormal(0, 1);
}

The print() function here will print the value of y, which will always be positive, and log(y), which is unconstrained.

The sampling statement means the model will evaluate lognormal_lupdf(y | 0, 1), where lupdf is log unnormalized pdf, and y will be positive.

The sampler will sample over log(y) and do the transform with exp() back to y. It will also add the change-of-variables correction so that the sampling works properly and our constrained draws are from the lognormal.
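
For the earlier question about what this looks like with no <lower=0> constraint at all: one equivalent way to write it (a sketch, with log_y as a made-up name) is to sample log(y) directly and exponentiate afterwards:

parameters {
  real log_y;             // unconstrained; this is what the sampler works with anyway
}
model {
  log_y ~ normal(0, 1);   // if y ~ lognormal(0, 1), then log(y) ~ normal(0, 1)
}
generated quantities {
  real y = exp(log_y);    // recover the draws on the positive scale
}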

(Sorry for the double/delete post …) Does the lower constraint on y even make a difference in this case? It seems unnecessary.

Yes, it makes a difference, and yes, it’s necessary. Unfortunately, there’s no way to automate this the way BUGS/JAGS do, because the problem is undecidable for Stan (we have a Turing-equivalent language, whereas BUGS/JAGS only describe finite graphical models, so it’s tractable in that context).

Stan samples unconstrained parameters and then uses the declared constraints to transform to constrained. So there will be an underlying unconstrained variable for y that is then run through exp() to make it positive (with Jacobian adjustment being done implicitly).

If you don’t include the constraint, Stan devolves to a rejection sampler. That is, whenever a negative y is proposed by the sampler (for instance, by taking a gradient step), it will be rejected. This is bad news for sampling efficiency.
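
For comparison, here is the version without the constraint, the one that devolves to rejection (same toy model as above, purely as an illustration):

parameters {
  real y;                 // no <lower=0>: the sampler is free to propose y < 0
}
model {
  y ~ lognormal(0, 1);    // the lognormal density is not defined for y <= 0,
                          // so those proposals get rejected
}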