New Stan data type: zero_sum_vector

functions {
  // Builds the 2N coefficients of a QR-based map from an (N-1)-vector of
  // unconstrained values to an N-vector that sums to zero.
  vector Q_sum_to_zero_QR(int N) {
    vector[2 * N] Q_r;

    for (i in 1:N) {
      Q_r[i] = -sqrt((N - i) / (N - i + 1.0));
      Q_r[i + N] = inv_sqrt((N - i) * (N - i + 1));
    }
    // Optionally rescale so the constrained components have unit marginal
    // scale; the model block below uses the matching factor on sigma instead.
    // Q_r = Q_r * inv_sqrt(1 - inv(N));
    return Q_r;
  }

  // Applies the QR coefficients to map the N-1 free values in x_raw to an
  // N-vector x with sum(x) == 0.
  vector sum_to_zero_QR(vector x_raw, vector Q_r) {
    int N = num_elements(x_raw) + 1;
    vector[N] x;
    real x_aux = 0;

    for (i in 1:(N - 1)) {
      x[i] = x_aux + x_raw[i] * Q_r[i];
      x_aux = x_aux + x_raw[i] * Q_r[i + N];
    }
    x[N] = x_aux;
    return x;
  }
}
data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
}
transformed data {
  vector[2 * K] Q_alpha = Q_sum_to_zero_QR(K);
}
parameters {
  real mu;
  real<lower=0> sigma;
  vector[K - 1] alpha_pre;
}
transformed parameters {
  vector[K] alpha = mu + sum_to_zero_QR(alpha_pre, Q_alpha);
}
model {
  // The factor inv(sqrt(1 - inv(K))) compensates for the reduction in
  // marginal standard deviation induced by the sum-to-zero constraint.
  alpha ~ normal(mu, inv(sqrt(1 - inv(K))) * sigma);
  // mu ~ normal(0, 10);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}
generated quantities {
  real mean_alpha = mean(alpha);
  real<lower=0> sd_alpha = sd(alpha);
}

An alternative to SVD based on QR decomposition.
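For comparison, here is a sketch of how the same model might look with the new constrained type this thread is about. I'm using the zero_sum_vector declaration from the thread title, assuming the type behaves like an ordinary vector in expressions and that Stan handles the constraint's transform internally; the scale adjustment mirrors the one in the program above.

data {
  int<lower=0> K;
  int<lower=0> N;
  array[K, N] int<lower=0, upper=1> y;
}
parameters {
  real mu;
  real<lower=0> sigma;
  zero_sum_vector[K] alpha_z;  // sums to zero by construction
}
transformed parameters {
  vector[K] alpha = mu + alpha_z;
}
model {
  // Same scale adjustment as above: the constraint shrinks the marginal
  // standard deviation by sqrt(1 - 1/K), so rescale sigma to compensate.
  alpha_z ~ normal(0, inv_sqrt(1 - inv(K)) * sigma);
  sigma ~ lognormal(0, 1);
  for (k in 1:K)
    y[k] ~ bernoulli_logit(alpha[k]);
}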


I don’t know if I agree that it’s an actual violation. It’s more of a composition of multiple rules of thumb at the same time.

As you note a little bit later, it’s not that the Jacobian isn’t needed; it’s that the Jacobian is a constant, and Stan programs are equivalent up to translations of the final target value. Indeed, a user can be very pedantic and write a Stan program with that constant Jacobian included explicitly, which is itself good practice. The ubiquity of examples that implement half-normal priors without an explicit Jacobian makes those examples appear simpler, but it makes it harder to recognize and appreciate these subtleties later in one’s pedagogical journey.

So the half-normal example is correct if we consider both (2) and
4. Stan programs are equivalent up to additive constants.
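For example, here is a minimal sketch of the pedantic version; this is my own illustration, assuming the constant being dropped is the half-normal truncation term log 2.

parameters {
  real<lower=0> sigma;
}
model {
  // Half-normal prior: the sampling statement drops the truncation term,
  // which is the constant log(2), so adding it back explicitly only
  // translates the target and leaves the posterior unchanged.
  sigma ~ normal(0, 1);
  target += log2();  // pedantic constant adjustment
}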

The term “Jacobian” is definitely prone to abuse, and I strongly agree that we should take more care to use terms like “Jacobian determinant” and “Jacobian correction” in the context of proper bijections.

I think that the subtlety here is how a multivariate constrained type is interpreted. Interpreting multivariate constrained types simply as arrays of independent variables definitely leads to confusion: why are there a different number of elements in the Stan program than in the unconstrained space that the Stan algorithms actually explore?

Constrained types, however, are not just arrays. They are arrays along with the defining constraint. Because of the sum-to-zero constraint, zero_sum_vector[N] y; isn’t an array defined by N independent variables but rather by N dependent variables. Equivalently, we can interpret the type as implicitly defining N - 1 independent variables.

Stan works with those N - 1 independent variables, which requires integrating out the redundant information. This integration might require transforming the N variables so that the redundant information is isolated into just one variable that’s trivial to marginalize out, or integrating out the redundant information onto a subset of the real numbers and then transforming that subset to the full real numbers, or both. The simplex transformation, for example, does both.

The Stan program, however, presents the N dependent variables. It’s not that there is an extra variable hanging around; it’s that the variables are coupled together deterministically, and the user can be guaranteed that Stan will manage that constraint properly.

I think that these dual interpretations of a larger collection of dependent variables and a smaller collection of independent variables help clarify some of the conceptual issues. For example, there isn’t always a unique map from one to the other; there are many ways to map from dependent variables to independent variables. In the zero_sum_vector case we can take any collection of N - 1 dependent variables as the independent variables and define the remaining variable as the negative of their sum.
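To make the bookkeeping concrete, one such map (my notation, not from the post above) takes the first N - 1 components as the independent variables:

$$
(y_1, \dots, y_{N-1}) \;\longmapsto\; \Big(y_1, \dots, y_{N-1}, \, -\sum_{n=1}^{N-1} y_n\Big),
$$

which sums to zero by construction; the QR-based map in the first post is a different choice of the same kind.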

In terms of building up intuition around Jacobians, I think that the most robust interpretation that doesn’t involve all of the mathematics is to think about the constraints as telling Stan how to remove parts of the parameter space. For example, in the half-normal case we want to start with a normal density defined over an unconstrained space and then remove the negative half. What remains is a half-normal density over the positive half. Similarly

data {
  int<lower=1> N;
}
parameters {
  zero_sum_vector[N] y;
}
model {
  y ~ std_normal();
}

would be interpreted as a product of N standard normal density functions before Stan excised the points where \sum_{n} y_{n} \ne 0.
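Written out, that intended target is just the sentence above in symbols:

$$
\pi(y_1, \dots, y_N) \;\propto\; \prod_{n=1}^{N} \text{normal}(y_n \mid 0, 1)
\quad \text{restricted to the hyperplane } \sum_{n=1}^{N} y_n = 0.
$$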

Jacobian problems arise when we try to parameterize the excised spaces directly, but those direct parameterizations are warped relative to the initial parameterization. Removing the negative half of the real values that the variable x can take with a positivity constraint doesn’t warp the positive half, but modeling the positive half with y = log(x) does. In the case of a simplex[N] variable we have to excise most of the initial parameter space, and what’s left can’t be directly modeled with N - 1 independent random numbers. Instead we have to take the remaining (N - 1)-dimensional space and warp it into an unconstrained (N - 1)-dimensional real space.
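To make the warping concrete, here is a minimal sketch of my own (not from the post) of modeling the positive half-line through y = log(x), where the now non-constant Jacobian term has to be written out:

parameters {
  real y;  // y = log(x): an unconstrained reparameterization of x > 0
}
model {
  real x = exp(y);
  // The map x = exp(y) warps the positive half-line, so the half-normal
  // density on x needs the log Jacobian |dx/dy| = exp(y), i.e. + y.
  target += normal_lpdf(x | 0, 1) + y;
}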

I don’t think that is a particularly fair assessment of the claims made in the paper:

Second, Table 3 suggests that the Gibbs Sampler can be substantially more efficient than HMC and NUTS for random effect models such as Model CkP, thanks to a lower runtime and, arguably, a more direct use of conditional independence among random effects. We note that empirical runtimes can be highly dependent on software implementations and this could be unfavourable to a generic software implementation such as the STAN.

The speed improvements are relatively small and could easily be accounted for by optimizing the Stan configuration for this particular kind of model.

Here’s the same quote with my emphasis:

Table 3 suggests that the Gibbs Sampler can be substantially more efficient than HMC and NUTS for random effect models such as Model CkP,

our results suggest that Gibbs Sampling schemes built on our methodological guidance can be substantially cheaper than gradient-based ones in the context of hierarchical models, leading to improved performances

Was the perceived unfairness because I didn’t qualify with “suggesting” as they do? Otherwise, I don’t get what you think isn’t fair about my summary.

Their Table 3 reports NUTS at 40 ESS/s versus 6000 ESS/s for their Gibbs sampler, which is roughly a 150x speedup. And they implemented their approach in R, so they’ve also got some headroom for computational speedups. I gave them a factor of 2 by rewriting the Stan code more efficiently. But then they’re using R! By optimizing the Stan config, did you mean setting the number of warmup and sampling iterations, the initial stepsize, and whatnot? That’s probably still easier than writing your own conditionals like they did.

Spot quotes from the abstract and introduction will always come across as boisterous due to the limited context, but I think that their discussion in the actual experiment section is as fair as possible, in particular the line that I quoted above: “We note that empirical runtimes can be highly dependent on software implementations and this could be unfavourable to a generic software implementation such as the STAN.”

Yes, and also the configuration of the warmup adaptation phases to move from robustness across multiple target distributions to speed for a particular target distribution.