Jacobian determinant adjustment for rstanarm's implementation of the induced Dirichlet prior in stan_polr

In this vignette on estimating ordinal regression models with rstanarm, it is explained that the stan_polr() function uses an induced Dirichlet prior on the cutpoints. The same induced prior is discussed by @betanalpha in this separate vignette. In section 2.2 of his vignette, @betanalpha states that the induced Dirichlet prior necessitates a Jacobian determinant adjustment, which he implements like this:

functions {
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);
    
    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];
    
    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;
    
    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = - rho;
      J[k - 1, k] = rho;
    }
    
    return   dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }
}
parameters {
  ordered[K - 1] c; // (Internal) cut points
}
model {
  c ~ induced_dirichlet(rep_vector(1, K), 0);
}

Notice that induced_dirichlet_lpdf() returns dirichlet_lpdf(p | alpha) + log_determinant(J), i.e. the Jacobian adjustment is included in the returned log density.
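To make the “induced” part concrete, here is (what I believe is) the same probability construction written out in R, with the anchor point phi fixed at 0 and arbitrary ordered cutpoints just for illustration:

# Induced ordinal probabilities from the Stan function above, with phi = 0
K <- 4
c_pts <- sort(rnorm(K - 1))                      # arbitrary ordered cutpoints
sigma <- plogis(0 - c_pts)                       # inv_logit(phi - c)
p <- c(1 - sigma[1],                             # p[1]
       sigma[1:(K - 2)] - sigma[2:(K - 1)],      # p[2], ..., p[K-1]
       sigma[K - 1])                             # p[K]
sum(p)                                           # telescopes to 1: p is a simplex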

rstanarm uses a different approach, defining the probabilities pi as parameters and computing the cutpoints (cutpoints) as transformed parameters. The relevant parts of the Stan code generated by the stan_polr function look like this:

parameters {
   simplex[J] pi;
}
transformed parameters {
    cutpoints = make_cutpoints(pi, Delta_y, link);
}
model {
    target += dirichlet_lpdf(pi | prior_counts);
}

This approach computes the cutpoints as transformed parameters from the vector of probabilities pi, a scaling parameter Delta_y (which is equivalent to the error scale \sigma_y of the outcome variable), and a link function.
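For intuition, here is a rough R sketch of what that transformation could look like for the logit link: cumulative sums of pi are mapped through the logit and rescaled by Delta_y. (This is just my reading of the generated Stan code; I haven’t checked the actual make_cutpoints() source, so treat the helper below as hypothetical.)

# Hypothetical R analogue of make_cutpoints() for the logit link
# (my guess, not the actual rstanarm implementation)
make_cutpoints_logit <- function(pi, Delta_y) {
  J <- length(pi)                      # number of outcome categories
  cum_pi <- cumsum(pi)[1:(J - 1)]      # cumulative probabilities P(Y <= j)
  Delta_y * qlogis(cum_pi)             # logit of cumulative probs, rescaled
}

pi <- c(0.1, 0.2, 0.4, 0.3)            # a simplex draw with J = 4 categories
make_cutpoints_logit(pi, Delta_y = 1)  # J - 1 = 3 ordered cutpoints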

Are these two methods equivalent?

Good question. The second, which I use, is fast and clean, and with the right Dirichlet concentration parameter you can mimic intercept MLEs with posterior modes.


So basically, after carefully reading the Stan documentation on reparameterization and change of variables, I see where the difference between these two approaches lies. From what I understand, @betanalpha’s code takes the change-of-variables approach, which requires a Jacobian determinant adjustment, while the stan_polr code uses a variable-transformation approach, which does not require any adjustment.

I’m still curious about the tradeoff between these two modeling techniques, however. Obviously, the stan_polr code is much faster since it doesn’t need to numerically compute the determinant of the Jacobian matrix. From that perspective, I don’t see any reason why you would prefer to specify a prior directly on the cutpoints as described by @betanalpha.

Is this an empirical observation, or an intuition? Both constructions hide layers of complexity behind the variable declarations (the ordered declaration and the simplex declaration, respectively) that involve an under-the-hood parameterization in unconstrained space and a Jacobian adjustment. See the Constraint Transforms chapter of the Stan reference manual.

If you’re interested in digging here, you might check whether the two approaches are literally the same underlying parameterization, with one writing out in Stan code what the other hides behind the relatively more complicated simplex constraint. If they turn out not to be the same parameterization (up to a linear rescaling of parameter space), then irrespective of which one involves more computation per gradient eval, it is possible and perhaps likely that each will outperform the other for certain datasets, since the performance of HMC sampling can be sensitive to the details of the shape of the posterior.

Is this an empirical observation, or an intuition?

I’ve tried both with simulated data and the code from rstanarm is typically much faster. However, the comparison isn’t exactly fair, because stan_polr also implements the R2D2 prior on \boldsymbol{\beta} (although it does so in a weird way, using a unit_vector instead of a simplex). I would like to experiment with the R2D2 prior myself but I don’t exactly understand how to implement it with ordered logistic regression. I discuss this in my last reply in this thread if you’d like to take a look.

If you’re interested in digging here, you might check whether the two approaches are literally the same underlying parameterization, with one writing out in Stan code what the other hides behind the relatively more complicated simplex constraint.

By “writing out the simplex constraint”, do you mean doing something similar to what’s described in this “change of variables” example, where a lognormal distribution is coded by assigning a normal distribution to the logarithm of y > 0? So something like this:

data {
  real mu;
  real<lower=0> sigma;
}
parameters {
  real<lower=0> y;
}
model {
  log(y) ~ normal(mu, sigma);  // density assigned to the transformed variable
  target += -log(y);           // Jacobian: log |d log(y) / dy| = -log(y)
}

The Stan manual chapter you linked doesn’t really give code examples, so I’m not sure how I would write them out.
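In the meantime, here is a quick numerical check in R (no Stan involved) that the target += -log(y) term in my sketch above is the right adjustment: the implied density of y should match R’s built-in lognormal density exactly.

# log density of y when log(y) ~ normal(mu, sigma), including the Jacobian
# term log|d log(y)/dy| = -log(y), compared against R's lognormal density
mu <- 0.5; sigma <- 1.3; y <- 2.7
log_dens_adjusted  <- dnorm(log(y), mu, sigma, log = TRUE) - log(y)
log_dens_lognormal <- dlnorm(y, meanlog = mu, sdlog = sigma, log = TRUE)
all.equal(log_dens_adjusted, log_dens_lognormal)  # TRUE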

Yes. When you declare constrained parameters in Stan, under the hood the model works with a set of unconstrained parameters which are subjected to an appropriate constraining transform, and the target density is automatically incremented by the associated Jacobian adjustment.

About the numerical computation of the determinant: I don’t think it’s actually needed, because the Jacobian matrix here is simple enough that its determinant can be worked out in closed form (e.g. via the Leibniz formula or a cofactor expansion).
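For instance (this is my own algebra, so worth double-checking): adding rows 2 to K to the first row and then expanding gives log|det J| = log(K) + \sum_k \log(\sigma_{k-1}(1 - \sigma_{k-1})), which you can check numerically in R:

# Build J exactly as in induced_dirichlet_lpdf() and compare the numerical
# log|det J| with the closed form log(K) + sum(log(rho)), where
# rho[k] = sigma[k] * (1 - sigma[k]). The closed form is my own derivation.
K <- 6
c_pts <- sort(rnorm(K - 1, sd = 2))   # arbitrary ordered cutpoints
sigma <- plogis(0 - c_pts)            # inv_logit(phi - c) with phi = 0
rho <- sigma * (1 - sigma)

J <- matrix(0, K, K)
J[, 1] <- 1                           # baseline column
for (k in 2:K) {
  J[k, k]     <- -rho[k - 1]          # diagonal
  J[k - 1, k] <-  rho[k - 1]          # superdiagonal
}

log_det_numeric <- as.numeric(determinant(J, logarithm = TRUE)$modulus)
log_det_closed  <- log(K) + sum(log(rho))
all.equal(log_det_numeric, log_det_closed)  # TRUE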


It’s still not clear why it’s worth the trouble. Putting priors on the induced multinomial probabilities using Dirichlet with a well-chosen concentration parameter has worked extremely well for me, and it’s fast.

It’s still not clear why it’s worth the trouble.

My question exactly. There is @jsocolar’s comment about the hidden assumptions Stan makes around ordered vs. simplex vectors, but I’m not sure how I would test those assumptions. Working through this problem, I think I’ve managed to write both transformations in R code:

# Ordered transform: unconstrained y -> ordered vector x
# (x[1] = y[1], then x[k] = x[k-1] + exp(y[k]))
K <- 5
y <- rnorm(K)
x <- rep(0, K)
x[1] <- y[1]
for (k in 2:K) {
    x[k] <- y[1] + sum(exp(y[2:k]))
}

# Simplex transform (stick-breaking): unconstrained y -> simplex x
inv_logit <- function(x) {
    1 / (1 + exp(-x))
}
K <- 5
y <- rnorm(K - 1)
z <- rep(0, K - 1)
for (k in 1:(K - 1)) {
    z[k] <- inv_logit(y[k] + log(1 / (K - k)))   # break proportion
}
x <- rep(0, K)
for (k in 1:(K - 1)) {
    x[k] <- (1 - sum(x[seq_len(k - 1)])) * z[k]  # remaining stick * break
}
x[K] <- 1 - sum(x[1:(K - 1)])

The code works but I’m still trying to figure out what the next step is. @jsocolar Could you help me out from here? I’ll keep working on it.
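One thing I might try next (just a sketch, assuming the logit link and my earlier guess at what make_cutpoints() does, with the Delta_y rescaling omitted): feed the same unconstrained draw through both compositions and see whether they land on the same cutpoints.

# Route A: unconstrained y -> ordered vector (what `ordered[K-1] c` declares)
# Route B: unconstrained y -> simplex pi -> logit(cumsum(pi)) cutpoints
#          (the logit-link cutpoint construction is my guess, not verified
#           against the rstanarm source)
inv_logit <- function(x) 1 / (1 + exp(-x))

ordered_transform <- function(y) {
  c(y[1], y[1] + cumsum(exp(y[-1])))
}

simplex_transform <- function(y) {
  K <- length(y) + 1
  z <- inv_logit(y + log(1 / (K - seq_len(K - 1))))
  x <- numeric(K)
  for (k in seq_len(K - 1)) x[k] <- (1 - sum(x[seq_len(k - 1)])) * z[k]
  x[K] <- 1 - sum(x[-K])
  x
}

K <- 5
y <- rnorm(K - 1)                                           # same unconstrained draw
route_A <- ordered_transform(y)                             # cutpoints directly
route_B <- qlogis(cumsum(simplex_transform(y))[1:(K - 1)])  # via the simplex
route_A
route_B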

If one parameterization is working well for you, then just use it. If it’s of academic interest, you might check whether the two ways of expressing the model are literally the same parameterization or not. If they are literally the same parameterization, then whichever one yields a faster computation time per gradient eval should always be more efficient. If they turn out not to be the same parameterization, then it is possible (and perhaps likely in practice) that there are datasets for which either outperforms the other computationally.

There are no hidden assumptions that need testing here. The statistical assumptions/models expressed in these approaches are the same. The only question is whether and when one or the other is computationally superior.


These are all good comments, and computational efficiency does need to play an important role. We also need to have external benchmarks. For me it’s recovery of the MLEs of the intercepts by the posterior modes when all non-intercept priors are flat. A Dirichlet with a specific k-dependent concentration parameter (k = number of distinct levels of Y) does that to within a small tolerance.
