Jacobian determinant adjustment for rstanarm's implementation of the induced Dirichlet prior in stan_polr

In this vignette on estimating ordinal regression models with rstanarm, it is explained that the stan_polr() function uses an induced Dirichlet prior on the cutpoints. The same induced prior is discussed by @betanalpha in this separate vignette. In section 2.2 of his vignette, @betanalpha states that the induced Dirichlet prior necessitates a Jacobian determinant adjustment, which he implements like this:

functions {
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);
    
    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];
    
    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;
    
    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = - rho;
      J[k - 1, k] = rho;
    }
    
    return   dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }
}
parameters {
  ordered[K - 1] c; // (Internal) cut points
}
model {
  c ~ induced_dirichlet(rep_vector(1, K), 0);
}

Notice that induced_dirichlet_lpdf() returns dirichlet_lpdf(p | alpha) + log_determinant(J), i.e. the Jacobian adjustment is included in the returned log density.
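To make the “induced” part concrete, here is (what I believe is) the same probability construction written out in R, with the anchor point phi fixed at 0 and arbitrary ordered cutpoints just for illustration:

# Induced ordinal probabilities from the Stan function above, with phi = 0
K <- 4
c_pts <- sort(rnorm(K - 1))                      # arbitrary ordered cutpoints
sigma <- plogis(0 - c_pts)                       # inv_logit(phi - c)
p <- c(1 - sigma[1],                             # p[1]
       sigma[1:(K - 2)] - sigma[2:(K - 1)],      # p[2], ..., p[K-1]
       sigma[K - 1])                             # p[K]
sum(p)                                           # telescopes to 1: p is a simplex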

rstanarm uses a different approach, defining the probabilities pi as parameters and computing the cutpoints (cutpoints) as transformed parameters. The relevant parts of the Stan code generated by the stan_polr function look like this:

parameters {
   simplex[J] pi;
}
transformed parameters {
    cutpoints = make_cutpoints(pi, Delta_y, link);
}
model {
    target += dirichlet_lpdf(pi | prior_counts);
}

This approach computes the cutpoints as transformed parameters from the vector of probabilities pi, a scaling parameter Delta_y (which is equivalent to the error scale \sigma_y of the outcome variable), and a link function.
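For intuition, here is a rough R sketch of what that transformation could look like for the logit link: cumulative sums of pi are mapped through the logit and rescaled by Delta_y. (This is just my reading of the generated Stan code; I haven’t checked the actual make_cutpoints() source, so treat the helper below as hypothetical.)

# Hypothetical R analogue of make_cutpoints() for the logit link
# (my guess, not the actual rstanarm implementation)
make_cutpoints_logit <- function(pi, Delta_y) {
  J <- length(pi)                      # number of outcome categories
  cum_pi <- cumsum(pi)[1:(J - 1)]      # cumulative probabilities P(Y <= j)
  Delta_y * qlogis(cum_pi)             # logit of cumulative probs, rescaled
}

pi <- c(0.1, 0.2, 0.4, 0.3)            # a simplex draw with J = 4 categories
make_cutpoints_logit(pi, Delta_y = 1)  # J - 1 = 3 ordered cutpoints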

Are these two methods equivalent?

Good question. The second, which I use, is fast and clean, and with the right Dirichlet concentration parameter you can mimic intercept MLEs with posterior modes.


So basically, after carefully reading the Stan documentation on reparameterization and change of variables, I see where the difference between these two approaches lies. From what I understand, @betanalpha’s code takes the change-of-variables approach, which requires a Jacobian determinant adjustment, while the stan_polr code uses a variable-transformation approach, which does not require any adjustment.

I’m still curious about the tradeoff between these two modeling techniques, however. Obviously, the stan_polr code is much faster since it doesn’t need to numerically compute the determinant of the Jacobian matrix. From that perspective, I don’t see any reason why you would prefer to specify a prior directly on the cutpoints as described by @betanalpha.

Is this an empirical observation, or an intuition? Both constructions hide layers of complexity behind the variable declarations (the ordered declaration and the simplex declaration, respectively) that involve an under-the-hood parameterization in unconstrained space and a Jacobian adjustment. See the Constraint Transforms chapter of the Stan reference manual.

If you’re interested in digging here, you might check whether the two approaches are literally the same underlying parameterization, with one writing out in Stan code what the other hides behind the relatively more complicated simplex constraint. If they turn out not to be the same parameterization (up to a linear rescaling of parameter space), then irrespective of which one involves more computation per gradient eval, it is possible and perhaps likely that each will outperform the other for certain datasets, since the performance of HMC sampling can be sensitive to the details of the shape of the posterior.

Is this an empirical observation, or an intuition?

I’ve tried both with simulated data and the code from rstanarm is typically much faster. However, the comparison isn’t exactly fair, because stan_polr also implements the R2D2 prior on \boldsymbol{\beta} (although it does so in a weird way, using a unit_vector instead of a simplex). I would like to experiment with the R2D2 prior myself but I don’t exactly understand how to implement it with ordered logistic regression. I discuss this in my last reply in this thread if you’d like to take a look.

If you’re interested in digging here, you might check whether the two approaches are literally the same underlying parameterization, with one writing out in Stan code what the other hides behind the relatively more complicated simplex constraint.

By “writing out the simplex constraint”, do you mean doing something similar to what’s described in this “change of variables” example, where a lognormal distribution is coded by assigning a normal distribution to the logarithm of y > 0? So something like this:

data {
  real mu;
  real<lower=0> sigma;
}
parameters {
  real<lower=0> y;
}
model {
  log(y) ~ normal(mu, sigma);  // density assigned to the transformed variable
  target += -log(y);           // Jacobian: log |d log(y) / dy| = -log(y)
}

The Stan manual chapter you linked doesn’t really give code examples, so I’m not sure how I would write them out.
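In the meantime, here is a quick numerical check in R (no Stan involved) that the target += -log(y) term in my sketch above is the right adjustment: the implied density of y should match R’s built-in lognormal density exactly.

# log density of y when log(y) ~ normal(mu, sigma), including the Jacobian
# term log|d log(y)/dy| = -log(y), compared against R's lognormal density
mu <- 0.5; sigma <- 1.3; y <- 2.7
log_dens_adjusted  <- dnorm(log(y), mu, sigma, log = TRUE) - log(y)
log_dens_lognormal <- dlnorm(y, meanlog = mu, sdlog = sigma, log = TRUE)
all.equal(log_dens_adjusted, log_dens_lognormal)  # TRUE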

Yes. When you declare constrained parameters in Stan, under the hood the model works with a set of unconstrained parameters which are subjected to an appropriate constraining transform, and the target density is automatically incremented by the associated Jacobian adjustment.

About the numerical computation of the determinant: I don’t think it’s actually needed, because the Jacobian matrix here is simple enough that its determinant can be worked out in closed form (e.g. via the Leibniz formula or a cofactor expansion).
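For instance (this is my own algebra, so worth double-checking): adding rows 2 to K to the first row and then expanding gives log|det J| = log(K) + \sum_k \log(\sigma_{k-1}(1 - \sigma_{k-1})), which you can check numerically in R:

# Build J exactly as in induced_dirichlet_lpdf() and compare the numerical
# log|det J| with the closed form log(K) + sum(log(rho)), where
# rho[k] = sigma[k] * (1 - sigma[k]). The closed form is my own derivation.
K <- 6
c_pts <- sort(rnorm(K - 1, sd = 2))   # arbitrary ordered cutpoints
sigma <- plogis(0 - c_pts)            # inv_logit(phi - c) with phi = 0
rho <- sigma * (1 - sigma)

J <- matrix(0, K, K)
J[, 1] <- 1                           # baseline column
for (k in 2:K) {
  J[k, k]     <- -rho[k - 1]          # diagonal
  J[k - 1, k] <-  rho[k - 1]          # superdiagonal
}

log_det_numeric <- as.numeric(determinant(J, logarithm = TRUE)$modulus)
log_det_closed  <- log(K) + sum(log(rho))
all.equal(log_det_numeric, log_det_closed)  # TRUE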


It’s still not clear why it’s worth the trouble. Putting priors on the induced multinomial probabilities using Dirichlet with a well-chosen concentration parameter has worked extremely well for me, and it’s fast.

It’s still not clear why it’s worth the trouble.

My question exactly. There is @jsocolar’s comment about the hidden assumptions Stan makes around ordered vs. simplex vectors, but I’m not sure how I would test those assumptions. Working through this problem, I think I’ve managed to write both transformations in R code:

# Ordered transform: unconstrained y -> ordered vector x
# (x[1] = y[1], then x[k] = x[k-1] + exp(y[k]))
K <- 5
y <- rnorm(K)
x <- rep(0, K)
x[1] <- y[1]
for (k in 2:K) {
    x[k] <- y[1] + sum(exp(y[2:k]))
}

# Simplex transform (stick-breaking): unconstrained y -> simplex x
inv_logit <- function(x) {
    1 / (1 + exp(-x))
}
K <- 5
y <- rnorm(K - 1)
z <- rep(0, K - 1)
for (k in 1:(K - 1)) {
    z[k] <- inv_logit(y[k] + log(1 / (K - k)))   # break proportion
}
x <- rep(0, K)
for (k in 1:(K - 1)) {
    x[k] <- (1 - sum(x[seq_len(k - 1)])) * z[k]  # remaining stick * break
}
x[K] <- 1 - sum(x[1:(K - 1)])

The code works but I’m still trying to figure out what the next step is. @jsocolar Could you help me out from here? I’ll keep working on it.
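One thing I might try next (just a sketch, assuming the logit link and my earlier guess at what make_cutpoints() does, with the Delta_y rescaling omitted): feed the same unconstrained draw through both compositions and see whether they land on the same cutpoints.

# Route A: unconstrained y -> ordered vector (what `ordered[K-1] c` declares)
# Route B: unconstrained y -> simplex pi -> logit(cumsum(pi)) cutpoints
#          (the logit-link cutpoint construction is my guess, not verified
#           against the rstanarm source)
inv_logit <- function(x) 1 / (1 + exp(-x))

ordered_transform <- function(y) {
  c(y[1], y[1] + cumsum(exp(y[-1])))
}

simplex_transform <- function(y) {
  K <- length(y) + 1
  z <- inv_logit(y + log(1 / (K - seq_len(K - 1))))
  x <- numeric(K)
  for (k in seq_len(K - 1)) x[k] <- (1 - sum(x[seq_len(k - 1)])) * z[k]
  x[K] <- 1 - sum(x[-K])
  x
}

K <- 5
y <- rnorm(K - 1)                                           # same unconstrained draw
route_A <- ordered_transform(y)                             # cutpoints directly
route_B <- qlogis(cumsum(simplex_transform(y))[1:(K - 1)])  # via the simplex
route_A
route_B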

If one parameterization is working well for you, then just use it. If it’s of academic interest, you might check whether the two ways of expressing the model are literally the same parameterization or not. If they are literally the same parameterization, then whichever one yields a faster computation time per gradient eval should always be more efficient. If they turn out not to be the same parameterization, then it is possible (and perhaps likely in practice) that there are datasets for which either outperforms the other computationally.

There are no hidden assumptions that need testing here. The statistical assumptions/models expressed in these approaches are the same. The only question is whether and when one or the other is computationally superior.


These are all good comments, and computational efficiency does need to play an important role. We also need to have external benchmarks. For me it’s recovery of the MLEs of the intercepts by the posterior modes when all non-intercept priors are flat. A Dirichlet with a specific k-dependent concentration parameter (k = number of distinct levels of Y) does that to within a small tolerance.
