Dirichlet prior on ordinal regression cutpoints in brms

In a detailed tutorial, @betanalpha makes the case for using a Dirichlet prior on the cutpoints in ordinal regression models (Ordinal Regression).

I confess, I wasn’t able to understand all that much of it beyond the fact that a Dirichlet prior sounds like a good idea. Is there any way to implement a weakly informative Dirichlet prior on the cutpoints in brms? This was also mentioned by @paul.buerkner as a possible feature to develop (Dirichlet prior for cutpoints in ordinal models · Issue #762 · paul-buerkner/brms · GitHub), but I’m not sure if it was ultimately implemented.

I have seven response options on the ordinal scale and am estimating responses with a cumulative probit model.

At the moment, I’ve had ‘success’ simply changing the extremely vague default of student_t(3, 0, 10) to something that favors extreme cutpoints much less: normal(0, 1.5) or student_t(5, 0, 1.5). Is there anything wrong with this approach, if the Dirichlet option is not possible? It at least seems better than the default to me!
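In code, that looks something like this (the outcome, predictor, and data frame are placeholders for my actual model):

library(brms)

# Swap the vague default prior on the cutpoints (the "Intercept" class
# in brms cumulative models) for a tighter one
fit <- brm(
  response ~ predictor,    # placeholder formula
  data = mydata,           # placeholder data frame
  family = cumulative(probit),
  prior = set_prior("normal(0, 1.5)", class = "Intercept")
)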


Right now, a Dirichlet prior on the cutpoints is not possible in brms, to my understanding.

Thanks for the quick response, @paul.buerkner.

The models work with my own custom priors for the intercepts, but it is always nice to be able to point to someone a lot smarter who has properly worked out all the implications and follow their suggestions! Maybe I just need to get to grips with coding directly in Stan.

I think I figured it out. I borrowed the code from @betanalpha’s tutorial. Maybe a bad example model, but try it out!

dirichlet_prior <- "
  real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] sigma = inv_logit(phi - c);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);
    
    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];
    
    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;
    
    // Diagonal entries of Jacobian
    for (k in 2:K) {
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      J[k, k] = - rho;
      J[k - 1, k] = rho;
    }
    
    return   dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }
"

dirichlet_prior_stanvar <- stanvar(scode = dirichlet_prior, block = "functions")

test_standard <- brm(
  formula = bf(cyl ~ mpg, center = FALSE), # center = FALSE is unnecessary for this model
  data = mtcars,
  family = cumulative,
  control = list(adapt_delta = 0.99)
)

test_dirichlet <- brm(
  formula = bf(cyl ~ mpg, center = FALSE), # center = FALSE is probably unnecessary
  data = mtcars,
  family = cumulative,
  stanvars = dirichlet_prior_stanvar,
  prior = set_prior("induced_dirichlet(rep_vector(1, nthres+1), mean(Intercept))",
                    class = "Intercept")
)

EDIT: I should add that it seems like center = FALSE is unnecessary.
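As a sanity check, the generated Stan code can be inspected with make_stancode() to confirm that the custom function and the prior actually end up in the model:

make_stancode(
  bf(cyl ~ mpg),
  data = mtcars,
  family = cumulative,
  stanvars = dirichlet_prior_stanvar,
  prior = set_prior("induced_dirichlet(rep_vector(1, nthres+1), mean(Intercept))",
                    class = "Intercept")
)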

loo(test_standard, test_dirichlet)
Output of model 'test_standard':

Computed from 4000 by 32 log-likelihood matrix

         Estimate  SE
elpd_loo    -28.9 1.8
p_loo         2.2 0.2
looic        57.9 3.6
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Output of model 'test_dirichlet':

Computed from 4000 by 32 log-likelihood matrix

         Estimate  SE
elpd_loo    -23.7 2.7
p_loo         2.5 0.4
looic        47.3 5.4
------
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Model comparisons:
               elpd_diff se_diff
test_dirichlet  0.0       0.0   
test_standard  -5.3       2.0 

Plot of thresholds: [figure omitted]


@StaffanBetner thank you for the work you put into that! I managed to make my model run using this prior alongside my other ones, and it comes out basically the same as my previous results.

Have you any thoughts on how the code should be adapted for a probit model? Most of the code in the tutorial is for an ordered logit model rather than an ordered probit model, so I am wondering whether elements of the code above are more appropriate for a logit than a probit approach, e.g.,

sigma = inv_logit(phi - c)

Is that code specific for logit thresholds?

Yes, since it is a transform bringing the thresholds from the latent scale into (cumulative) probability space. p[k] = sigma[k - 1] - sigma[k] calculates the differences between those cumulative probabilities, which we put a uniform Dirichlet prior (\alpha = 1) on.
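A small R illustration of that transform (the cutpoints here are made up):

phi <- 0
c <- c(-2, -1, 0.5, 1.5)       # some ordered cutpoints (made up)
sigma <- plogis(phi - c)       # inv_logit(phi - c); decreasing in c
p <- c(1 - sigma[1],           # p[1]
       -diff(sigma),           # p[k] = sigma[k - 1] - sigma[k]
       sigma[length(sigma)])   # p[K]
sum(p)                         # 1, so p is a valid simplex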

I think

sigma = inv_logit(phi - c)

is the only line that needs changing for a probit model. Change it into:

sigma = Phi(phi - c)

EDIT: Fixed in the code above: Phi instead of inv_Phi. inv_logit is the CDF of the logistic distribution; Phi is the CDF of the (standard) normal distribution. A little confusing.


Thanks for your quick response. The minor edit doesn’t seem to work, though!

SAMPLING FOR MODEL 'c764d7e8944070096c42fd91a779c49f' NOW (CHAIN 2).
Chain 2: Rejecting initial value:
Chain 2: Error evaluating the log probability at the initial value.
Chain 2: Exception: Exception: inv_Phi: Probability variable is 2.5, but must be in the interval [0, 1] (in 'model42886f392dc0_c764d7e8944070096c42fd91a779c49f' at line 41)
(in 'model42886f392dc0_c764d7e8944070096c42fd91a779c49f' at line 119)

Maybe it should be just Phi then. Logit and inverse-CDF are a bit confusing :)

Edit: Just tried. It should be just Phi.
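To keep the naming straight in R terms: plogis() and pnorm() are the CDFs, while qlogis() and qnorm() are the inverse CDFs (quantile functions). Stan’s Phi corresponds to pnorm() and inv_Phi to qnorm().

plogis(0)  # 0.5 -> Stan's inv_logit (logistic CDF)
pnorm(0)   # 0.5 -> Stan's Phi (standard normal CDF)
qnorm(0.5) # 0   -> Stan's inv_Phi (normal quantile function)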

Thanks Staffan, that seems to run properly. I’m not totally sure how solid a prior it is in this case, at least for my purposes. When I run it as a prior predictive check, it entertains some really quite insane values, with a bulk ESS of 7 and an Rhat of 2.6; it was entertaining cutpoints of several billion on the probit scale, which is implausible for most analyses and definitely for mine. I will probably stick with my more understandable priors until I better understand what is going on with this one!

Thanks for your help getting the model to run with this custom prior though.

I think that might be due to the anchor point mean(Intercept) (which I thought was sensible), which I guess is unidentifiable without any data. Try setting it to 0 for the prior predictive check. Or maybe try putting a prior on it (I am not sure how to do that right now). Maybe @betanalpha has some input?
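Something like this for the prior predictive check, reusing the toy mtcars example from above with the probit-adapted stanvar (a sketch; sample_prior = "only" drops the likelihood, so the draws come from the prior alone):

prior_check <- brm(
  formula = cyl ~ 1 + mpg,
  data = mtcars,
  family = cumulative(probit),
  stanvars = dirichlet_prior_stanvar,
  prior = set_prior("induced_dirichlet(rep_vector(1, nthres+1), 0)",
                    class = "Intercept"),
  sample_prior = "only"
)
pp_check(prior_check, type = "bars")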

Thanks @StaffanBetner, I’ll give that a try.

I should add that phi = 0 seems to shrink all thresholds a bit towards 0, which may or may not be desirable.


If you change

sigma = inv_logit(phi - c);

to

sigma = Phi(phi - c);

I think you would also need to change the diagonal entries of the Jacobian, since

real rho = sigma[k - 1] * (1 - sigma[k - 1]);

is specific to logit (see Betancourt’s case study).

rho is the derivative of the CDF of the logistic distribution. You could also use this with Phi_approx(x) = inv_logit(0.07056x^3 + 1.5976x) (or the less precise approximation inv_logit(1.702x)) by taking the derivative to get rho.
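A quick numerical check of that chain-rule derivative (an R sketch over a made-up grid):

x <- seq(-3, 3, by = 0.5)
u <- 0.07056 * x^3 + 1.5976 * x
# derivative of inv_logit(u(x)) by the chain rule
rho_approx <- plogis(u) * (1 - plogis(u)) * (3 * 0.07056 * x^2 + 1.5976)
max(abs(rho_approx - dnorm(x)))  # small, so it tracks the normal PDF closely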


You are right.

To be clear about all relationships in conventional notation (i.e. f(x), F(x) and F^{-1}(x)) to avoid any confusion:

- CDF of the latent distribution (here the logistic distribution): F(x)=\sigma(x)=\frac{1}{1+e^{-x}}=logit^{-1}(x)
- Quantile function (inverse CDF): F^{-1}(x)=\lambda(x)=\sigma^{-1}(x)=\log \left(\frac{x}{1-x}\right)=logit(x) (not used in the code)
- Probability density function: f(x)=\rho(x)=\frac{\mathrm{d} \sigma}{\mathrm{d} x}(x)=\sigma(x) \cdot(1-\sigma(x))=logit^{-1}(x)\cdot (1-logit^{-1}(x))
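These identities are easy to verify in R, where plogis(), qlogis(), and dlogis() are F, F^{-1}, and f for the logistic distribution:

x <- seq(-4, 4, by = 0.5)
all.equal(qlogis(plogis(x)), x)                   # F^{-1}(F(x)) = x
all.equal(dlogis(x), plogis(x) * (1 - plogis(x))) # f(x) = F(x) * (1 - F(x))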

So a correct, more general version of @betanalpha’s code would be:

real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] anchoredcutoffs = phi - c;
    // sigma is the CDF of the latent distribution
    // logit:
    vector[K - 1] sigma = inv_logit(anchoredcutoffs); 
    // probit:
    // vector[K - 1] sigma = Phi(anchoredcutoffs);
    // or: 
    // vector[K - 1] sigma = Phi_approx(anchoredcutoffs);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);
    
    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];
    
    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;
    
    // Diagonal entries of Jacobian
    for (k in 2:K) {
      // rho is the PDF of the latent distribution
      // logit:
      real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      // probit (note the indexing: std_normal_lpdf applied to the whole
      // vector would return the sum of the log densities):
      // real rho = exp(std_normal_lpdf(anchoredcutoffs[k - 1]));
      // (for Phi_approx, use the chain-rule derivative discussed above)
      J[k, k] = - rho;
      J[k - 1, k] = rho;
    }
    
    return dirichlet_lpdf(p | alpha)
           + log_determinant(J);
  }

EDIT: The scaling seems to be off compared to the default t-distribution prior, which often seems to perform better when comparing elpd. Any ideas what is going on?


Yeah, with an ordinal regression a common intercept is not identifiable relative to the cut points. I would just set the anchor point to zero and then things should behave well. Also, larger “alpha” values in the Dirichlet prior specification will help contain the cut points better; the induced Dirichlet lets you think about the reasonable probabilities, but you are still responsible for setting a sufficiently well-informed prior for those probabilities!
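In terms of the function above, that just means a larger alpha argument in the prior string, e.g. (alpha = 5 purely as an illustration, not a recommendation):

prior_alpha5 <- set_prior(
  "induced_dirichlet(rep_vector(5, nthres+1), 0)",
  class = "Intercept"
)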


The “intercepts” here are the cutpoints, so the idea is that using the mean of the cutpoints as the anchor point would give a better fit. I have experimented a little with this on some simple models, and the best approach w.r.t. elpd (albeit only slightly better) seems to be not to fix the anchor point but to estimate it.

The anchor point is arbitrary provided that the cut points and the latent contribution are all interpreted relative to the same anchor point. Having the anchor point be a function of the parameters, especially the cut points, requires care because that interpretation will then change along with the parameter configuration. I won’t go into my many reservations about estimated elpd in general, but I strongly suspect that any difference has more to do with estimator error than with any meaningful difference between the models.


Really pleased to see this thread on the discourse, and an example of just how flexible brms can be. I have a similar problem where I’d like to set Dirichlet priors on parameters in a linear model (i.e. not on the intercepts). I opened another thread about this a while ago without success: Dirichlet prior possible with stanvars?.

@StaffanBetner, any idea how I might adapt your code?

I answered in the original thread!


Hey @StaffanBetner, I’ve been practicing with your code for cumulative ordinal models, setting phi to either mean(Intercept) or 0. Either way, the fit seems substantially worse than with the default prior. Where am I going wrong?

Here’s the model set up:

# load packages 
library(tidyverse)
library(brms)
library(patchwork)

# define dirichlet_prior for the probit approach
dirichlet_prior <- "
real induced_dirichlet_lpdf(vector c, vector alpha, real phi) {
    int K = num_elements(c) + 1;
    vector[K - 1] anchoredcutoffs = phi - c;
    // sigma is the CDF of the latent distribution
    // logit:
    // vector[K - 1] sigma = inv_logit(anchoredcutoffs); 
    // probit:
    vector[K - 1] sigma = Phi(anchoredcutoffs);
    // or: 
    // vector[K - 1] sigma = Phi_approx(anchoredcutoffs);
    vector[K] p;
    matrix[K, K] J = rep_matrix(0, K, K);
    
    // Induced ordinal probabilities
    p[1] = 1 - sigma[1];
    for (k in 2:(K - 1))
      p[k] = sigma[k - 1] - sigma[k];
    p[K] = sigma[K - 1];
    
    // Baseline column of Jacobian
    for (k in 1:K) J[k, 1] = 1;
    
    // Diagonal entries of Jacobian
    for (k in 2:K) {
      // rho is the PDF of the latent distribution
      // logit:
      // real rho = sigma[k - 1] * (1 - sigma[k - 1]);
      // probit (indexed so that rho is the density at the relevant cutpoint,
      // not the sum of the log densities over the whole vector):
      real rho = exp(std_normal_lpdf(anchoredcutoffs[k - 1]));
      J[k, k] = - rho;
      J[k - 1, k] = rho;
    }
    
    return dirichlet_lpdf(p | alpha)
           + log_determinant(J);
}
"

# define the stanvar()
dirichlet_prior_stanvar <- stanvar(scode = dirichlet_prior, block = "functions")

# fit the conventional cumulative probit model
test_standard <- brm(
  formula = cyl ~ 1 + mpg,
  data = mtcars, 
  family = cumulative(probit), 
  cores = 4, seed = 4
  )

# fit the model with induced_dirichlet() and phi = mean(Intercept)
test_dirichlet_mean_intercept <- brm(
  formula = cyl ~ 1 + mpg,
  data = mtcars, 
  family = cumulative(probit),
  stanvars = dirichlet_prior_stanvar, 
  prior = set_prior("induced_dirichlet(rep_vector(1, nthres+1), mean(Intercept))", class = "Intercept"),
  cores = 4, seed = 4
  )

# fit the model with induced_dirichlet() and phi = 0
test_dirichlet_zero <- brm(
  formula = cyl ~ 1 + mpg,
  data = mtcars, 
  family = cumulative(probit),
  stanvars = dirichlet_prior_stanvar, 
  prior = set_prior("induced_dirichlet(rep_vector(1, nthres+1), 0)", class = "Intercept"),
  cores = 4, seed = 4
  )

A few quick plots suggest the default model is much better.

ce <- conditional_effects(test_standard)
p1 <- plot(ce, points = TRUE, plot = FALSE)[[1]] +
  labs(subtitle = "default t prior")

ce <- conditional_effects(test_dirichlet_mean_intercept)
p2 <- plot(ce, points = TRUE, plot = FALSE)[[1]] +
  labs(subtitle = "dirichlet w/ phi = mean(Intercept)")

ce <- conditional_effects(test_dirichlet_zero)
p3 <- plot(ce, points = TRUE, plot = FALSE)[[1]] +
  labs(subtitle = "dirichlet w/ phi = 0")

# combine
p1 | p2 | p3

The LOO confirms the plots.

test_standard <- add_criterion(test_standard, criterion = c("loo", "waic"))
test_dirichlet_mean_intercept <- add_criterion(test_dirichlet_mean_intercept, criterion = c("loo", "waic"))
test_dirichlet_zero <- add_criterion(test_dirichlet_zero, criterion = c("loo", "waic"))

loo_compare(test_standard, test_dirichlet_mean_intercept, test_dirichlet_zero) %>% print(simplify = F)
                              elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
test_standard                   0.0       0.0   -14.8      2.6         2.4   0.5     29.6   5.2   
test_dirichlet_mean_intercept -19.4       3.6   -34.2      4.2         2.2   0.3     68.4   8.4   
test_dirichlet_zero           -25.2       3.1   -40.0      3.3         2.3   0.4     79.9   6.6   

Where have I gone wrong?
