Understanding the disc parameter in ordered logit models

Hi everyone,

As part of my PhD research, I work with a lot of ordinal variables. As such, I was really pleased to see that @paul.buerkner and @matti’s recent tutorial paper on the matter. One of the key points they make is about modelling the variance of the latent Y distribution using the disc parameter in brms.

I have a few of questions about interpreting unequal variances. If anyone would be kind enough to answer, I’d really appreciate it.

First, how is disc/latent variance related to the thresholds? Using the stem cell example in the paper, if we expect liberals to have less variation than moderates, my impression would be that we would expect the thresholds for the liberals to be closer together than those for the moderates. Is this true?

Second, if so, what is the difference between a model with varying thresholds, e.g. bf(rating ~ 1 + (1 | belief)), and a model with unequal variances, e.g. bf(rating ~ 1 + belief) + lf(disc ~ 0 + belief, cmc = FALSE)?

Finally, in terms of notation, is there any convention around how we might write out the linear predictor components of the latent variance parameter? Imagine, for instance, that we have a simple experiment with a treatment T and a cumulative ordinal attitudinal outcome on a 1-5 scale, O. My impression is that the model assuming equal variance would be written something like:

O_{i} \sim \text{Cumulative}(\eta_{i}, \tau) \\ \eta_{i} = \beta T_{i} \\ \beta \sim Normal(0, 1) \\ \tau_t \sim \text{Normal}(0, 1.5) \text{, for } t=1..4

But this notation doesn’t make explicit the latent scale. As such, I’m not sure how one would extend it to including linear predictors of the latent variance parameter. Any advice? Sorry is this last point is banal – I have to teach myself this stuff in my own time and I’m aware that I need to up my notation game.

First, how is disc /latent variance related to the thresholds? Using the stem cell example in the paper, if we expect liberals to have less variation than moderates, my impression would be that we would expect the thresholds for the liberals to be closer together than those for the moderates. Is this true?

If everything else is the same, yes. That is, if they otherwise (except for the variance) have the same distribution.

Second, if so, what is the difference between a model with varying thresholds, e.g. bf(rating ~ 1 + (1 | belief)) , and a model with unequal variances, e.g. bf(rating ~ 1 + belief) + lf(disc ~ 0 + belief, cmc = FALSE) ?

(1 | belief) will not imply a set of separate thresholds for the belief groups. Rather the same constant (varying across belief groups) will be added to all thresholds. This just implies a shift in the thresholds without changing how close they are.

With regard to the notation: The cumulative distribution always has to be coupled with a latent continuous distribution F, whose inverse I call the link function of the cumulative family (see the paper for details). What exactly the variance is and how disc relates to depends on F. In the paper we have an example with F being the standard normal distribution in which case disc = 1 / latentSD.

1 Like

Thanks a lot, Paul. That’s helped to clarify a few things!

Hey y’all,

Thank you for this thread. I’ve been working on fitting ordered probit models for a while and would appreciate a little peer review to make sure my understanding is correct. Say I have an ordinal variable y with K categories. I’d like to fit a cumulative model it with the probit link,

p(y = k | \mu, \sigma, \{ \tau_1,...,\tau_{K - 1} \}) = \Phi((\tau_k - \mu) / \sigma) - \Phi((\tau_{k - 1} - \mu) / \sigma),

where \mu is the mean, \sigma is the standard deviation, \{\tau_1,...,\tau_{K - 1} \} are the thresholds, and \Phi(y) cumulative normal distribution function. As the model is undetermined, brms solves the problem by using the standardized cumulative normal \Phi(z) (i.e., \mu = 0 and \sigma = 1). Presuming no predictors, one might rewrite the model using \Phi(z) as

p(y = k | 0, 1, \{\tau_1,...,\tau_{K - 1} \}) = \Phi(\tau_k) - \Phi(\tau_{k - 1}).

I know I’m being a little verbose by retaining the constants for \Phi(z) in the left side of the equation, but I’m building. Anyway, I might fit the mode with brms like this.

fit1 <- brm(data = data,
            family = cumulative(probit),
            y ~ 1,
            prior(normal(0, 4), class = Intercept))

As the only parameters being estimated in the model are the thresholds, I believe I could describe my priors as

\tau_1,...,\tau_{K - 1} \sim \operatorname{Normal} (0, 4).

Does this all sound correct, so far?

Yes, you are estimating (number of response categories) - 1 thresholds, each with N(0, 4) prior. When in doubt, check out:

library(brms)
make_stancode(
  data = data.frame(y = c(1,2,3,4)),
  family = cumulative(probit),
  y ~ 1,
  prior(normal(0, 4), class = Intercept)
)
// generated with brms 2.11.1
{a bunch of stuff clipped out in what follows}
data {
  // nthres is 3, check with make_standata()
  int<lower=2> nthres;  // number of thresholds
}
parameters {
  // temporary thresholds for centered predictors
  ordered[nthres] Intercept;
}
model {
   // priors including all constants--see here
  target += normal_lpdf(Intercept | 0, 4);
}
1 Like

This is helpful. Thank you @matti! Let me build further.

Let’s say I want to add a single two-category nominal predictor for the latent \mu, called \text{group}. Here \text{group} is dummy coded 0/1. The brms code could be:

fit2 <- brm(data = data,
            family = cumulative(probit),
            y ~ 1 + group,
            prior = c(prior(normal(0, 4), class = Intercept),
                      prior(normal(0, 2), class = b)

I believe I could express the corresponding statistical model as

\begin{align*} p(y = k | \mu, 1, \{ \theta_j \}) & = \Phi(\theta_k - \mu) - \Phi(\theta_{k - 1} - \mu) \\ \mu & = \beta_0 + \beta_1 \text{group} \\ \beta_0 & = 0 \\ \beta_1 & \sim \operatorname{Normal} (0, 2) \\ \theta_1,...,\theta_{K - 1} & \sim \operatorname{Normal} (0, 4). \end{align*}

With this parameterization, the underlying Gaussians for all levels of \text{group} have a standard deviation of 1. The mean for the reference category for \text{group} is still fixed to 0. As a consequence, \beta_1 is in standardized-mean difference metric, analogous to a Cohen’s d.

Still correct?

With this parameterization, the underlying Gaussians for all levels of group have a standard deviation of 1.

Yes.

The mean for the reference category for group is still fixed to 0.

Yes.

As a consequence, β1 is in standardized-mean difference metric, analogous to a Cohen’s d .

Yes, analogous, but there are subtle differences to cohens d. I think cohens d is typically calculated (I know little about it) with some pooled standard deviation, whereas b1 here is on the scale of the latent variable for the reference group. So if you have modeled a different scale for the latent variable for group=1, then b1 is still the distance in latent means on the scale of the latent variable for group=0. You can of course get a difference in means standardized to some pooled SD by applying computations to the posterior samples, after fitting the model.

1 Like

Great. This is all groundwork to my actual question, which it appears you’re already anticipating.

Now imagine we expand fit2 to allow the variances to differ by \text{group}. Based on your tutorial paper with @paul.buerkner, the brms code could be:

fit3 <- brm(data = data,
            family = cumulative(probit),
            bf(Y ~ 1 + group) +
              lf(disc ~ 0 + group, cmc = FALSE),
            prior = c(prior(normal(0, 4), class = Intercept),
                      prior(normal(0, 2), class = b),
                      prior(normal(0, 1), class = b, dpar = disc))

I’m not sure what the typical Greek notation for the discrimination parameter is, so I’m just going to follow brms notation and call it disc. I believe I could express the corresponding statistical model as

\begin{align*} p(y = k | \mu, \sigma, \{ \theta_j \}) & = \Phi((\theta_k - \mu)/ \sigma) - \Phi((\theta_{k - 1} - \mu)/ \sigma) \\ \mu & = \beta_0 + \beta_1 \text{group} \\ \beta_0 & = 0 \\ \beta_1 & \sim \operatorname{Normal} (0, 2) \\ \sigma & = 1 / disc \\ \log (disc) & = \gamma_{[\text{group}]} \\ \gamma_{[\text{group = A}]} & = 0 \\ \gamma_{[\text{group = B}]} & \sim \operatorname{Normal} (0, 1) \\ \theta_1,...,\theta_{K - 1} & \sim \operatorname{Normal} (0, 4). \end{align*}

In this way, \text{group = A} is still defined as having a standard normal distribution. \text{group = B} now has a mean defined in terms of its deviation from 0 and with a standard deviation of 1 / \exp( \log (disc)). As you say, it is no longer the case that \beta_1 is in a standardized-mean difference metric because we no longer have a common standard deviation for the two levels of \text{group}.

Correct?

Also, given we have two scales for \sigma, how does one interpret the \tau parameters? Are they still based on the \Phi(z) for both levels of \text{group} or are they interpreted conditional on the scale for each level of \text{group} somehow?

As you say, it is no longer the case that β1 is in a standardized-mean difference metric because we no longer have a common standard deviation for the two levels of group .
Correct?

I guess you can still call it a standardized mean difference metric. (What does that mean anyway?) But more precisely, it is the difference in means of the latent variable in terms of the SD of the reference group’s latent variable (1). When in doubt; conditional_effects(model), and take a look at how the response proportions differ across groups–that usually makes your effects really clear to interpret.

Also, given we have two scales for σ , how does one interpret the τ parameters? Are they still based on the Φ(z) for both levels of group or are they interpreted conditional on the scale for each level of group somehow?

The thresholds are also on the scale of the reference group’s latent variable. There is now an option in brms to model different thresholds for groups, but I am not familiar with how it works, and if and how it would impact this issue.

In general, you can take a look at

make_stancode(
  data = data.frame(group = c(rep(0, 4), rep(1, 4)), Y = c(1,2,3,4,3,4,4,4)),
  family = cumulative(probit),
  formula = bf(Y ~ 1 + group) +
    lf(disc ~ 0 + group, cmc = FALSE),
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 2), class = b),
            prior(normal(0, 1), class = b, dpar = disc))
)

together with

make_standata(
    data = data.frame(group = c(rep(0, 4), rep(1, 4)), Y = c(1,2,3,4,3,4,4,4)),
    family = cumulative(probit),
    formula = bf(Y ~ 1 + group) +
      lf(disc ~ 0 + group, cmc = FALSE),
    prior = c(prior(normal(0, 4), class = Intercept),
              prior(normal(0, 2), class = b),
              prior(normal(0, 1), class = b, dpar = disc))
)

to verify that your model equations are correct (and generally to see how things are working.) But of course this assumes you are familiar with Stan code.

I hope that helps… Ordinal models can be very tricky to interpret if you’re not used to them (I am not sure I still am).

1 Like

Yeah, I’m not fluent in Stan code, which limits how much I can glean from make_standata(). But thank you for the responses. This has been helpful.

1 Like

For anyone who stumbles upon this thread:

I wrote a function that takes arbitrary mean, discrimination, and threshold values and outputs a 5-point Likert-type scale. I made it to help me understand what was going on, perhaps it will help others too.

Code is below and, as always, keen to hear how I can replace that awful for loop.

rlikert <- function(n = 1e3,
                    t1 = -0.84,
                    t2 = -0.25,
                    t3 = 0.25,
                    t4 = 0.84,
                    mean = 0,
                    disc = 0){

# Load tidyverse
require(tidyverse)

# Calculate standard deviation from discrimination
sd <- 1/exp(disc)

# Create empty dataframe
dta <- tibble(n = 1:n)

# Calculate probabilities of each response for each individual
dta <- 
    dta %>% 
    mutate(
      p1 = pnorm(t1, mean, sd),
      p2 = pnorm(t2, mean, sd) - pnorm(t1, mean, sd),
      p3 = pnorm(t3, mean, sd) - pnorm(t2, mean, sd),
      p4 = pnorm(t4, mean, sd) - pnorm(t3, mean, sd),
      p5 = 1 - pnorm(t4, mean, sd)
    )

  # For each individual, sample a single response category
  dta <- dta %>% mutate(resp = NA)

  for (i in 1:nrow(dta)) {
    dta$resp[i] <- 
      sample(
        x = c(1:5), 
        size = 1, 
        prob = c(dta$p1[i], dta$p2[i], dta$p3[i], dta$p4[i], dta$p5[i])
      )
  }

  # Return outcome scale
  dta$resp
}
1 Like