Understanding the disc parameter in ordered logit models

jackbailey · May 22, 2019, 10:45am

Hi everyone,

As part of my PhD research, I work with a lot of ordinal variables. As such, I was really pleased to see that @paul.buerkner and @matti’s recent tutorial paper on the matter. One of the key points they make is about modelling the variance of the latent Y distribution using the disc parameter in brms.

I have a few of questions about interpreting unequal variances. If anyone would be kind enough to answer, I’d really appreciate it.

First, how is disc/latent variance related to the thresholds? Using the stem cell example in the paper, if we expect liberals to have less variation than moderates, my impression would be that we would expect the thresholds for the liberals to be closer together than those for the moderates. Is this true?

Second, if so, what is the difference between a model with varying thresholds, e.g. bf(rating ~ 1 + (1 | belief)), and a model with unequal variances, e.g. bf(rating ~ 1 + belief) + lf(disc ~ 0 + belief, cmc = FALSE)?

Finally, in terms of notation, is there any convention around how we might write out the linear predictor components of the latent variance parameter? Imagine, for instance, that we have a simple experiment with a treatment T and a cumulative ordinal attitudinal outcome on a 1-5 scale, O. My impression is that the model assuming equal variance would be written something like:

O_{i} \sim \text{Cumulative}(\eta_{i}, \tau) \\ \eta_{i} = \beta T_{i} \\ \beta \sim Normal(0, 1) \\ \tau_t \sim \text{Normal}(0, 1.5) \text{, for } t=1..4

But this notation doesn’t make explicit the latent scale. As such, I’m not sure how one would extend it to including linear predictors of the latent variance parameter. Any advice? Sorry is this last point is banal – I have to teach myself this stuff in my own time and I’m aware that I need to up my notation game.

paul.buerkner · May 24, 2019, 3:30am

First, how is disc /latent variance related to the thresholds? Using the stem cell example in the paper, if we expect liberals to have less variation than moderates, my impression would be that we would expect the thresholds for the liberals to be closer together than those for the moderates. Is this true?

If everything else is the same, yes. That is, if they otherwise (except for the variance) have the same distribution.

Second, if so, what is the difference between a model with varying thresholds, e.g. bf(rating ~ 1 + (1 | belief)) , and a model with unequal variances, e.g. bf(rating ~ 1 + belief) + lf(disc ~ 0 + belief, cmc = FALSE) ?

(1 | belief) will not imply a set of separate thresholds for the belief groups. Rather the same constant (varying across belief groups) will be added to all thresholds. This just implies a shift in the thresholds without changing how close they are.

With regard to the notation: The cumulative distribution always has to be coupled with a latent continuous distribution F, whose inverse I call the link function of the cumulative family (see the paper for details). What exactly the variance is and how disc relates to depends on F. In the paper we have an example with F being the standard normal distribution in which case disc = 1 / latentSD.

jackbailey · May 24, 2019, 7:10am

Thanks a lot, Paul. That’s helped to clarify a few things!

Solomon · January 22, 2020, 5:30pm

Hey y’all,

Thank you for this thread. I’ve been working on fitting ordered probit models for a while and would appreciate a little peer review to make sure my understanding is correct. Say I have an ordinal variable y with K categories. I’d like to fit a cumulative model it with the probit link,

p(y = k | \mu, \sigma, \{ \tau_1,...,\tau_{K - 1} \}) = \Phi((\tau_k - \mu) / \sigma) - \Phi((\tau_{k - 1} - \mu) / \sigma),

where \mu is the mean, \sigma is the standard deviation, \{\tau_1,...,\tau_{K - 1} \} are the thresholds, and \Phi(y) cumulative normal distribution function. As the model is undetermined, brms solves the problem by using the standardized cumulative normal \Phi(z) (i.e., \mu = 0 and \sigma = 1). Presuming no predictors, one might rewrite the model using \Phi(z) as

p(y = k | 0, 1, \{\tau_1,...,\tau_{K - 1} \}) = \Phi(\tau_k) - \Phi(\tau_{k - 1}).

I know I’m being a little verbose by retaining the constants for \Phi(z) in the left side of the equation, but I’m building. Anyway, I might fit the mode with brms like this.

fit1 <- brm(data = data,
            family = cumulative(probit),
            y ~ 1,
            prior(normal(0, 4), class = Intercept))

As the only parameters being estimated in the model are the thresholds, I believe I could describe my priors as

\tau_1,...,\tau_{K - 1} \sim \operatorname{Normal} (0, 4).

Does this all sound correct, so far?

matti · January 23, 2020, 5:56pm

Yes, you are estimating (number of response categories) - 1 thresholds, each with N(0, 4) prior. When in doubt, check out:

library(brms)
make_stancode(
  data = data.frame(y = c(1,2,3,4)),
  family = cumulative(probit),
  y ~ 1,
  prior(normal(0, 4), class = Intercept)
)

// generated with brms 2.11.1
{a bunch of stuff clipped out in what follows}
data {
  // nthres is 3, check with make_standata()
  int<lower=2> nthres;  // number of thresholds
}
parameters {
  // temporary thresholds for centered predictors
  ordered[nthres] Intercept;
}
model {
   // priors including all constants--see here
  target += normal_lpdf(Intercept | 0, 4);
}

Solomon · January 23, 2020, 6:15pm

This is helpful. Thank you @matti! Let me build further.

Let’s say I want to add a single two-category nominal predictor for the latent \mu, called \text{group}. Here \text{group} is dummy coded 0/1. The brms code could be:

fit2 <- brm(data = data,
            family = cumulative(probit),
            y ~ 1 + group,
            prior = c(prior(normal(0, 4), class = Intercept),
                      prior(normal(0, 2), class = b)

I believe I could express the corresponding statistical model as

\begin{align*} p(y = k | \mu, 1, \{ \theta_j \}) & = \Phi(\theta_k - \mu) - \Phi(\theta_{k - 1} - \mu) \\ \mu & = \beta_0 + \beta_1 \text{group} \\ \beta_0 & = 0 \\ \beta_1 & \sim \operatorname{Normal} (0, 2) \\ \theta_1,...,\theta_{K - 1} & \sim \operatorname{Normal} (0, 4). \end{align*}

With this parameterization, the underlying Gaussians for all levels of \text{group} have a standard deviation of 1. The mean for the reference category for \text{group} is still fixed to 0. As a consequence, \beta_1 is in standardized-mean difference metric, analogous to a Cohen’s d.

Still correct?

matti · January 23, 2020, 6:52pm

With this parameterization, the underlying Gaussians for all levels of group have a standard deviation of 1.

Yes.

The mean for the reference category for group is still fixed to 0.

Yes.

As a consequence, β1 is in standardized-mean difference metric, analogous to a Cohen’s d .

Yes, analogous, but there are subtle differences to cohens d. I think cohens d is typically calculated (I know little about it) with some pooled standard deviation, whereas b1 here is on the scale of the latent variable for the reference group. So if you have modeled a different scale for the latent variable for group=1, then b1 is still the distance in latent means on the scale of the latent variable for group=0. You can of course get a difference in means standardized to some pooled SD by applying computations to the posterior samples, after fitting the model.

Solomon · January 23, 2020, 7:25pm

Great. This is all groundwork to my actual question, which it appears you’re already anticipating.

Now imagine we expand fit2 to allow the variances to differ by \text{group}. Based on your tutorial paper with @paul.buerkner, the brms code could be:

fit3 <- brm(data = data,
            family = cumulative(probit),
            bf(Y ~ 1 + group) +
              lf(disc ~ 0 + group, cmc = FALSE),
            prior = c(prior(normal(0, 4), class = Intercept),
                      prior(normal(0, 2), class = b),
                      prior(normal(0, 1), class = b, dpar = disc))

I’m not sure what the typical Greek notation for the discrimination parameter is, so I’m just going to follow brms notation and call it disc. I believe I could express the corresponding statistical model as

\begin{align*} p(y = k | \mu, \sigma, \{ \theta_j \}) & = \Phi((\theta_k - \mu)/ \sigma) - \Phi((\theta_{k - 1} - \mu)/ \sigma) \\ \mu & = \beta_0 + \beta_1 \text{group} \\ \beta_0 & = 0 \\ \beta_1 & \sim \operatorname{Normal} (0, 2) \\ \sigma & = 1 / disc \\ \log (disc) & = \gamma_{[\text{group}]} \\ \gamma_{[\text{group = A}]} & = 0 \\ \gamma_{[\text{group = B}]} & \sim \operatorname{Normal} (0, 1) \\ \theta_1,...,\theta_{K - 1} & \sim \operatorname{Normal} (0, 4). \end{align*}

In this way, \text{group = A} is still defined as having a standard normal distribution. \text{group = B} now has a mean defined in terms of its deviation from 0 and with a standard deviation of 1 / \exp( \log (disc)). As you say, it is no longer the case that \beta_1 is in a standardized-mean difference metric because we no longer have a common standard deviation for the two levels of \text{group}.

Correct?

Also, given we have two scales for \sigma, how does one interpret the \tau parameters? Are they still based on the \Phi(z) for both levels of \text{group} or are they interpreted conditional on the scale for each level of \text{group} somehow?

matti · January 23, 2020, 8:07pm

As you say, it is no longer the case that β1 is in a standardized-mean difference metric because we no longer have a common standard deviation for the two levels of group .
Correct?

I guess you can still call it a standardized mean difference metric. (What does that mean anyway?) But more precisely, it is the difference in means of the latent variable in terms of the SD of the reference group’s latent variable (1). When in doubt; conditional_effects(model), and take a look at how the response proportions differ across groups–that usually makes your effects really clear to interpret.

Also, given we have two scales for σ , how does one interpret the τ parameters? Are they still based on the Φ(z) for both levels of group or are they interpreted conditional on the scale for each level of group somehow?

The thresholds are also on the scale of the reference group’s latent variable. There is now an option in brms to model different thresholds for groups, but I am not familiar with how it works, and if and how it would impact this issue.

In general, you can take a look at

make_stancode(
  data = data.frame(group = c(rep(0, 4), rep(1, 4)), Y = c(1,2,3,4,3,4,4,4)),
  family = cumulative(probit),
  formula = bf(Y ~ 1 + group) +
    lf(disc ~ 0 + group, cmc = FALSE),
  prior = c(prior(normal(0, 4), class = Intercept),
            prior(normal(0, 2), class = b),
            prior(normal(0, 1), class = b, dpar = disc))
)

together with

make_standata(
    data = data.frame(group = c(rep(0, 4), rep(1, 4)), Y = c(1,2,3,4,3,4,4,4)),
    family = cumulative(probit),
    formula = bf(Y ~ 1 + group) +
      lf(disc ~ 0 + group, cmc = FALSE),
    prior = c(prior(normal(0, 4), class = Intercept),
              prior(normal(0, 2), class = b),
              prior(normal(0, 1), class = b, dpar = disc))
)

to verify that your model equations are correct (and generally to see how things are working.) But of course this assumes you are familiar with Stan code.

I hope that helps… Ordinal models can be very tricky to interpret if you’re not used to them (I am not sure I still am).

Solomon · January 23, 2020, 9:40pm

Yeah, I’m not fluent in Stan code, which limits how much I can glean from make_standata(). But thank you for the responses. This has been helpful.

jackbailey · January 24, 2020, 7:53am

For anyone who stumbles upon this thread:

I wrote a function that takes arbitrary mean, discrimination, and threshold values and outputs a 5-point Likert-type scale. I made it to help me understand what was going on, perhaps it will help others too.

Code is below and, as always, keen to hear how I can replace that awful for loop.

rlikert <- function(n = 1e3,
                    t1 = -0.84,
                    t2 = -0.25,
                    t3 = 0.25,
                    t4 = 0.84,
                    mean = 0,
                    disc = 0){

# Load tidyverse
require(tidyverse)

# Calculate standard deviation from discrimination
sd <- 1/exp(disc)

# Create empty dataframe
dta <- tibble(n = 1:n)

# Calculate probabilities of each response for each individual
dta <- 
    dta %>% 
    mutate(
      p1 = pnorm(t1, mean, sd),
      p2 = pnorm(t2, mean, sd) - pnorm(t1, mean, sd),
      p3 = pnorm(t3, mean, sd) - pnorm(t2, mean, sd),
      p4 = pnorm(t4, mean, sd) - pnorm(t3, mean, sd),
      p5 = 1 - pnorm(t4, mean, sd)
    )

  # For each individual, sample a single response category
  dta <- dta %>% mutate(resp = NA)

  for (i in 1:nrow(dta)) {
    dta$resp[i] <- 
      sample(
        x = c(1:5), 
        size = 1, 
        prob = c(dta$p1[i], dta$p2[i], dta$p3[i], dta$p4[i], dta$p5[i])
      )
  }

  # Return outcome scale
  dta$resp
}

Topic		Replies	Views
Intraclass correlation for ordered logit models in brms Modeling techniques , interpret-results , brms	2	1168	January 26, 2022
Cannot specify `disc` in ordinal 2PL model: variable `disc` cannot be found in the data brms specification	10	667	March 9, 2021
Get and interpret thresholds in a Cumulative probit model: the case of a ordinal predicted variable and a metric predictor General techniques , brms	2	572	February 11, 2023
Scale mismatch in the 'disc' Parameter between prior and data for Ordinal IRT Models in brms Modeling specification , brms	3	51	March 24, 2025
The Heteroskedastic Probit Model in brms (w/ Bernoulli or Binomial) brms binomial-response	1	775	January 21, 2021

Understanding the disc parameter in ordered logit models

Related topics