Power Analysis For Ordinal Regressions

Hello,

I am interested in examining the differences between the Likert-scale ratings of four groups by using an ordinal logistic regression (scale score ~ 1 + group). Having looked at @Solomon's blog posts on such a workflow, I understand that the basic approach is to simulate from the underlying distribution of interest and vary sample sizes to see the precision with which a regression recovers the pertinent effect of interest.

This seems straightforward for most other variable types (and Solomon has extensively provided neat workflows for these), but I am quite clueless about how to do this for Likert-scale responses. How do you simulate such observations? How do you specify the effect in terms of an odds ratio? Or do you specify it on the latent scale of the ordinal variable?

If anyone has a worked example/workflow/scripts/papers/blog posts that do this, it would be MUCH MUCH appreciated. @matti @paul.buerkner @Erik_Ringen

I intended to write up a post on power analyses for ordinal data, but never got around to it. This was in part because I didn’t have a pressing need for the post, but also in part because it’d be a pain in the neck.

To get you started and/or get the conversation rolling along on this thread, here's a quick way to simulate ordinal data. Say you have Likert-type ratings of 1 through 5. You want to assign each rating a probability, and the probabilities of the five ratings should sum to one. Since that can be tricky, I like to use an intermediary step of setting a weight for each option, and then converting those weights to probabilities. Below I use the weights to describe a triangular distribution, where the middle rating is the most probable, with diminishing probabilities as you move away from the middle. However, one could define any arbitrary distribution with the weights; e.g., rep(1, times = 5) for a uniform distribution. I show how to do that with the d_weight data frame. You can then use those probabilities to sample from the Likert-type ratings with the sample() function.

# Load
library(tidyverse)

# Make the rating probabilities
d_weight <- data.frame(ordinal_rating = 1:5) |> 
  mutate(weight = c(1:3, 2, 1)) |> 
  mutate(probability = weight / sum(weight))

# Define your desired sample size
n <- 100

# Sample ordinal ratings
sample(x = 1:5, size = n, replace = TRUE, prob = pull(d_weight, probability))

Note how in this context the item probabilities are fixed. If you wanted to plan on comparing them between two groups, you could simply make two versions of d_weight, one for each group. But even that wouldn’t directly correspond to the typical equation in an ordinal model. That’s where the real headache begins… Very doable, but it’d take a lot of trial and error, and a keen understanding of the equations implied by your intended ordinal model.
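To hint at that correspondence, here's one useful piece: for a single group, the intercepts (thresholds) a cumulative logit model targets are just the log-odds of the cumulative probabilities, which you can compute directly from the probabilities above with qlogis():

```r
# Probabilities from the triangular weights above
p <- c(1, 2, 3, 2, 1) / 9

# Cumulative probability of responding at or below each of the first
# four ratings (the fifth is 1 by definition)
cum_p <- cumsum(p)[1:4]

# The implied thresholds (brms's Intercept[1] through Intercept[4])
# on the log-odds scale
round(qlogis(cum_p), 2)
#> [1] -2.08 -0.69  0.69  2.08
```

Fit an intercept-only cumulative("logit") model to data sampled with these probabilities and the four Intercept estimates should land near those values.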


Thanks so much for starting this conversation!! As suggested, for simplicity, I took your code and tried contrasting between two groups for the time being.

# Make the rating probabilities
d_weight <- data.frame(ordinal_rating = 1:5) |> 
  mutate(weight = c(1:3, 2, 1)) |> 
  mutate(probability = weight / sum(weight))

# Make the rating probabilities for the second group
d_weight_2 <- data.frame(ordinal_rating = 1:5) |> 
  mutate(weight = c(1, 3, 3, 2, 1)) |> 
  mutate(probability = weight / sum(weight))

# Define your desired sample size
n <- 100

# Sample ordinal ratings
g1 <- sample(x = 1:5, size = n, replace = TRUE, prob = pull(d_weight, probability))
g2 <- sample(x = 1:5, size = n, replace = TRUE, prob = pull(d_weight_2, probability))


## Now combine this into a single dataframe
combined_df_direct <- tibble(
  rating = c(g1, g2), 
  group = rep(c("Group 1", "Group 2"), each = n)
)


## Let's run the model
library(brms)

m1 <- brm(rating ~ 1 + group,
          data = combined_df_direct,
          family = cumulative("logit"),
          prior = c(prior(normal(0, 1.5), class = "Intercept"),
                    prior(normal(0, 1), class = "b")),
          warmup = 1000,
          iter = 5000)

I think I got the hang of this, especially when I make the model spit out predictions using conditional_effects(m1, categorical = TRUE). It's easy to see here how well the model is recovering the probabilities of each rating. Currently it's not doing such a great job with 100 observations per group, as rating 2 in group 2 should have a higher probability.

However, look at the summary model output.

Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: rating ~ 1 + group 
   Data: combined_df_direct (Number of observations: 200) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -2.24      0.27    -2.78    -1.72 1.00     9951     9688
Intercept[2]    -0.51      0.19    -0.89    -0.14 1.00    16380    14500
Intercept[3]     0.96      0.20     0.58     1.36 1.00    15945    13463
Intercept[4]     2.55      0.28     2.02     3.11 1.00    15616    13255
groupGroup2      0.25      0.25    -0.23     0.74 1.00    14558    11385

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Since we are simulating by manually changing the probability of each rating between the two groups via the weight data frames, how does one know the effect size on the log-odds scale (i.e., the groupGroup2 estimate) a priori, before running the model? After all, it is this contrast that we are interested in estimating, and I don't have straightforward intuitions for how to glean "one effect size" directly from the weights themselves.

Some additional questions:

  1. To then extend this workflow to multiple groups (i.e. I am interested in estimating contrasts of the first group with three other groups, hence the indicator variable approach works extremely well), is it okay to just run the simulation for the contrast between two groups and then simply extrapolate that the other groups need similar sample sizes to estimate those contrasts? Or does one need to simulate observations for all the groups?
  2. How would one update this workflow for a multilevel model? For instance, in my data context, while participants belong to certain groups, they are also clustered in certain sequences. That is, the final model of interest is rating ~ 1 + group + (1| sequence)

BTW, while doing some deep dives on this topic, I found an R package that claims to simulate Likert-scale data. See the workflow below and this vignette: https://latent2likert.lalovic.io/

## model summary
summary(m1)
conditional_effects(m1, categorical = TRUE)


## Alternative workflow using the latent2likert package

## load 
library(latent2likert)
set.seed(54543)

## Simulate observations using this package
# Note that I have introduced the skew parameter as I know from previous pilot data that
# people respond in a skewed manner to this item
g1_v2 <- rlikert(size = 100, n_levels = 5, n_items = 1, mean = 0, skew = 0.8)
g2_v2 <- rlikert(size = 100, n_levels = 5, n_items = 1, mean = 0.3, skew = 0.8)

## Now combine this into a single dataframe
combined_df_direct_v2 <- tibble(
  rating = c(g1_v2, g2_v2), 
  group = rep(c("Group 1", "Group 2"), each = n)
)   

## Let's run the model again
m1_newdata <- update(
  m1,
  newdata = combined_df_direct_v2)

## model summary
summary(m1_newdata)
conditional_effects(m1_newdata, categorical = TRUE)

This spits out the predicted probabilities for each rating.

I thought that, using this workflow, I now know the latent effect size (mean = 0.3), but I am still struggling with the log-odds model output (i.e., the groupGroup2 coefficient) below:

> summary(m1_newdata)
 Family: cumulative 
  Links: mu = logit; disc = identity 
Formula: rating ~ 1 + group 
   Data: combined_df_direct_v2 (Number of observations: 200) 
  Draws: 4 chains, each with iter = 5000; warmup = 1000; thin = 1;
         total post-warmup draws = 16000

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1]    -1.86      0.25    -2.37    -1.39 1.00     8975    10576
Intercept[2]    -0.15      0.19    -0.52     0.23 1.00    17333    13124
Intercept[3]     0.94      0.20     0.55     1.34 1.00    17303    14307
Intercept[4]     2.50      0.29     1.96     3.08 1.00    16376    13415
groupGroup2      0.15      0.25    -0.32     0.64 1.00    16995    10672

Family Specific Parameters: 
     Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

What should I transform this estimate to? Certainly not an odds ratio, as that isn't the scale on which I originally defined the effect size.

Your initial question before numbers 1 and 2 (how do we think about the contrast) isn’t trivial. I walked it out a bit in this blog post on causal inference for ordinal experimental data. In short, you have options, and it isn’t straightforward which option is best. The post might also help clarify what the {latent2likert} package authors mean by a latent effect size. Causal inference with ordinal regression | A. Solomon Kurz
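One way to see the issue concretely: the two weight vectors above don't imply a single proportional-odds effect at all. You can compute the log-odds effect implied at each threshold directly, and note that it isn't constant:

```r
# Implied per-threshold log-odds effects for the two weight vectors above
p1 <- c(1, 2, 3, 2, 1) / 9    # group 1 probabilities
p2 <- c(1, 3, 3, 2, 1) / 10   # group 2 probabilities

tau1 <- qlogis(cumsum(p1)[1:4])  # thresholds implied by group 1
tau2 <- qlogis(cumsum(p2)[1:4])  # thresholds implied by group 2

# In a cumulative logit model, logit P(Y <= k | group 2) = tau_k - beta,
# so the effect implied at each threshold k is tau1[k] - tau2[k]:
round(tau1 - tau2, 2)
#> [1]  0.12 -0.29 -0.15 -0.12
```

Since those four values differ (and even differ in sign), a model that assumes one common groupGroup2 coefficient can only estimate a kind of compromise value, which is part of why there's no single a priori effect size to write down for those weights.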


Yup, I remember seeing this blog post while consulting on another project (thanks so much for that!!!). By latent effect size, I am aware that the latent2likert package authors are specifying the latent SD of the latent continuous variable assumed to underlie the simulated ordinal ratings, which the probit versions of the previous models are able to recover properly. Also, please correct me if I am wrong, but it seems like the latent2likert package at least has the advantage of pre-specifying a latent effect size, which the manually specified weight functions don't enjoy?

My initial question stemmed more from how to make sense of the estimate when it is a logit ordinal model, as opposed to a probit one. How does one convert the log-odds to the latent effect size (in SD units) to see whether the model is indeed recovering the pre-specified effect properly?

Also, if you could provide inputs on my initial questions 1 and 2, that would be super super helpful :))

I haven’t looked closely enough at the {latent2likert} package to comment on its specific strengths or weaknesses.

As to the logit/probit issue, there are known conversion formulas to go from one to the other; the standard one leans on the fact that the latent logistic distribution has a standard deviation of π/√3 ≈ 1.81. I'd say, however, that if you're leaning into a latent Cohen's d way of thinking, it might be easier to just use a probit model. I'm not aware of any major advantages of the logit over the probit.
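As a concrete sketch of that conversion (using the conventional approximation that divides a cumulative-logit coefficient by the latent logistic SD):

```r
# Rough conversion of a cumulative-logit coefficient to latent-SD units.
# The standard logistic distribution has SD pi / sqrt(3), so dividing by
# that constant puts the log-odds effect on an approximate Cohen's d scale.
b_logit  <- 0.25                 # e.g., the groupGroup2 estimate above
b_latent <- b_logit / (pi / sqrt(3))
round(b_latent, 2)
#> [1] 0.14
```

So a log-odds effect of 0.25 corresponds to roughly 0.14 latent standard deviations, and the reverse conversion just multiplies by π/√3.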

If you’re only interested in the comparisons of one group to another group, it might be a straightforward generalization from the 2-group case to 3+. However, if you wanted to do fancier things like compare the average of two of the groups to a third group, that would be more complicated. There’s also the issue of whether the groups would all have the same expected sample size, or if they’d differ for some reason.

To generalize to the multilevel model, I’d recommend first mastering the single-level case. Then once that’s running smoothly, see if you can start simulating multilevel ordinal data, and fitting reasonable models to them. Once that’s working well, you can start wrapping it into a power simulation function. As a general guideline, I make many small steps when I’m building up to a multilevel power simulation function.
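To give a flavor of that middle simulation step, here's a minimal base-R sketch of one latent-variable route for clustered ordinal data: draw a latent normal score that includes a sequence-level intercept, then cut it at fixed thresholds. All parameter values here (n_seq, n_per, sigma_seq, beta, the thresholds) are placeholders to tune to your own setting.

```r
set.seed(1)

n_seq      <- 20    # number of sequences (clusters)
n_per      <- 10    # participants per sequence
sigma_seq  <- 0.5   # SD of sequence-level intercepts on the latent scale
beta       <- 0.3   # latent-scale group effect
thresholds <- c(-1.5, -0.5, 0.5, 1.5)  # cut points carving the latent scale into 5 ratings

d <- expand.grid(person = 1:n_per, sequence = 1:n_seq)
d$group <- rep(0:1, length.out = nrow(d))

u   <- rnorm(n_seq, mean = 0, sd = sigma_seq)            # one intercept per sequence
eta <- beta * d$group + u[d$sequence] + rnorm(nrow(d))   # latent continuous score
d$rating <- findInterval(eta, thresholds) + 1            # discretize into ratings 1:5

head(d)
```

A cumulative("probit") model with rating ~ 1 + group + (1 | sequence) fit to data like these should recover beta and sigma_seq on the latent scale, which makes this a natural test bed before wrapping everything in a power simulation function.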


Thanks Solomon :)) All of this makes sense. I will aim to come back and post about the multilevel version once I have mastered the single-level case, both to check whether I am on the right track and, if I am, so it will be useful for others in a similar position.

Stats would be way harder (and less fun) without folks like you around, so thanks so much for all your contributions!!


Thanks for the kind words. Just keep posting and asking good questions, and the next crop of new users will benefit from your efforts, too. Cheers!

Thanks for this excellent blog post!

I’m curious about the choice to model post scores as ordinal but to include the pre scores as metric.

Would it make sense to use a joint model of the pre scores, or the mo() function in brms, for consistency?

I have advised on other projects where we faced similar problems. A way out could be the following:

Rating ~ (1 + Main Effect | Time) + (1|SubID),

where Time is a dummy variable with 0 = pre-score and 1 = post-score, and SubID is included because each participant now has two ratings (pre and post). This way, you don't need to go down the rabbit hole of whether to specify the pre-scores as categorical, as monotonic, or standardized and treated as continuous.

I believe Solomon has proposed similarish solutions here: Just use multilevel models for your pre/post RCT data | A. Solomon Kurz

Thanks for the kind words, @JLC. Yes, you could do that, too. It’d just be one more complication to explain in an already long blog post.