# Incorporate data distribution into ordinal regression

I am experimenting with some ordinal regression models and I have been following the great advice that I was given in another thread (Ordinal regression to model components failure). I think I got good results, but I realized that the ordinal regression does not take into account how the data for each category is distributed. Is it possible to include this information into the model?

I simulated a dataset. We have some components that are tested under different force levels and we report the damage (1, 2, 3). For example, we stress a component with a force = 40 and it may report low (small fractures), medium, or high damage (total failure). The dataset is not balanced as we have more components that break with low damage. Within each damage level, the data is normally distributed—this the info I would like to incorporate into my model.

``````library('tidyverse')
library('brms')
library('tidybayes')

SEED <- 20220510
NCAT  <- 3 # number of damage levels (it will be the ordinal category)

set.seed(SEED)
d  <- tibble(
mean = seq(35, 40, length.out=NCAT),
sd = rep(2, times = NCAT),
n = seq(50, 20, length.out=NCAT),
damage = factor(seq(1, NCAT), ordered=T))  %>%
rowwise() %>%
mutate(force = list(rnorm(n = n, mean = mean, sd = sd)))  %>%
unnest(cols = c(force))
``````

Then, I model it with a cumulative family

``````m_ordinal_same <- brm(bf(damage ~ 1 +  force, center=T),
data = d,
family = cumulative(logit),
seed = SEED)
``````

and obtain

``````Family: cumulative
Links: mu = logit; disc = identity
Formula: damage ~ 1 + force
Data: d (Number of observations: 105)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000

Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    41.26      6.16    30.27    54.52 1.00     1744     1833
Intercept    44.76      6.59    33.09    58.90 1.00     1610     1710
force            1.13      0.17     0.82     1.49 1.00     1658     1779

Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
disc     1.00      0.00     1.00     1.00   NA       NA       NA

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
``````

As you see from the histograms, the data is sparse for the high damage category (3) but all data comes from a normal distribution with the mean shifted in each category. Would it be good to include this info? Maybe this is not needed? Or, maybe the ordinal regression is not the right model in this case?

Take this with a grain of salt, as I have only had to work with ordinal responses a few times, but I am wondering if you are starting to think about this problem backwards?
I think the model that you have specified is aimed at looking at how an increase in a unit of force is associated with the continuous latent variable damage. Your data consists of the ordinal response variable `damage`, but your model assumes that `damage` originates from the categorization of a continuous normally distributed latent variable, we could call `damage_hat`. So if you had `damage_hat` you could just run a simple regression `brm(damage_hat ~ force)`, but instead you just have the ordered categories `damage`. But your predictor variable `force` is the same in both instances. When you say

you are talking about (and showing in the histograms) how the data for `force` is distributed in each category of `damage`.

If you want to know the mean `force` for each category then turn it around and run `brm(force ~ damage)`, but that doesn’t seem like what you are looking for based on the other post that you linked.
Am just wondering if you are confusing the distribution of the response and that of the predictor.
Or perhaps I completely misunderstand what you are getting at:)

3 Likes

Thank you @jd_c . I think you are right. I was overthinking a bit and I started to confuse predictor/outcome. I believe, the model that I specified is what I want—estimate `damage` from `force`.

The idea I had was to have two stacked models. The first model would estimate the distribution of the predictors for each damage level (in this synthetic example, the predictors are normally distributed). I thought that would help in case of sparse/little data. Then, I would use the estimated predictor distribution in the ordinal regression model. Now, I don’t know if this makes sense at all, or if it does, whether it would be beneficial in the end.

I think you got confused by seeing the distribution of outcomes spreading out differently but that is not really what is at question with your model, as far as I can tell. If you are concerned that your model is not capturing what seems to be the extra spread (I guess you mean that the damage 3 histogram looks flatter or something like that?) then you can see if this behavior is capture by the model you’ve already made by running a posterior predictive check.

Use a function such as posterior_epred(m_ordinal_same) to get a posterior distribution of predicted ordinal outcomes from your model with the given force values. Randomly select a bunch of draws from the posterior, and then plot the predicted values for each of those draws in the same fashion as you have with the raw data here (cool plots by the way). You will then be able to see whether your model reproduces that observed behavior in the raw data.