Hi All,

I’m trying get a sense for fitting a cumulative probit model in `brms`

by using simulation to see if I can recover known parameter values. The set-up is that I have an ordinal outcome variable (a 5 category Likert response) and several categorical predictors. I’d like to model these categorical predictors using the “index” method rather than treating one level as the base. I’ve been relying on @paul.buerkner’s paper and @Solomon 's post to gain some intuition for the model and the nature of the \tau parameters. I’m ultimately trying to fit a much more complicated model, but wanting to start simple with the simulation to make sure I’m understanding how the pieces fit. Unfortunately, I seem to consistently underestimate the \alpha_{sex} varying intercept despite a lot of fiddling with priors and model specification. This has led me to question whether I actually am understanding how to interpret varying intercepts in the context of the cumulative probit model (where the global intercepts are actually the threshold values and not some overall mean). I’m hoping someone can chime in here to help me:

**a) verify that my simulation of the ordinal outcomes is correct,**

**b) ensure that I’m getting the posterior estimate of the varying intercepts correctly**, and

**c) understand the interpretation of varying intercepts in a cumulative probit model,**

The basic structure of the model is:

Where \tau_{k} are the cutpoints (K=4) and \alpha_{sex} is a 3-level categorical variable and $\alpha_{age} is a 5-level categorical variable.

I simulate the data using:

```
library(tidyverse)
library(brms)
library(tidybayes)
set.seed(5) # used to replicate example
#generate the known values
n_category_age <- 5
n_category_sex <- 3
a_sex <- runif(n_category_sex, -5, 5)
a_age <- runif(n_category_age, -5, 5)
#generate data
n_obs <- 5000
obs_age <- sample(1:n_category_age, n_obs, replace=TRUE)
obs_sex <- sample(1:n_category_sex, n_obs, replace=TRUE)
#generate rating response
rating_linpred <- a_sex[obs_sex] + a_age[obs_age]
thresholds <- tibble(rating = 1:5) %>%
mutate(proportion = 1/5) %>%
mutate(cumulative_proportion = cumsum(proportion)) %>%
mutate(right_hand_threshold = qnorm(cumulative_proportion))
n_tau <- 4 #the number of ratings - 1
tau <- thresholds$right_hand_threshold[1:n_tau]
probs <- matrix(nrow=n_obs, ncol=(n_tau + 1)) #need probability of landing in last group
probs[,1] <- pnorm(tau[1], mean=0 + rating_linpred)
probs[,2] <- pnorm(tau[2], mean=0 + rating_linpred) - pnorm(tau[1], mean=0 + rating_linpred)
probs[,3] <- pnorm(tau[3], mean=0 + rating_linpred) - pnorm(tau[2], mean=0 + rating_linpred)
probs[,4] <- pnorm(tau[4], mean=0 + rating_linpred) - pnorm(tau[3], mean=0 + rating_linpred)
probs[,5] <- 1 - pnorm(tau[4], mean=0 + rating_linpred)
rating <- vector(mode="integer", length = n_obs)
for(i in 1:nrow(probs)){
draws <- rmultinom(1,1, prob = probs[i,])
rating[i] <- which(draws[,1] !=0)
}
```

Which yields:

I then fit the model using:

```
df <- data.frame(cbind(obs_age, obs_sex, rating))
fit1 <- brm(
data = df,
family = cumulative(probit),
formula = bf(rating ~ 1 + age + sex,
age ~ 0 + (1|obs_age),
sex ~ 0 + (1|obs_sex),
nl=TRUE),
prior = c(prior(normal(-0.8, 1), class = Intercept, coef = 1),
prior(normal(-0.25, 1), class = Intercept, coef = 2),
prior(normal(0.25, 1), class = Intercept, coef = 3),
prior(normal(0.85, 1), class = Intercept, coef = 4),
prior(exponential(1.5), class= sd, nlpar = "age"),
prior(exponential(1.5), class= sd, nlpar="sex")),
cores = 4,
seed = 1,
init_r = 0.2,
control=list(max_treedepth = 15)
)
```

The model fits fine (no divergences) and returns estimates for tau_k and \alpha_{age} that are very close to the true values; however, the \alpha_{sex} values are consistently lower than the true values no matter how much I ‘weaken’ the prior on the varying intercepts:

```
sex_draws <- fit1 %>%
spread_draws( r_obs_sex__sex[condition,term])
xint <- data.frame(condition = 1:length(a_sex), vline = a_sex)
ggplot(data = sex_draws, aes(x=r_obs_sex__sex)) +
stat_halfeye(fill='lightcoral') +
facet_wrap(~condition) +
geom_vline(data= xint, aes(xintercept = vline), linetype='dashed')+
ggtitle(expression("Comparison of posterior estimates of" ~ alpha[sex]),
subtitle = "Dashed line indicates true value" )
```

I would love your thoughts on:

- Is it appropriate to fit varying intercepts on the linear predictor, this way?
- Am I simulating the rating response appropriately?
- In estimating the varying intercepts do I need to somehow incorporate the estimates of \tau_k
- How to think about the priors on the varying intercepts in the context where the global intercept (\tau_k) doesn’t refer to an average and \beta_0 is set to zero (for identifiability)

Apologies for the long post - please feel free to respond to any of these pieces as I’m trying to learn as much as I can about these models and how to fit them in `brms`

.