I have a situation where I have a 7-item Likert scale, where each respondent answers 6 questions on the same subject (TAM).
Essentially the same setup as in this question: Partial pooling on ordinal response (cumulative probit)
Sample data:
df <- data.frame(
id=rep(c("A", "B", "C", "D", "E", "F", "G"), each=6),
gender=c(rep("male", 6*4), rep("female", 6*3)),
name=rep(c("Q1", "Q2", "Q3", "Q4", "Q5", "Q6"), times=7),
response=c(rep(6,6), # resp.1, male
c(rep(7,4), rep(6,2)), # resp.2, male
rep(7,6), # resp.3, male
c(7, rep(6,4), 5), # resp.4, male
rep(7, 6), #resp.5, female
rep(6, 6), #resp.6, female
c(7, 6, 6, rep(5,3)) # resp.7, female
)) |>
mutate(ordered=factor(response, levels=1:7, ordered=TRUE),
revorder=factor(8-response, levels=1:7, ordered=TRUE))
df |> group_by(id, gender, ordered) |> tally()
I noted that brms
sampled more efficiently (meaning, no divergent transitions) if I reversed the scales (note that my data set only contains 5, 6, and 7). This is why is use the reverse order ( revorder
) in my formulas below.
I make heavy use of @Solomon s utiltiy function to plot the posterior mean (taken from his excellent blog Notes on the Bayesian cumulative probit | A. Solomon Kurz ):
plot_M2_posterior_mean <- function(m, title, sample_mean) {
m_male <- m$data |> filter(gender == 'male') |> sample_mean() |> pull()
m_female <- m$data |> filter(gender == 'female') |> sample_mean() |> pull()
# note, the magrittr pipe %>% is needed, because native |> pipe does not handle dot
as_draws_df(m) %>%
select(.draw, starts_with("b_")) %>%
set_names(".draw", str_c("tau[", 1:6, "]"), "beta[1]") %>%
# insert another copy of the data below
bind_rows(., .) %>%
# add the two values for the dummy variable gendermale
mutate(gendermale = rep(0:1, each = n() / 2)) %>%
# compute the p_k values conditional on the gender dummy
mutate(p1 = pnorm(`tau[1]`, mean = 0 + gendermale * `beta[1]`),
p2 = pnorm(`tau[2]`, mean = 0 + gendermale * `beta[1]`) - pnorm(`tau[1]`, mean = 0 + gendermale * `beta[1]`),
p3 = pnorm(`tau[3]`, mean = 0 + gendermale * `beta[1]`) - pnorm(`tau[2]`, mean = 0 + gendermale * `beta[1]`),
p4 = pnorm(`tau[4]`, mean = 0 + gendermale * `beta[1]`) - pnorm(`tau[3]`, mean = 0 + gendermale * `beta[1]`),
p5 = pnorm(`tau[5]`, mean = 0 + gendermale * `beta[1]`) - pnorm(`tau[4]`, mean = 0 + gendermale * `beta[1]`),
p6 = pnorm(`tau[6]`, mean = 0 + gendermale * `beta[1]`) - pnorm(`tau[5]`, mean = 0 + gendermale * `beta[1]`),
p7 = 1 - pnorm(`tau[6]`, mean = 0 + gendermale * `beta[1]`)) %>%
# wrangle
pivot_longer(starts_with("p"), values_to = "p") %>%
mutate(response = str_extract(name, "\\d") %>% as.double()) %>%
# compute p_k * k
mutate(`p * response` = p * response) %>%
# sum those values within each posterior draw, by the gendermale dummy
group_by(.draw, gendermale) %>%
summarise(mean_response = sum(`p * response`)) %>%
mutate(gender = ifelse(gendermale == 0, "female", "male")) %>%
# the trick with and without fct_rev() helps order the axes, colors, and legend labels
ggplot(aes(x = mean_response, y = fct_rev(gender), fill = gender)) +
stat_halfeye(.width = .95) +
geom_vline(xintercept = m_male, linetype = 2, color = "blue3") +
geom_vline(xintercept = m_female, linetype = 2, color = "red3") +
scale_fill_manual(NULL, values = c(alpha("red3", 0.5), alpha("blue3", 0.5))) +
labs(title = paste0("The posterior for the mean of the rating values for ", title, ", by gender"),
subtitle = paste("The dashed vertical lines mark off the sample means, by gender\nFormula:", m$formula),
x = expression(mu[response]),
y = NULL) +
theme(axis.text.y = element_text(hjust = 0))
}
When I use the formula and priors:
formula <- revorder | thres(6) ~ 1 + gender + (1 | id)
priors1 <- c(prior(normal(-1.068, 1), class = Intercept, coef = 1),
prior(normal(-0.566, 1), class = Intercept, coef = 2),
prior(normal(-0.180, 1), class = Intercept, coef = 3),
prior(normal( 0.180, 1), class = Intercept, coef = 4),
prior(normal( 0.566, 1), class = Intercept, coef = 5),
prior(normal( 1.068, 1), class = Intercept, coef = 6),
prior(normal(0, 1), class = b),
prior(exponential(1), class = sd))
M3 <- brm(
data = df |> select(revorder, gender, id),
family = cumulative(probit),
formula=formula,
prior = priors1,
drop_unused_levels = F,
backend="cmdstanr",
cores = 4,
seed = 1
)
I get the following summary:
Family: cumulative
Links: mu = probit; disc = identity
Formula: revorder | thres(6) ~ 1 + gender + (1 | id)
Data: select(df, revorder, gender, id) (Number of observations: 42)
Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup draws = 4000
Multilevel Hyperparameters:
~id (Number of levels: 7)
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept) 2.88 1.00 1.34 5.19 1.00 1189 1698
Regression Coefficients:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept[1] -2.36 0.77 -3.92 -0.88 1.00 1819 2037
Intercept[2] -0.31 0.72 -1.74 1.04 1.00 1979 2037
Intercept[3] 0.60 0.69 -0.78 1.94 1.00 2324 2697
Intercept[4] 0.94 0.69 -0.42 2.25 1.00 2533 2872
Intercept[5] 1.37 0.72 -0.04 2.82 1.00 2810 3447
Intercept[6] 2.07 0.85 0.49 3.81 1.00 3237 3237
gendermale -0.07 0.88 -1.81 1.65 1.00 1842 2146
and the following posterior mean (based on the function above):
Note that the regularized mean is quite close to 3
, and far away from the sample mean.
When I choose a different prior for the sd
, such as the narrower exp(5)
, I get closer to the sample mean:
priors5 <- c(prior(normal(-1.068, 1), class = Intercept, coef = 1),
prior(normal(-0.566, 1), class = Intercept, coef = 2),
prior(normal(-0.180, 1), class = Intercept, coef = 3),
prior(normal( 0.180, 1), class = Intercept, coef = 4),
prior(normal( 0.566, 1), class = Intercept, coef = 5),
prior(normal( 1.068, 1), class = Intercept, coef = 6),
prior(normal(0, 1), class = b),
prior(exponential(5), class = sd))
M3exp5 <- brm(
data = df |> select(revorder, gender, id),
family = cumulative(probit),
formula=formula,
prior = priors5,
drop_unused_levels = F,
backend="cmdstanr",
cores = 4,
seed = 1
)
Here, we are closer to the sample mean (and we could get even closer, by choosing an even narrower prior, e.g., exp(10)
The real a-ha moment came when I simulated 40 new respondents, that are extremely positive (all ranking 1
on the Likert)
This is of course easy to simulate by adding fake data:
fakedata <- rbind(df |> select(revorder, gender, id),
data.frame(revorder=1, # rave reviews
gender=rep(c("female", "male"), each=120),
id=rep(c("100", "101", "102", "103", "104", "105", "106", "107", "108", "109",
"1100", "1101", "1102", "1103", "1104", "1105", "1106", "1107", "1108", "1109",
"200", "201", "202", "203", "204", "205", "206", "207", "208", "209",
"2100", "2101", "2102", "2103", "2104", "2105", "2106", "2107", "2108", "2109"), each=6) # 40 new respondents
)
)
fakeM3exp1 <- brm(
data = fakedata,
family = cumulative(probit),
formula=formula,
prior = priors1,
drop_unused_levels = F,
backend="cmdstanr",
cores = 4,
seed = 1
)
Plotting the posterior mean shows that the expected value is still around 3
:
Using the exp(5)
(or, even more so, exp(10)
priors make the data align considerably better to the sample mean:
The reason for this behaviour is presumably because of the “long tail” of the exp(1)
distribution, and the fact that the probit has a relatively narrow span.
But how do I know which prior to choose? I find choosing priors that influence standard deviations much harder to intuit than others (being the newbie I still am, unfortunately…)
In this case, I know, by design, that no matter how many respondents ( id
in the data) I get, each will only generate 6 answers. I suspect this is the reason the exp(1)
prior fails (it is too strong, dragging the posterior mean too much towards the center of the scale)
Although I only tried this with the cumulative probit, I suspect the similar issue will arise with the logit.
EDIT: Found another indication that the exp(1)
prior is indeed too strong for this use case (remember, every respondent, which we pool on, will only ever give 6 responses)
I decided to try out the varying variance example from @Solomon also.
Though only as a fixed effect, so my model became this:
formula <- bf(revorder | thres(6) ~ 1 + gender + (1 | id)) +
lf(disc ~ 0 + gender, cmc = FALSE)
priors1 <- c(prior(normal(-1.068, 1), class = Intercept, coef = 1),
prior(normal(-0.566, 1), class = Intercept, coef = 2),
prior(normal(-0.180, 1), class = Intercept, coef = 3),
prior(normal( 0.180, 1), class = Intercept, coef = 4),
prior(normal( 0.566, 1), class = Intercept, coef = 5),
prior(normal( 1.068, 1), class = Intercept, coef = 6),
prior(normal(0, 1), class = b),
prior(normal(0, log(2)/2), class = b, dpar=disc),
prior(exponential(1), class = sd)
)
M4_ppc <- brm(
data = df,
family = cumulative(probit),
formula=formula,
prior = priors1,
drop_unused_levels = F,
sample_prior = "only",
backend="cmdstanr",
cores = 4,
seed = 1,
control = list(adapt_delta=0.99)
)
PPC looks very good:
Based on my understanding on pnorm
and disc
, I modified the plot function to incorporate the varying sd
:
plot_M4_posterior_mean <- function(m, title, sample_mean) {
m_male <- m$data |> filter(gender == 'male') |> sample_mean() |> pull()
m_female <- m$data |> filter(gender == 'female') |> sample_mean() |> pull()
# note, the magrittr pipe %>% is needed, because native |> pipe does not handle dot
as_draws_df(m) %>%
select(.draw, starts_with("b_")) %>%
set_names(".draw", str_c("tau[", 1:6, "]"), "beta[1]", "b_disc[1]") %>%
# insert another copy of the data below
bind_rows(., .) %>%
# add the two values for the dummy variable gendermale
mutate(gendermale = rep(0:1, each = n() / 2)) %>%
# compute the p_k values conditional on the gender dummy
mutate(p1 = pnorm(`tau[1]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)),
p2 = pnorm(`tau[2]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)) - pnorm(`tau[1]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)),
p3 = pnorm(`tau[3]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)) - pnorm(`tau[2]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)),
p4 = pnorm(`tau[4]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)) - pnorm(`tau[3]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)),
p5 = pnorm(`tau[5]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)) - pnorm(`tau[4]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)),
p6 = pnorm(`tau[6]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)) - pnorm(`tau[5]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`)),
p7 = 1 - pnorm(`tau[6]`, mean = 0 + gendermale * `beta[1]`, sd = 1/exp(0 + gendermale * `b_disc[1]`))) %>%
# wrangle
pivot_longer(starts_with("p"), values_to = "p") %>%
mutate(response = str_extract(name, "\\d") %>% as.double()) %>%
# compute p_k * k
mutate(`p * response` = p * response) %>%
# sum those values within each posterior draw, by the gendermale dummy
group_by(.draw, gendermale) %>%
summarise(mean_response = sum(`p * response`)) %>%
mutate(gender = ifelse(gendermale == 0, "female", "male")) %>%
# the trick with and without fct_rev() helps order the axes, colors, and legend labels
ggplot(aes(x = mean_response, y = fct_rev(gender), fill = gender)) +
stat_halfeye(.width = .95) +
geom_vline(xintercept = m_female, linetype = 2, color = "red3") +
geom_vline(xintercept = m_male, linetype = 2, color = "blue3") +
scale_fill_manual(NULL, values = c(alpha("red3", 0.5), alpha("blue3", 0.5))) +
labs(title = paste0("The posterior for the mean of the rating values for ", title, ", by gender"),
subtitle = paste("The dashed vertical lines mark off the sample means, by gender\nFormula:", m$formula),
x = expression(mu[response]),
y = NULL) +
theme_bw()
}
Running and plotting this model (with the small data set from the top of this post), works OK, on the surface:
M4 <- brm(
data = df,
family = cumulative(probit),
formula=formula,
prior = priors1,
drop_unused_levels = F,
backend="cmdstanr",
cores = 4,
seed = 1,
control = list(adapt_delta=0.99)
)
But the killer really comes when I wanted to check what happens when I get 40 rave females (all giving 1 in rating), so let’s create some fake data and see how the model reacts:
fakedata <- rbind(df |> select(revorder, gender, id),
data.frame(revorder=1, # rave reviews
gender=rep(c("female", "female"), each=120),
id=rep(c("100", "101", "102", "103", "104", "105", "106", "107", "108", "109",
"1100", "1101", "1102", "1103", "1104", "1105", "1106", "1107", "1108", "1109",
"200", "201", "202", "203", "204", "205", "206", "207", "208", "209",
"2100", "2101", "2102", "2103", "2104", "2105", "2106", "2107", "2108", "2109"), each=6) # 40 new respondents
)
)
Because I only affect one of the categories (and very much so), we would expect that category to “narrow down”, and shrink towards the category mean.
But instead, the model just goes wild:
If I instead choose the narrower exp(5)
prior, I get the desired behaviour (of sorts):
priors5 <- c(prior(normal(-1.068, 1), class = Intercept, coef = 1),
prior(normal(-0.566, 1), class = Intercept, coef = 2),
prior(normal(-0.180, 1), class = Intercept, coef = 3),
prior(normal( 0.180, 1), class = Intercept, coef = 4),
prior(normal( 0.566, 1), class = Intercept, coef = 5),
prior(normal( 1.068, 1), class = Intercept, coef = 6),
prior(normal(0, 1), class = b),
prior(normal(0, log(2)/2), class = b, dpar=disc),
prior(exponential(5), class = sd)
)
This, to me, is evidence that unlike “just beta priors”, when you deal with choosing sd
priors (and I strongly suspect also cor
priors), you’d be better off by actually testing your priors with a couple of “fake data sets”, to see what effects your priors have on your model when faced with synthetic (and therefore predictable) data.
Also @Solomon , I would highly appreciate if you could review my adaptation of your plot function for the posterior mean ( plot_M4_posterior_mean
). I am not totally at home in R yet, and I have never worked with disc
before.
Please also provide the following information in addition to your question:
- Operating System: 22.04.1-Ubuntu
- brms Version: Package brms version 2.22.0