Beta family distribution and posterior predictive checks with continuous outcome variable

Hi everyone!

I’m relatively new to Bayesian modeling, and I was hoping to get some insights into my approach.

I have a dataset of ratings in which people make binary choices about a given stimulus. They rate each stimulus many times over the course of an experiment, yielding a continuous average-percent-chosen variable bounded between 0 and 1 (PercentChosen). I am interested in how features of the stimulus (Stimulus) and the condition people were assigned to (Condition) influence this average proportion. There are 21 pairs of stimuli (Pair) and 57 participants (Participant) in this dataset. Stimulus is a continuous numeric variable, Condition is a factor with two levels (Friend, Rival), Pair is a factor with 21 levels, and Participant is a factor with 57 levels.

Please see my full brms model below:

brm(PercentChosen ~ Stimulus * Condition + (Condition | Pair) + (Condition | Participant),
    data = df,
    prior = c(prior(normal(0, 10), class = "b"),
              prior(cauchy(0, 5), class = "sd"),
              prior(normal(0.5, 1), class = "Intercept")),
    family = Beta(link = "logit"),
    iter = 10000,
    warmup = 2000,
    init = 0,
    chains = 4,
    cores = 6)

Here is the model summary I get:

Family: beta 
Links: mu = logit; phi = identity 
Formula: PercentChosen ~ Stimulus * Condition + (Condition | Pair) + (Condition | Participant) 
Data: df (Number of observations: 9500) 
Draws: 4 chains, each with iter = 10000; warmup = 2000; thin = 1;
total post-warmup draws = 32000

Group-Level Effects: 
~Pair (Number of levels: 21) 
                              Estimate Est.Error l-95% CI u-95% CI    Rhat Bulk_ESS Tail_ESS
sd(Intercept)                  0.71489   0.12139  0.52183  0.99565 1.00095     6334    11065
sd(ConditionA)                 1.24624   0.21052  0.91177  1.73056 1.00112     6247    11241
cor(Intercept,ConditionA)     -0.97654   0.01324 -0.99278 -0.94268 1.00057     9074    15853

~Participant (Number of levels: 57) 
                              Estimate Est.Error l-95% CI u-95% CI    Rhat Bulk_ESS Tail_ESS
sd(Intercept)                  0.01371   0.01070  0.00050  0.03988 1.00017    26272    17285
sd(ConditionA)                 0.02098   0.01625  0.00081  0.06054 1.00019    23371    17556
cor(Intercept,ConditionA)     -0.16661   0.57854 -0.97376  0.93088 1.00006    45602    22570

Population-Level Effects: 
                                    Estimate Est.Error l-95% CI u-95% CI    Rhat Bulk_ESS Tail_ESS
Intercept                            0.01210   0.16454 -0.31312  0.33992 1.00129     3561     6790
Stimulus                            -0.21907   0.13921 -0.49233  0.05294 0.99993    37531    25153
ConditionA                          -0.11456   0.28352 -0.67777  0.44660 1.00123     3673     7099
Stimulus:ConditionA                  0.58161   0.19753  0.20340  0.96824 1.00014    38952    26825

Family Specific Parameters: 
    Estimate Est.Error l-95% CI u-95% CI    Rhat Bulk_ESS Tail_ESS
phi  0.92056   0.01099  0.89916  0.94206 1.00002    73177    22498

My question arises when I run posterior predictive checks on this model with pp_check().

pp_check(model, ndraws = 50)

pp_check(model, "stat_grouped", ndraws = 50, group = "Condition")

Based on online tutorials I have been following (namely Causal inference with beta regression | A. Solomon Kurz), the Beta distribution should be appropriate for a continuous outcome bounded on (0, 1). However, I'm not sure why there are such large discrepancies between the observed data and the posterior draws in my pp_check() plots.

Any insights or guidance would be greatly appreciated, thank you!

Hello @helen_schmidt, what do the data look like? Based on that PPC, it looks to me like the data are strongly clustered around some round numbers (0, 0.2, 0.4, 0.6, 0.8, and 1, or thereabouts).
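A quick histogram of the raw outcome would show this (a minimal sketch, using the df and PercentChosen names from your post):

# Look for clumping at a small set of attainable proportions.
hist(df$PercentChosen, breaks = 100,
     main = "Observed PercentChosen", xlab = "Proportion chosen")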

Another potential issue with the model is that the priors are quite wide. On the logistic scale this can create some strange behaviour because it puts a lot of prior weight on very large effects.
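To see why, you can push draws from that normal(0, 10) slope prior through the inverse logit; the implied probabilities pile up near 0 and 1 (a quick sketch, not tied to your fitted model):

# Draws from a normal(0, 10) prior on the logit scale, mapped to the
# probability scale with the inverse logit: most of the prior mass
# ends up very close to 0 or 1.
set.seed(123)
b <- rnorm(1e5, mean = 0, sd = 10)
hist(plogis(b), breaks = 50,
     main = "Implied prior on the probability scale")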

Hi there,

I will add some thoughts to what @AWoodward already said. To me, the data are not really continuous. If, for example, each stimulus receives 10 ratings, the only proportions you can observe are 0, 0.1, 0.2, …, 0.9, 1, which violates the Beta distribution's assumption of a genuinely continuous outcome.

Have you considered using a binomial model for your outcome? I think it would fit better with what you are doing, since each individual rating can be thought of as a "success" or "failure".
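Something like this, perhaps (a sketch; I'm assuming you can recover the raw counts rather than only the averaged proportion, with hypothetical column names SelectedCount and PresentedCount):

# Model the number of times each stimulus was chosen (SelectedCount)
# out of the number of times it was presented (PresentedCount).
brm(SelectedCount | trials(PresentedCount) ~ Stimulus * Condition +
      (Condition | Pair) + (Condition | Participant),
    data = df,
    family = binomial(link = "logit"))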


100% agree with @MarijnG’s suggestion.

Thank you everyone for such prompt and helpful replies!

I’ve narrowed my priors and switched to a beta-binomial distribution, which together seem to fix the problem. Thanks!

brm(SelectedCount | trials(PresentedCount) ~ Stimulus * Condition +
      (Condition | Pair) + (Condition | Participant),
    data = df,
    prior = c(prior(normal(0, 5), class = "b"),
              prior(cauchy(0, 5), class = "sd"),
              prior(normal(0.5, 1), class = "Intercept")),
    family = beta_binomial(link = "logit"),
    iter = 10000,
    warmup = 2000,
    init = 0,
    chains = 4,
    cores = 6)
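In case it helps anyone else, these are the checks I re-ran on the refit (assigned to model as before) to verify the fit:

pp_check(model, ndraws = 50)
pp_check(model, type = "stat_grouped", ndraws = 50, group = "Condition")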
