Simulating data for Dirichlet regression with varying estimates

Hello everyone,

The shortest formulation of my problem is the following:

I’m trying to design a power analysis by simulation for a Dirichlet regression model using brms. How can I calculate the shape parameter alpha of the Dirichlet distribution by hand from a theoretical model, in order to feed it to brms::rdirichlet(n, alpha) and simulate data with chosen effects? The goal is to manually tweak the coefficients of the model to simulate data with varying effect sizes.

Now for the lengthier context and what I’ve tried:

The outcome I want to predict is a composition Y of four emotions (anger, disgust, sadness, neutral). I have two categorical predictors with two levels each, called Domain (Harm, Purity) and Focus (Self, Other). Thus, the model is Y ~ Domain + Focus + Domain:Focus. If my understanding of this vignette is correct (I suspect it isn’t), the mean proportion of emotion i is Y_i = exp(alpha_i) / sum_k exp(alpha_k), with alpha being the shape parameter of the Dirichlet distribution.

I built some toy data by hand, fitted the model on it, and then tried to make sense of the coefficients. Here is the model:

# helper so the multivariate response can be written inside the formula
bind <- function(...) cbind(...)

model <-
  brm(
    # compositional outcome: the four emotion columns bound together
    formula = bind(anger, disgust, sadness, neutral) ~ domain * focus,
    data = df,
    family = dirichlet()
  )

summary(model)
Family: dirichlet
Links: mudisgust = logit; musadness = logit; muneutral = logit; phi = identity
Formula: y ~ domain * focus
Data: df (Number of observations: 48)
Draws: 7 chains, each with iter = 6000; warmup = 2000; thin = 1; 
total post-warmup draws = 28000

Regression Coefficients: [see down below]

Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 4.60 0.65 3.41 5.97 1.00 22872 19779

Here are the regression coefficients with a bit of formatting on my end:

I know that given a brmsfit model, I can just use predict() (or any equivalent) to retrieve the composition I was aiming for above, as advised in several interesting threads like here or here. Here is a plot of these predicted values for this toy model, which fits nicely with the original data (the plot is custom, but it is virtually the same as brms::conditional_effects):

If my understanding is correct, when domain is Harm (dummy-coded as 0), proportions for (anger, disgust, sadness, neutral) should be somewhere around (0.6, 0.3, 0.1, 0), and when it is Purity (dummy-coded as 1), they should be around (0.1, 0.7, 0.2, 0).

However, when I try to retrieve approximate proportions using the link formula, I can’t seem to get these values, especially because the Purity coefficient for Sadness is higher than the one for Disgust. I wrote a function that I thought would be able to compute the composition from the coefficients:

retrieve_composition <- function(domain, focus){
  
  # the beta coefficients from the summary
  beta_0  <- c(0, -0.54, -1.1, -2.96)
  beta_1  <- c(0, 3.13, 4.79, 2.95)
  beta_2  <- c(0, -0.12, -0.99, -0.06)
  beta_12 <- c(0, -0.4, -1.71, -0.61)
  
  alpha <- c(
    beta_0[1] + beta_1[1]*domain + beta_2[1]*focus + beta_12[1]*domain*focus,
    beta_0[2] + beta_1[2]*domain + beta_2[2]*focus + beta_12[2]*domain*focus,
    beta_0[3] + beta_1[3]*domain + beta_2[3]*focus + beta_12[3]*domain*focus,
    beta_0[4] + beta_1[4]*domain + beta_2[4]*focus + beta_12[4]*domain*focus
  )
  
  composition <- exp(alpha)/sum(exp(alpha))
  
  return(composition)
}

It looks OK for the intercept, as retrieve_composition(0, 0) gives (.51, .30, .17, .03). But for the Purity condition, i.e. retrieve_composition(1, 0), I get (.02, .24, .72, .02), when it should be around (0.15, 0.7, 0.1, 0)… which is logical, because beta_1[2] is lower than beta_1[3]. However, these are the brms model coefficients, and the model fits and predicts the data correctly…

I can’t figure out how to retrieve the composition for the Purity condition. I feel like I’ve made an obvious misinterpretation due to my lack of mathematical literacy, but I don’t understand where… I’ve run out of ideas. What have I done wrong?

Sorry for the lengthy post, and thanks a lot for your help!

  • Operating System: Windows 10
  • brms Version: 2.21.0

Hi, @m-delem and welcome to the Stan forums. Sorry for the slow response. We have way more brms questions on the forums than we have brms users to answer them. I don’t know brms, but I’ll give it a shot w/o worrying about brms details.

I’m not sure what you mean by a theoretical model here. Usually theoretical models don’t involve actual values for parameters. What do you expect alpha to be? We know alpha / sum(alpha) is the mean of a dirichlet(alpha) variable (it’s not exponentiated as you wrote—the Wikipedia page is good for this kind of thing). Then marginally, each component n is distributed beta(alpha[n], sum(alpha) - alpha[n]). If you have a bunch of examples of y, where y ~ dirichlet(alpha), then you can fit alpha in Stan.
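For instance, here’s a quick numerical check of those two facts, using the brms::rdirichlet() mentioned in the question (the alpha values are arbitrary, just for illustration):

# mean of dirichlet(alpha) is alpha / sum(alpha); component n is
# marginally beta(alpha[n], sum(alpha) - alpha[n])
alpha <- c(2, 5, 1, 2)
y <- brms::rdirichlet(1e5, alpha)

colMeans(y)   # close to alpha / sum(alpha) = (0.2, 0.5, 0.1, 0.2)
mean(y[, 2])  # close to the mean of beta(5, 5), i.e. 0.5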

The retrieve_composition() function should work as intended.

You actually do have to exponentiate the coefficients reported by summary(fit) because brms parametrizes the dirichlet likelihood on the logit scale. It’s confusing to call the coefficient alpha because the Stan documentation uses alpha for the parameterization on the probability scale.

These are the relevant lines in the Stan code generated by brms, which I got with make_stancode(bind(anger, disgust, sadness, neutral) ~ domain * focus, data = df, family = dirichlet()).

// alpha = softmax(mu) * phi: the logit-scale predictor becomes the Dirichlet shapes
real dirichlet_logit_lpdf(vector y, vector mu, real phi) {
  return dirichlet_lpdf(y | softmax(mu) * phi);
}

...

for (n in 1:N) {
  target += dirichlet_logit_lpdf(Y[n] | mu[n], phi);
}
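In other words, brms uses alpha = softmax(mu) * phi, where mu is the logit-scale linear predictor (with the reference category pinned at 0). That gives the recipe asked for in the question: pick betas and a phi, compute alpha, and hand it to brms::rdirichlet(). Here is a minimal sketch; simulate_composition() and all the coefficient values in it are made up for illustration, not taken from the fit above:

softmax <- function(x) exp(x) / sum(exp(x))

# Sketch: turn hand-picked coefficients into Dirichlet shapes and simulate.
# Entry order: reference category first (fixed at 0), then the remaining
# categories in brms's alphabetical order.
simulate_composition <- function(n, domain, focus, phi = 5) {
  beta_0  <- c(0, -0.5, -1.0, -3.0)  # hypothetical intercepts
  beta_1  <- c(0,  3.0,  4.5,  1.0)  # hypothetical domain effects
  beta_2  <- c(0, -0.1, -1.0, -0.1)  # hypothetical focus effects
  beta_12 <- c(0, -0.4, -1.7, -0.6)  # hypothetical interaction (the effect size to vary)
  mu    <- beta_0 + beta_1 * domain + beta_2 * focus + beta_12 * domain * focus
  alpha <- softmax(mu) * phi         # exactly what dirichlet_logit_lpdf() does above
  brms::rdirichlet(n, alpha)
}

simulate_composition(12, domain = 1, focus = 0)  # 12 compositions for one design cell

(The softmax() helper also matters below: base R doesn’t provide one, and retrieve_composition() relies on it.)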

In retrieve_composition() estimates are rounded to two decimal places, so there is some loss of accuracy but not enough to explain the discrepancy. Perhaps you didn’t copy the estimates in the right order. I suspect that because I had trouble doing it correctly myself.

It’s a bit easier to extract the coefficients from the summary by name. For example:

retrieve_composition <- function(domain, focus) {
  # the beta coefficients from the summary
  beta_hat <- summary(fit)$fixed
  beta_hat <- setNames(beta_hat[, "Estimate"], rownames(beta_hat))

  beta_0 <- c(
    0,
    beta_hat["mudisgust_Intercept"],
    beta_hat["muneutral_Intercept"],
    beta_hat["musadness_Intercept"]
  )
  beta_1 <- c(
    0,
    beta_hat["mudisgust_domainPurity"],
    beta_hat["muneutral_domainPurity"],
    beta_hat["musadness_domainPurity"]
  )
  beta_2 <- c(
    0,
    beta_hat["mudisgust_focusOther"],
    beta_hat["muneutral_focusOther"],
    beta_hat["musadness_focusOther"]
  )
  beta_12 <- c(
    0,
    beta_hat["mudisgust_domainPurity:focusOther"],
    beta_hat["muneutral_domainPurity:focusOther"],
    beta_hat["musadness_domainPurity:focusOther"]
  )

  mu <- c(
    beta_0[1] + beta_1[1] * domain + beta_2[1] * focus + beta_12[1] * domain * focus,
    beta_0[2] + beta_1[2] * domain + beta_2[2] * focus + beta_12[2] * domain * focus,
    beta_0[3] + beta_1[3] * domain + beta_2[3] * focus + beta_12[3] * domain * focus,
    beta_0[4] + beta_1[4] * domain + beta_2[4] * focus + beta_12[4] * domain * focus
  )

  softmax(mu)
}

However, there is an easier way to do this calculation: use posterior_epred or posterior_linpred.

With posterior_linpred you’ll extract the draws for the parameters on the logit scale, average over the draws, and then apply the softmax transformation to the averages. This is essentially what retrieve_composition() does.

With posterior_epred you’ll first apply the softmax transformation to get probabilities, then average those probabilities over the draws.

These options give two (slightly) different answers.

library(posterior)  # summarise_draws() comes from the posterior package

dat10 <- data.frame(domain = "Purity", focus = "Self")

alpha <- summarise_draws(
  posterior_epred(fit, newdata = dat10)
)
alpha
#> # A tibble: 4 × 10
#>   variable  mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
#>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
#> 1 anger    0.398  0.397 0.0449 0.0447 0.325  0.473  1.00    2544.    2908.
#> 2 disgust  0.181  0.181 0.0356 0.0354 0.123  0.240  1.00    3886.    2571.
#> 3 neutral  0.271  0.271 0.0406 0.0406 0.207  0.339  1.00    3893.    2721.
#> 4 sadness  0.150  0.149 0.0323 0.0323 0.0999 0.207  1.00    4171.    2539.

mu <- summarise_draws(
  posterior_linpred(fit, newdata = dat10)
)
mu
#> # A tibble: 3 × 10
#>   variable   mean median    sd   mad     q5     q95  rhat ess_bulk ess_tail
#>   <chr>     <dbl>  <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>    <dbl>    <dbl>
#> 1 ...1     -0.803 -0.795 0.271 0.265 -1.26  -0.374   1.00    3081.    2577.
#> 2 ...2     -0.387 -0.382 0.231 0.225 -0.769 -0.0136  1.00    2965.    3094.
#> 3 ...3     -0.989 -0.976 0.282 0.278 -1.46  -0.542   1.00    3241.    2926.
softmax(c(0, mu$mean))
#> [1] 0.4001024 0.1792818 0.2717682 0.1488476

retrieve_composition(1, 0)
#>                     mudisgust_Intercept muneutral_Intercept musadness_Intercept 
#>           0.4001024           0.1792818           0.2717682           0.1488476

Note: Numbers don’t match those in the question because I simulated some fake data.

Hello, thank you for the answer!

By theoretical model I meant the combination of predictors that we hypothesise to be the data-generating process, say X1 + X2 + X1:X2. Indeed, this does not involve parameter values (I think my question was badly formulated). I wanted to inject parameter values there because I chose to conceptualise them as my “effect sizes” for power analyses. In my case, I want to examine the impact of the X1:X2 effect size on statistical power, so I figured I would test several values of beta_X1:X2 “by hand”. Hence the need for this “reverse-engineering” from parameter values: I want to set betas → obtain alpha → simulate data with alpha → fit the model → examine whether the effect was detected → rinse and repeat to obtain statistical power.
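Concretely, the loop I have in mind looks something like this sketch, where simulate_dataset() would wrap the data generation into a data frame with the design columns, and template_fit would be the brms model fitted once so that update() can reuse the compiled code (both names are hypothetical):

# Sketch of the intended power loop; "detected" means the 95% CI of the
# interaction coefficient excludes zero.
power_sim <- function(n_sims = 200, n_per_cell = 12) {
  detected <- logical(n_sims)
  for (s in seq_len(n_sims)) {
    df_sim <- simulate_dataset(n_per_cell)  # simulate with the chosen betas
    fit_s  <- update(template_fit, newdata = df_sim, refresh = 0)
    ci <- as.numeric(
      summary(fit_s)$fixed["mudisgust_domainPurity:focusOther", c("l-95% CI", "u-95% CI")]
    )
    detected[s] <- ci[1] > 0 || ci[2] < 0   # interval excludes zero?
  }
  mean(detected)  # proportion of detections = estimated power
}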

However, I chose to conceptualise the effect size as the betas on my own, and that itself might be the problem. Is it? I had to take a chance because I couldn’t find a single trace of an effect-size theory for Dirichlet regression on the internet, let alone for power analysis by simulation with Dirichlet models (it was hard to find resources for the regression itself to begin with).

What do you think about this method?

Thank you for the answer! I’m relieved to see that what I’ve done actually makes sense; I was starting to go insane working alone on this. I’m a bit far from my code right now, but I’ll go back to it as soon as I can to check whether I got the estimates mixed up. At the time I thought I had checked that, but sometimes you need to take a step back to understand (and see the obvious mistakes)!