# Simulating data for Dirichlet regression with varying estimates

Hello everyone,

The shortest formulation of my problem is the following:

I’m trying to design a power analysis by simulation for a Dirichlet regression model using `brms`. How could we calculate the shape parameter alpha of the Dirichlet distribution by hand from a theoretical model in order to feed it to `brms::rdirichlet(n, alpha)` to simulate data with chosen effects? The goal is to tweak manually the coefficients of the model to simulate data with varying effect sizes.

Now for the lengthier context and what I’ve tried:

The outcome I want to predict is a composition Y of four emotions (anger, disgust, sadness, neutral). I have two categorical predictors with two levels called Domain (Harm, Purity) and Focus (Self, Other). Thus, the model is `Y ~ Domain + Focus + Domain:Focus`. If my understanding of this vignette is correct (I suspect it isn’t), Y_i = exp(alpha_i) / sum(exp(alpha_k) for the mean proportion of emotion i, alpha being the shape parameter of the Dirichlet distribution.

I build some toy data by hand and fitted the model on it, then tried to make sense of the coefficients. Here is the model:

``````bind <- function(...) cbind(...)

model <-
brm(
formula = bind(anger, disgust, sadness, neutral) ~ domain * focus,
data = df,
family = dirichlet()
)

summary(model)
``````
``````Family: dirichlet
Links: mudisgust = logit; musadness = logit; muneutral = logit; phi = identity
Formula: y ~ domain * focus
Data: df (Number of observations: 48)
Draws: 7 chains, each with iter = 6000; warmup = 2000; thin = 1;
total post-warmup draws = 28000

Regression Coefficients: [see down below]

Further Distributional Parameters:
Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
phi 4.60 0.65 3.41 5.97 1.00 22872 19779
``````

Here are the regression coefficients with a bit of formatting on my end:

I know that given a `brmsfit` model, I can just use `predict()` (or any equivalent) to retrieve the composition I was aiming for above, as advised in several interesting threads like here or here. Here is a plot of these predicted values for this toy model, which fit nicely with the original data (the plot is custom, but it is virtually the same as `brms::conditional_effects`):

If my understanding is correct, when domain is Harm (0 dummy coded), proportions for (anger, disgust, sadness, neutral) should be somewhere around (0.6, 0.3, 0.1, 0), and when it is Purity (1 dummy coded), they should be around (0.1, 0.7, 0.2, 0).

However, when I try to retrieve approximate proportions using the link formula, I can’t seem to get these values, especially because the Purity coefficient for Sadness is higher than Disgust. I wrote a function that I thought to be able to compute the composition based on the coefficients:

``````retrieve_composition <- function(domain, focus){

# the beta coefficients from the summary
beta_0 <- c(0, -0.54, -1.1, -2.96)
beta_1 <- c(0, 3.13, 4.79, 2.95)
beta_2 <- c(0, -0.12, -0.99, -0.06)
beta_12 = c(0, -0.4, -1.71, -0.61)

alpha = c(
beta_0[1] + beta_1[1]*domain + beta_2[1]*focus + beta_12[1]*domain*focus,
beta_0[2] + beta_1[2]*domain + beta_2[2]*focus + beta_12[2]*domain*focus,
beta_0[3] + beta_1[3]*domain + beta_2[3]*focus + beta_12[3]*domain*focus,
beta_0[4] + beta_1[4]*domain + beta_2[4]*focus + beta_12[4]*domain*focus
)

composition = exp(alpha)/sum(exp(alpha))

return(composition)
}
``````

It looks ok for the intercept, as `retrieve_composition(0, 0)` gives (.51, .30, .17, .03). But for the Purity condition, i.e. `retrieve_composition(1, 0)`, I get (.02, .24, .72, .02), when it should be (0.15, 0.7, 0.1, 0)… Which is logical, because `beta_1[2]` is lower than `beta_1[3]`. However, these are the `brms` model coefficients, and the model fits and predicts the data correctly…

I can’t figure out how to retrieve the composition for the Purity condition. I feel like I’ve made an obvious misinterpretation due to my lack of mathematical literacy, but I don’t understand where… I’ve run out of ideas. What have I done wrong?

Sorry for the lengthy post, and thanks a lot for your help!

• Operating System: Windows 10
• brms Version: 2.21.0

Hi, @m-delem and welcome to the Stan forums. Sorry for the slow response. We have way more brms questions on the forums than we have brms users to answer them. I don’t know brms, but I’ll give it a shot w/o worrying about brms details.

I’m not sure what you mean by a theoretical model here. Usually theoretical models don’t involve actual values for parameters. What do you expect `alpha` to be? We know `alpha / sum(alpha)` is the posterior mean (it’s not exponentiated as you wrote—the Wikipedia is good for this kind of thing). Then marginally, each variable n is distributed `beta(alpha[n], sum(alpha) - alpha[n])`. If you have a bunch of examples of `y`, where `y ~ dirichlet(alpha)`, then you can fit `alpha` in Stan.

The `retrieve_composition()` function should work as intended.

You actually do have to exponentiate the coefficients reported by `summary(fit)` because brms parametrizes the dirichlet likelihood on the logit scale. It’s confusing to call the coefficient `alpha` because the Stan documentation uses alpha for the parameterization on the probability scale.

These are the relevant lines in the Stan code generated by brms, which I got with `make_stancode(model, data, dirichlet())`.

``````real dirichlet_logit_lpdf(vector y, vector mu, real phi) {
return dirichlet_lpdf(y | softmax(mu) * phi);
}

...

for (n in 1:N) {
target += dirichlet_logit_lpdf(Y[n] | mu[n], phi);
}
``````

In `retrieve_composition()` estimates are rounded to two decimal places, so there is some loss of accuracy but not enough to explain the discrepancy. Perhaps you didn’t copy the estimates in the right order. I suspect that because I had trouble doing it correctly myself.

It’s a bit easier to extract the coefficients from the summary by name. For example:

``````retrieve_composition <- function(domain, focus) {
# the beta coefficients from the summary
beta_hat <- summary(fit)\$fixed
beta_hat <- setNames(beta_hat[, "Estimate"], rownames(beta_hat))

beta_0 <- c(
0,
beta_hat["mudisgust_Intercept"],
beta_hat["muneutral_Intercept"],
)
beta_1 <- c(
0,
beta_hat["mudisgust_domainPurity"],
beta_hat["muneutral_domainPurity"],
)
beta_2 <- c(
0,
beta_hat["mudisgust_focusOther"],
beta_hat["muneutral_focusOther"],
)
beta_12 <- c(
0,
beta_hat["mudisgust_domainPurity:focusOther"],
beta_hat["muneutral_domainPurity:focusOther"],
)

mu <- c(
beta_0[1] + beta_1[1] * domain + beta_2[1] * focus + beta_12[1] * domain * focus,
beta_0[2] + beta_1[2] * domain + beta_2[2] * focus + beta_12[2] * domain * focus,
beta_0[3] + beta_1[3] * domain + beta_2[3] * focus + beta_12[3] * domain * focus,
beta_0[4] + beta_1[4] * domain + beta_2[4] * focus + beta_12[4] * domain * focus
)

softmax(mu)
}
``````

However, there is an easier way to do this calculation: use `posterior_epred` or `posterior_linpred`.

With `posterior_linpred` you’ll extract the draws for the parameters on logit scale, average over the draws and then apply the softmax transformation to the averages. This is essentially what `retrieve_composition()` does.

With `posterior_epred` you’ll first apply the softmax transformation to get probabilities, then average those probabilities over the draws.

These options give two (slightly) different answers.

``````dat10 <- data.frame(domain = "Purity", focus = "Self")

alpha <- summarise_draws(
posterior_epred(fit, newdata = dat10)
)
alpha
#> # A tibble: 4 × 10
#>   variable  mean median     sd    mad     q5   q95  rhat ess_bulk ess_tail
#>   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>    <dbl>    <dbl>
#> 1 anger    0.398  0.397 0.0449 0.0447 0.325  0.473  1.00    2544.    2908.
#> 2 disgust  0.181  0.181 0.0356 0.0354 0.123  0.240  1.00    3886.    2571.
#> 3 neutral  0.271  0.271 0.0406 0.0406 0.207  0.339  1.00    3893.    2721.
#> 4 sadness  0.150  0.149 0.0323 0.0323 0.0999 0.207  1.00    4171.    2539.

mu <- summarise_draws(
posterior_linpred(fit, newdata = dat10)
)
mu
#> # A tibble: 3 × 10
#>   variable   mean median    sd   mad     q5     q95  rhat ess_bulk ess_tail
#>   <chr>     <dbl>  <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl>    <dbl>    <dbl>
#> 1 ...1     -0.803 -0.795 0.271 0.265 -1.26  -0.374   1.00    3081.    2577.
#> 2 ...2     -0.387 -0.382 0.231 0.225 -0.769 -0.0136  1.00    2965.    3094.
#> 3 ...3     -0.989 -0.976 0.282 0.278 -1.46  -0.542   1.00    3241.    2926.
softmax(c(0, mu\$mean))
#> [1] 0.4001024 0.1792818 0.2717682 0.1488476

retrieve_composition(1, 0)
#>           0.4001024           0.1792818           0.2717682           0.1488476
``````

Note: Numbers don’t match those in the question because I simulated some fake data.

Hello, thank you for the answer!

By theoretical model I meant the combination of predictors that we hypothesise to be the data generating process, say X1 + X2 + X1:X2. Indeed, this does not involve parameter values (I think my question was badly formulated). I wanted to inject parameter values in there because I chose to conceptualise them as my “effect sizes” for power analyses. In my case, I want to examine the impact of the X1:X2 effect size on statistical power, thus I figured that I would like to test several values of beta_X1:X2 “by hand”. Hence the need for this “reverse-engineering” from parameter values : I want to set betas → obtain alpha → simulate data with alpha → fit the model → examine if the effect was detected → rinse and repeat to obtain statistical power.

However, I chose to conceptualise the effect size as the betas on my own, and that itself might be the problem. Is it? I had to take a chance because I couldn’t find a single trace of an effect size theory for the Dirichlet regression on the internet, let alone for power analysis by simulation with Dirichlet (it was hard to find ressources for the regression itself to begin with).