Here is a paper (https://arxiv.org/pdf/1808.06399.pdf) that depicts the cross-over between RStan and DirichletReg. From what I can tell, RStan and consequently brms, use the alternative parameterization, although I have not been able to find any documentation other than this “This provides essentially K degrees of freedom, one for each dimension of alpha, and it is not obvious how to specify a reasonable prior for alpha. An alternative coding is to use the mean, which is a simplex, and a total count.” from Stan User’s Guide.
My questions are:
Which parameterization does brms use when family = “dirichlet”?
Assuming the alternative parameterization is used, is there a way to specify the common parameterization for Dirichlet regression using brms (or vice versa)?
You can always use make_stancode in brms to inspect the generated model. This is usually enough to check what exactly is brms doing under the hood.
Implementing custom multivariate distribution in brms is possible, but requires some hacks. If you decide you really need it and are OK with writing some code in pure Stan, I’d be happy to guide you through the main ideas.
@martinmodrak Thank you for the response. Unfortunately, my understanding of the actual statistics and stan language behind the models is limited. I looked at make_stancode for a simple version of my model and it is beyond me.
I feel fairly confident that the default setting though is the alternative parameterization based on the output providing an estimate for phi.
With that being said, I was just hoping someone with more knowledge of the subject and stan/brms could verify my guess.
I appreciate your offer. The main reason I am interested in whether or not there was a simple way to implement the common parameterization (assuming default is the alternative) is because I am finding it difficult to determine what the estimates, once multiple predictors are included, are in reference to (Dirichlet regression requires a reference category). I do not necessarily want to use the common parameterization, especially if I am able to better understand the output of my current model(s). I am happy to provide a model output to depict the issue I am having if you are interested.
Yeah, it can be quite intimidating. There usually is a very impenetrable code to construct the linear predictors which looks quite alien to what most humans would write, but then once the predictor arrays (e.g. mu) are constructed, you can usually clearly see how the likelihood part is built. But completely agree that it can be far from obvious :-)
The relevant part I see in the code is this:
functions {
/* dirichlet-logit log-PDF
* Args:
* y: vector of real response values
* mu: vector of category logit probabilities
* phi: precision parameter
* Returns:
* a scalar to be added to the log posterior
*/
real dirichlet_logit_lpdf(vector y, vector mu, real phi) {
return dirichlet_lpdf(y | softmax(mu) * phi);
}
}
....
model {
...
for (n in 1:N) {
mu[n] = transpose([0, muy2[n], muy3[n]]);
}
for (n in 1:N) {
target += dirichlet_logit_lpdf(Y[n] | mu[n], phi);
}
}
So what happens is quite similar to categorical/multinomial regression. The linear predictor for the first response is fixed to 0 on the logit scale and the others represent changes against this category on the logit scale.
One way to get around difficulties in interpreting coefficients in the model is to use model predictions to make inferences. Interpreting single coefficients also corresponds to a prediction task (roughly “what would be the difference on the logit scale if I only changed this predictor and there was no noise”). But you can use posterior_epred or posterior_predict to build predictions suited to the question you are asking (e.g. what would be the observed difference between groups if we replicated the experiment, but allocated all participants in the same group)
@martinmodrak thank you very much for taking the time to write this up, I really appreciate it.
I am still unable to tell from your supplied code whether the common or alternative parameterization is being implemented. My guess is the alternative since I see phi in the code. Do you know which form you wrote the code for?
I will definitely look more into posterior_epred and posterior_predict, as you suggest.
Sorry, could have been clearer. I am convinced brms uses the “alternative” parametrization as described in “Bayesian Regression for a Dirichlet DistributedResponse using Stan” and the first component is always taken as the reference.
In regards to the reference category, I have supplied a model output (just the parameter names to simplify things) in hopes of better understanding what the estimated parameters are being compared to in this specific regression type. The reference category is Traveling. Therefore, I assume that every estimate is in relation to it (muTraveling_Intecept for population level and sd(muTraveling_Intecept) for the group level).
When looking at the population level estimates, the first level of the predictor “Day Period” is missing as well. That leads me to assume the reference is actually muTraveling_Day Period 1. If that is the case, does that mean any parameter with the second predictor “Ave_Herd” is being compared to muTraveling_Day Period 1 (essentially the effect of “Ave_Herd” with the predictor “Day Period” held constant/at its reference level)? If so, I can understand that but where then are my estimates for the effect of Day Period 1 on the levels of my response category compared to muTraveling_Day Period 1? Since those are missing, I wonder if the reference is not Day Period 1 for each level of the response. But if that is the case then why are there no estimates for Traveling (if it is not actually the reference category in that scenario)?
Samples were drawn using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
I think this is best understood by thinking about a dirichlet regression for two categories (which is actually a beta regression).
Assume our model is just:
bf(bind(Traveling, Other) ~ Day_Period, family = "dirichlet")
The model will only have coefficients muOther_Intercept, muOther_Day_Period2 and muOther_Day_Period3. Why? Because once we choose a period, the response in Other completely determines the response in Traveling, as Other + Traveling == 1. Any gain in Other must be counteracted by equal gain in Traveling. So for two categories, we need one number for each covariate combination to fully describe the model. In fact, we wouldn’t model this case as a Dirichlet regression, but rather as beta regression where it is no surprise to have only a single predictor.
Now, when we have 3 categories, the outcome for each covariate combination is described by 2 numbers as all the responses need to sum to 1. So once we have predictors for two categories, the reference category is completely determined and does not need any predictors. The interpretation however becomes more difficicult as an increase in say Vigilant can be compensated by either a decrease in Other, Traveling or both, depending on the other coefficients and this introduces interactions between the model terms. That’s why using posterior_epred might be more direct as it automatically translates those interactions into expected values on the outcome scale. However, you’ll note that on the outcome scale an effect of a predictor depends on the values of other predictors, which brings some additional difficulties.
I understand what you are saying in regards to 2 numbers only being needed to describe a response with 3 categories (example 2). That is, the values must sum to 1 and therefore the 3rd value is the remainder of 1 minus the other two.
I am understanding correctly that Day Periods 2 and 3 determine Day Period 1 within a response category and Day Periods 2 and 3 across the non-reference categories (Vigilant and Other) determine Day Period 2 and 3, respectively, for the reference category (Traveling)?
I do not quite understand what is meant by “So once we have predictors for two categories, the reference category is completely determined and does not need any predictors.” Since estimates for predictors do not sum to 1 (as do the response variable values), I am not sure how the reference category estimate is determined. That is, the estimate for Day Period 1 could be 5 and the estimates for Day Periods 2 and 3 could be 2, .5 respectively.
In your second example, how would you describe the estimate given for muOther_Day Period 2?
posterior_epred gave me 16,000 values (the number of iterations I ran) for each response category. Do I take the mean of those?
I have found plotting conditional_effects to be very helpful to visualize the results. Since both Day Period 1 (for all response categories) and Traveling (all levels of Day Period) have data points on this plot, my assumption is that the levels of the response and predictor are not in relation to the reference levels anymore but rather are the “true” effects of the predictors. Is this correct?
I don’t think I understand exactly what you mean, but my hunch is that this is not correct.
To define a Dirichlet distribution in K dimensions, we need the precision (phi) and a K-simplex (set of K numbers between 0 and 1 that sum to 1). A K-simplex is completely determined by K-1 values. Since we want our linear predictors to be unbounded, brms (and other packages) uses the softmax function to map from an unbounded K-dimensional vector to a K-simplex. The softmax function for a vector y is:
I.e. all elements are exponentiated and then the vector is divided by the sum of the exponentiated values to enforce that sum is 1. You’ll note that from the properties of the exponential, if you add the same number to all elements of the vector y, you get the same result, i.e. if c is a single constant (not a vector) \text{softmax}(y + c) = \text{softmax}(y). So even if we restrict the first element to be 0 (as brms does) we have not “lost” any possible result of the softmax.
In the full model, we want to construct a (possibly different) Dirichlet distribution for each combination of our covariates (e.g. Day_Period = 3, Ave_Herd = TRUE, Individual_ID = 338, Session = 5 is a combination of covariates). Since phi is assumed fixed, we need to somehow “predict” only K-1 values. Those values can be interpreted as the mean proportion relative to the first category on a logit scale. The first category here is treated as a fixed point and is not estimated for any covariate combination. Once again this is similar to the binary case - when I estimate an increase in category 0, it implies a decrease in category 1. In 3-dimensional case, an increase in predictor for one category means that this category will be more prevalent relative to the reference category, but if the predictor for the final category increases even more, than in fact the absolute prevalence of the second category can drop (since the absolute prevalence of the refernce category drops even more).
What I would expect you see on the conditional_effects plot are those relative effects that are super hard to interpret on their own. To given an example how this could look like with the formula bf(bind(Traveling, Vigilant, Other) ~ Day_Period, family = "dirichlet")
muVigilant_Intercept <- 3
muVigilant_DayPeriod2 <- 1
muOther_Intercept <- -1
muOther_DayPeriod2 <- 5
# Predict the mean vector for case DayPeriod = 1
linpred_DayPeriod1 <- c(0, muVigilant_Intercept, muOther_Intercept)
linpred_DayPeriod1
#> [1] 0 3 -1
brms:::softmax(linpred_DayPeriod1)
#> [,1] [,2] [,3]
#> [1,] 0.04661262 0.9362396 0.01714783
# Predict the mean vector for case DayPeriod = 1
linpred_DayPeriod2 <- c(0, muVigilant_Intercept + muVigilant_DayPeriod2,
muOther_Intercept + muOther_DayPeriod2)
linpred_DayPeriod2
#> [1] 0 4 4
brms:::softmax(linpred_DayPeriod2)
#> [,1] [,2] [,3]
#> [1,] 0.009074715 0.4954626 0.4954626
So despite the coefficient for muVigilant_DayPeriod2 being positive, the actual proportion of the the Vigilant state decreased. (but the relative proportion to Traveling increased).
I am not sure I can explain this much better, so hope this clarifies more than confuses :-)
@martinmodrak I believe I’m finally understanding what you getting at. Thank you for sticking with me. If you wouldn’t mind answering one more question, I would greatly appreciate it.
Could you please describe the estimate of Day Period 2 on Vigilant in your example? That is, how would one properly state Day Period 2’s effect on Vigilant (if it is easier, it can be in broad terms such in increases, decreases, etc., rather than converting from the logit scale). This may be harder done than said but I think it could go a long way in succinctly explaining the estimate’s relationship to the reference categories (of the response and the predictor).
That’s actually harder, because the parameters don’t have a good interpretation outside of the logit scale.
I’ll try another code demonstration - starting with the same code as above to construct the mean probability vectors I’ll show how we can get back to the linear coefficients:
So the coefficients reflect the (changes in) logarithm of the odds of the non-reference categories against the reference category.
In particular, whether a category “increases” in the response can only be judged by looking at coefficients for all the categories. This is where using predictions comes in handy…