Over on the brms forums, I’ve been discussing some differences I’ve found between brms and rstanarm. That thread (including links to data, models, and some plots and printouts of the underlying problem) can be found here
With the gracious help of @paul.buerkner, I think I’ve narrowed down the difference to be attributable to differences in the prior for the group-level variance. Is it the case that the default group-level variance prior in rstanarm is much different from the default group-level variance prior in brms?
If it is, I would next like to think about how to compare the prior specified in rstanarm and that specified in brms. I believe that the brms prior is t-distributed. I don’t have much of a problem reasoning about t-distributions. However, the description of the rstanarm prior for variance found here is more difficult for me. If my reading is correct, I think if I were to adjust the concentration argument of decov() to be greater than 1, that this would lead to greater regularization of the group-level variance (and correspondingly, to obtain behavior that looks more like brms, I could set it to be less than 1). Is this correct?
Finally, in the context of this model, where one of the group-level effects has a low number of groups (5 sites in my case), how do I determine what a reasonable value is for this prior?
The group-level variance isn’t a parameter in rstanarm, and I don’t think it is in brms either. In the case where only the intercept differs by group, then the standard deviation has a gamma prior in rstanarm (with shape and scale both 1 by default) and I guess it is half-t in brms. For the default values, I guess the two are not so different but if you specify the shape of the gamma prior to be greater than 1, then there is no mass at zero and the mode is interior.
In the case where both the intercept and one or more of the coefficients vary by group (for a total of K group-specific parameters), it is more complicated. The aforementioned gamma prior pertains to a scale parameter, which is squared, multiplied by K, and then multiplied by each margin of a simplex to form the K variances as a transformed parameter. This construction is pretty different from specifying independent half-t priors on the standard deviations. You can specify the concentration parameter to be greater than 1 in order to squeeze the margins of the simplex closer together or less than 1 in order to make the margins of the simplex more polarized, but neither makes anything more closely resemble independent half-t variates.
If there are only 5 sites, then there is not much information in the data with which to estimate hyperparameters, particularly if K > 1. If K = 1, then think through what you believe the prior mean and median of the group-level standard deviation to be and use
to work back to the shape and scale. If K > 1, then the scale parameter is something like the average standard deviation across the K variables. Also, in the K > 1 case, you probably want to choose values of the concentration parameter and regularization parameter (in the LKJ distribution over correlation matrices) to be greater than 1.