Setting custom beta family in brms

Hello, I a fitting a hierarchical model in brms for an ecological meta-analysis. I am interested in magnitude of the effect size so for the response in the model I am using the absolute values of the effects sizes as opposed to both the positive and negative effects. This makes perfect biological sense for the study, which I won’t go into here. See the two plots below for a visual representation of this:

Before absolute value
Plot wo ABSval

After absolute value:
Plot w ABSval

As you can see from the second plot, the distribution of the data I am fitting is mostly near zero but will never be below zero with a tail extending out to ~14. As such, I have been playing around with a couple families (lognormal, beta, and normal (which is very clearly wrong, but there for comparison)) to see which distribution would be best for my data. It also seems like halfnormal or negative exponential distributions may be appropriate, neither of which are native to brms but I would be interested in trying to use with leave one out or kfold model comparison.

Because the beta distribution inherent to brms works on probability it’s bounded between 0 and 1. I know from a couple of sources (Ben Bolker’s book in particular) that the beta distribution is extensible beyond 1, and I would like to make a custom family that does so that more appropriately matches the distribution of my data. I’ve looked at this resource https://cran.r-project.org/web/packages/brms/vignettes/brms_customfamilies.html but unfortunately don’t have enough experience with this stuff to move from the provided example to a beta distribution.

Currently, as a work around to be able to compare models using leave one out (loo() in brms) I am fitting my intercept only model where the response is divided by 100. This seems obviously wrong, but I am not sure how wrong. Here is a quick bite of that code without the other model specifications

brm(
  effect.size/100 ~ 1 + (1|Paper.Name/Trait/Experiment))

Using this method, loo() suggests that the beta distribution is the better preforming model:

Model comparisons:
                     elpd_diff se_diff
my_model_int_beta       0.0       0.0 
my_model_int_lognorm -435.9       4.1 

Also, strangely, when I run the model with a lognormal distribution I end up with a mean posterior distribution less than zero, which seems a bit strange given the bounds of the family (and that it appears there is no link function), but like I have indicated earlier this all fairly new to me.

Thanks in advance for your help.

Hi! Welcome to the Stan forums!

Unfortunately, I don’t know anything about ecology models and rarely use brms. But maybe @mbjoseph can help you with your model? I don’t know, if he uses brms, though…

Cheers,
Max

Hey @msparks! I haven’t used brms much, or implemented custom families but based on the link you posted, it seems like you’d need the probability density function (PDF) for a custom distribution. Do you know what this PDF is? If you can express it in math notation, then it might be easier to help with implementing it in brms.

Second, assuming that none of the effect sizes equal zero, I wonder whether a gamma distribution for the distribution of the absolute effect size provides an easier alternative (because it respects that y>0, has no upper bound, and is currently implemented in brms).

2 Likes

Thanks for the response, sorry for the long delay–I had some personal issues that came up. We did implement the gamma distribution. The LOO actually liked the beta the best using the methodology where we divided the response by 100, but again I’m not sure it’s the most appropriate. The gamma actually preformed worse than the lognormalScreenshot 2020-04-13 10.25.15.

I believe you’re right about coming up with the PDF, but I am struggling to find the best way to solve for the parameters needed for a given distribution (e.g., alpha and beta for beta, shape and scale for gamma). I obviously know the mean and sd of my data, but haven’t been able to find a good solution for solving for those in R.

Hey, it’s getting late here so just two quick thoughts:

  1. It might be useful to do some PPCs of observed effect sizes vs. expected. It’s hard to see what’s going on just from the LOO results.

  2. You could also consider fitting the model on the original data, even if you’re just interested in the absolute values and compute the magnitude of the effect size in the generated quantities block (or just afterwards in R when using brms).

@paul.buerkner Tagging you here like you asked in your email, thanks!

Does your data has a theoretical upper bound? If not, any kind of scaled beta (which assumes such an upper bound) is probably not sensible. A convenient family could be the exgaussian() family. It does allow for a lot of right-skewness and has a simple linear mean parameterization. It doesn’t respect the lower boundary though. Alternatively, using hurdle_lognormal could be an option to include zeros in the lognormal distribution. However, this assume zeros to come from a different process than the rest of the data, which may also not be entirely what you want.

1 Like

Paul, thanks for the insight. Indeed, I don’t think there would be a theoretical upper bound to my data. I ran those additional families and the loo is still preferring my beta family models. It doesn’t seem liking turning my response into a decimal by division of the same number (e.g., 100) is an appropriate approach. Also, my Pareto k values seem to mostly come out around the 0.5 - 1 range, which also seems problematic based on your documentation. Runnign the kf

Also, any idea why I might be getting negative posterior mean estimates when using a lognormal distribution and only positive input data? Obviously, my data lump around 0 but all are positive which seems counterintuitive to me.

How did you find that the beta models are preferred? Loo is not valid if you fit the models to different responses (even if they are linear transformations of one another).

Ah okay. I didn’t realize it functioned similar to AIC in that respect (though it seems more obvious now). Then the beta models would be out and I’m left with the others on this list. Which indicates the lognormal is still the best.

Screenshot 2020-04-21 13.19.39

But it still seems weird I am getting negative mean posterior estimates from the lognormal model given it’s bounds. But I haven’t run models within that family before, so maybe not as strange as it seems. (Just a note, this model didn’t mix as well as similar ones in the past with same iterations, but the estimates are pretty exactly the same.)

lognorm model

Again, thanks for your help on this.

AIC is an approximation to loo in fact so they are related indeed. Negative predictions for lognormal cannot be right. Do you have a minimal reproducible example for me?

I will put together something this afternoon and post it. Thanks!

Here is the data with only the necessary pieces in a .csv file. Columns are average effect size, the variance of that effect size, paper the data was taken from, the trait being measured (multiple at times from a paper), and the experiment the measurement came from (sometimes multiple for a trait and paper).

simple_data.csv (16.9 KB)

Using brms 2.10.0 and R 3.5.1

Here is the code used to run the model:

model_int_lognorm <- brm(
  avg_effsize ~ 1 + (1|paper/trait/experiment_num), 
  data = simple_data,
  family = lognormal(),
  iter = 80000,
  warmup = 7000,
  cores = 4,
  control = list(adapt_delta = 0.999, max_treedepth = 15)
)

And then the summary() and plot() associated with that model:

So you mean some coefficients were negative? This is totally ok since lognormal models data on the log scale. I thought that actual predictions (via posterior_predict) would have been negative, which would have been a bug.

I see, so if I exponentiate the coefficients estimates from the above table and the distribution of those estimates I will get the posterior predictions in a linear scale.

So for example:

post_samples <- posterior_samples(model_int_lognorm)
hist(exp(post_samples$b_Intercept))

Hist