Fitting mixed effect GAM model in brms and setting of priors

Hello ,

I am new to Bayesian modelling. I am an ecologist. I am trying to model the visitation frequency of a species of bird to a species of plant. My dataset has several variables: frequency_f (response variable), elevation (numerical variable with discrete values), season (categorical variable with two levels), flowers (numerical variable also with descrete values). I want to run a mixed effect GAM model with a smoothing term for elevation using brms R package. I decided to set a smothing term for elevation after checking the relationship of these two variables via a plot. Frequency of visitation seem to peak at the middle elevations and then it drop again. This is the formulation of the model:

elevation_all_plants_mbrms ā† brms::brm(frequency_f ~ s(elevation) + season + flowers + (1|plant_ID),
data = hypericum_all, family = hurdle_gamma(link = ā€œlogā€), chains = 5,
iter = 10000, warmup = 3000, cores = 4)

I chose the hurdle_gamma family since my response variable is continuous and itā€™s zero inflated. It canā€™t take negative values. Thus, I want to see if the visitation frequency of the birds is affected by elevation, season, and the number of flowers each plant has. As random factor I chose each plant individual. After running the model with default priors, these are the results :

Family: hurdle_gamma 
  Links: mu = log; shape = identity; hu = identity 
Formula: frequency_f ~ s(elevation) + season + flowers + (1 | plant_ID) 
   Data: hypericum_all (Number of observations: 242) 
  Draws: 5 chains, each with iter = 10000; warmup = 3000; thin = 1;
         total post-warmup draws = 35000

Smooth Terms: 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sds(selevation_1)     2.03      1.16     0.56     5.05 1.00     1111      420

Group-Level Effects: 
~plant_ID (Number of levels: 242) 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)     0.33      0.25     0.01     0.88 1.01      569      285

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        -2.33      0.26    -2.83    -1.84 1.01     1528     2375
seasonWetMdry     0.28      0.24    -0.17     0.75 1.00     2398     1209
flowers           0.00      0.01    -0.01     0.02 1.00    11113     4558
selevation_1      0.55      3.05    -6.11     6.13 1.00     4797     1892

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
shape     2.47      2.98     1.28     9.73 1.01      520      216
hu        0.54      0.03     0.48     0.60 1.00     7140    18740

If I understood well the hurdle_gamma family will model the zero VS non-zero, and then the non-zero values will be modelled. Thus, I will get a probability of non-zero and then the estimates for each of the effects for the non-zero values. Is this correct? Is the choice of this family correct? None of the tested effects plays a role in the visitation frequency of birds. However, now I want to run the model again with informative priors:.Here it is the code :

prior1<- c(set_prior("normal(0.1031292, 0.5)", class = "b", coef = "Intercept"),
           set_prior("uniform(2270,3530)", class = "b", coef = "s(elevation)"),
           set_prior("normal(0,1)", class = "b", coef = "season"),
           set_prior("truncated_normal(0, 12.33058, 11.98781)", class = "b", coef = "flowers"),
           set_prior("normal(0,1)", class = sigma))

I chose those priors based of some descriptive statistics of my data and my knowledge of how it should be. Thus, I considered that the Intercept of the model canā€™t be negative and it should be centered around the mean. For Elevation I set a uniform probability for every elevation and I gave the range of elevation present in my dataset. For season, I set a big SD to account for the effect of the two seasons. For the number of flowers, again it canā€™t be negative and it should be cantered around the mean and with the SD of the actual data . Finally, the random effect has a huge SD to account for each individual plant present in the dataset.

I do not know if this makes. Can someone help me out?

Also, when I try to run this in R I got the following error:

> prior1<- c(set_prior("normal(0.1031292, 0.5)", class = "b", coef = "Intercept"),
+            set_prior("uniform(2270,3530)", class = "b", coef = "s(elevation)"),
+            set_prior("normal(0,1)", class = "b", coef = "season"),
+            set_prior("truncated_normal(0, 12.33058, 11.98781)", class = "b", coef = "flowers"),
+            set_prior("normal(0,1)", class = sigma))
Error: Processing arguments of 'set_prior' has failed:
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
  arguments imply differing number of rows: 1, 0

Can someone help me out to make it work?

I am running brms in:

R version 4.3.1 (2023-06-16 ucrt) ā€“ ā€œBeagle Scoutsā€
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type ā€˜license()ā€™ or ā€˜licence()ā€™ for distribution details.

R is a collaborative project with many contributors.
Type ā€˜contributors()ā€™ for more information and
ā€˜citation()ā€™ on how to cite R or R packages in publications.

Type ā€˜demo()ā€™ for some demos, ā€˜help()ā€™ for on-line help, or
ā€˜help.start()ā€™ for an HTML browser interface to help.
Type ā€˜q()ā€™ to quit R.

Brms version:

packageVersion(ā€œbrmsā€)
[1] ā€˜2.19.0ā€™

1 Like

Hi Guillermo, welcome to the Stan forums and sorry nobody responded sooner.

I think you need class = "sigma" (with quotes) instead of class = sigma. That should at least get rid of the error message.

Iā€™m not actually sure how brms implements this, but yes your description does sound like a hurdle model.

A few comments:

  • If the intercept really canā€™t be negative (as opposed to just probably isnā€™t negative) you can set lb = 0 in the call to set_prior().
  • It looks like youā€™re modeling elevation with a smooth term s(elevation) and Iā€™m not actually sure how to set the prior on that in brms. Maybe your code will work, but you can also probably find examples on the forum or in the documentation of other people setting priors on smooth terms. And at first glance Iā€™m not sure a uniform distribution on the range of values in your data makes sense here. The idea presumably is to model the effect of elevation, but that doesnā€™t mean that the values of the estimated parameter representing that effect should be of the same size as the values in the data.

All that said, I unfortunately know basically nothing about ecology models, so Iā€™m not sure I can offer any additional advice on how to model this data. But we do have some ecologists here on the forum so maybe someone else will chime in. I know @jsocolar would probably have some good advice, but I almost feel bad tagging him because he already responds to so many posts here ;) Maybe if @jsocolar doesnā€™t have time he might also know of somebody else on the forum who could be of more use in answering questions about ecology models.

2 Likes

Hi @Guillermo_Uceda, sounds like a cool application! This is a question that touches on a bunch of different parts of brms, which is good news and badā€“good because it sounds like you are taking full advantage of the capabilities of brms to fit this model, and bad because the answer is going to be a little bit complicated and hard to digest.

Your approach certainly seems reasonable, but to really figure out whether this model is appropriate it would be good to look at some posterior predictive checks. Ultimately, the data are your best guide (albeit an imperfect one) to whether your model is adequate.

Distributional regression

Many response distributions in generalized linear models have two or more parameters. For example, normal distributions have a mean and a standard deviation. Commonly, we regress the mean on covariates and assume the other parameters (like the standard deviation) are homogeneous across all values. In brms it is pretty easy to relax this assumption, as described in this vignette: Estimating Distributional Models with brms. In particular, I wonder whether it might make sense for you to regress the hurdle probability on covariates. I think that this distributional parameter will be called hu, and you can include it in just the same way that zi gets treated in the zero-inflated example in the vignette linked above.

Priors on coefficients

It seems like you are tempted to place priors on coefficients based on the distribution of the covariate. This isnā€™t right though. The prior is not about the probable values of the covariate; it is a prior on the associated effect size (i.e. the value of the associated coefficient). So for example, the number of flowers canā€™t be negative, but do you also intend to be absolutely certain a priori that the visitation rate cannot possibly go down as the number of flowers increases? And the effect size shouldnā€™t be centered around the meanā€“it should be centered either around your a priori belief about how much an increase in flowers should affect visitation (if you are using informative priors) or around zero (if you are using weakly informative priors for regularization only).

Priors on the intercept

One thing to watch out for in brms is that by default the prior that you define on the intercept will be the prior that you define on the intercept when all covariates are held at their means, not when all covariates are held at zero. However, for certain special classes of effects, things can be a bit more complicated, and Iā€™m not 100% sure that the default behavior is guaranteed to be the prior when all covariates are held at their means in the presence of a spline in the model. To be safe, I would suggest using the 0 + Intercept syntax described here to enable you to put priors on the true intercept (i.e. with covariates held at zero): https://cran.r-project.org/web/packages/brms/brms.pdf (see page 44). If you want to define an intercept with covariates held at their means, you can always manually center the covariates before passing them to the model, and still use the 0 + Intercept syntax. Also, I think this issue might have something to do with the error you see (or could yield another error once you fix your error) because in the prior specification class = "b", coef = "Intercept") should only work when using the special Intercept keyword in the model formula.

Priors on splines

This part is conceptually hard, and thereā€™s no way around it. The best resource on understanding how to set the priors is this thread, and particularly the post from Gavin Simpson Better priors (non-flat) for GAMs (brms). However, to understand that post, youā€™ll need some notion of splines as random effects, and for that a good resource is this wonderful post by @tjmahr Random effects and penalized splines are the same thing - Higher Order Functions, but even with this post as a guide I find Bayesian splines to be quite challenging conceptually.
Something that is definitely true is that set_prior("uniform(2270,3530)", class = "b", coef = "s(elevation)") is not achieving what you want. Itā€™s not the right syntax for setting a prior on a smooth, and there is no reason that the range of elevations observed in the data should correspond at all to your prior on the effect size for elevation.

2 Likes

Great answer @jsocolar, thank you.

We do the same thing in rstanarm but I still often forget about this.