Multivariate phylogenetic model with repeated measurements - help

I would really appreciate your input on model design for a data set that explores multiple predictors on a proportion-type response, with a phylogenetic correction and multiple sampling events per species. I’ve had to piece together several guidelines for the brms package because of the nature of these data, and I have no one to consult on this topic.
Here’s a sample of my data:

library(brms)   # model fitting
library(ape)    # rtree(), compute.brlen(), vcv.phylo()
library(MuMIn)  # stdize()

### Define data
M_count <- rpois(n = 10, lambda = 4)
F_count <- rpois(n = 10, lambda = 3)
total_N <- (M_count+F_count)
Prop <- (M_count/(M_count+F_count))
year <- rpois(n=10, lambda = 2015)
species <- c("t1","t2","t1","t3","t3","t4","t5","t6","t2","t6")
location <- c("loc1","loc1","loc3","loc2","loc4","loc4","loc2","loc3","loc3","loc4")
temp <- rnorm(n=10, mean = 10, sd=2.7)
ALT <- rnorm(n=10, mean = 300, sd = 220)
Lat <- rnorm(n=10, mean = 31, sd=1.2)
temp.std <- stdize(temp)
ALT.std <- stdize(ALT)
Lat.std <- stdize(Lat)
dat <- data.frame(M_count, F_count, total_N, Prop, year, species, location, temp, ALT, Lat, temp.std, ALT.std, Lat.std)
 
#add 'phylo'
dat$phylo <- dat$species

#add tree (rtree() labels its 6 tips "t1"-"t6", matching the species codes above)
tree <- rtree(n = 6)
tree_l <- compute.brlen(tree, method = "Grafen", power = 1)
A <- ape::vcv.phylo(tree_l)  # phylogenetic covariance matrix

#Fit a Binomial Logistic Regression Model
Bayes_Model_Prop_fit1 <- brm(M_count | trials(total_N) ~ ALT.std + Lat.std + temp.std + (1|phylo) + (1|species),
                             cov_ranef = list(phylo = A),
                             data = dat,
                             family = binomial(link = "logit"),
                             prior = c(
                               prior(student_t(3, 0, 10), "sd"),
                               prior(student_t(3, 0, 20), "Intercept")
                             ),
                             warmup = 5000,
                             iter = 5000000,
                             chains = 2,
                             inits = "0",
                             cores = 1,
                             thin=500,
                             seed = 123)

Being new to this type of analysis, I really struggled with MCMCglmm before, and I found no way to correct for phylogeny with repeated measurements in glmer-style packages.

I have a few novice questions:

  1. Is the model I employed suitable? I would appreciate any advice on syntax as well as on fitting.
  2. Should I incorporate random slopes? If so, how do I do that with all three fixed predictors?
  3. Is there a way to add weights to the model? (e.g., sampling event sample size ‘total_N’)
  4. Is there a function that runs all possible models (similar to ‘dredge’)?

Many thanks!

Hi - please take my answers with a grain of salt. I can’t comment on the phylogenetics, but I do have some comments on your model after a quick inspection.

In a setting where observations are patients, adding a varying slope could help improve in-sample predictions, although this might not generalize well. Finding covariates to pool on could help improve model fit and prediction.

It’s often recommended to add a varying slope for each varying intercept. Consider clustered data with different slopes per cluster: fitting a multilevel linear model with the same slope for every cluster could yield erroneous predictions if the true slopes vary. I forget the exact syntax, but I think it’s lmer(y ~ x + (1 + x | group)) for a varying intercept and slope of x across groups; see the sketch below.
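For reference, a minimal sketch of that syntax on simulated clustered data (all names here are made up for illustration, not taken from your data):

library(lme4)

# hypothetical clustered data: 10 groups, each with its own true slope
set.seed(1)
sim <- data.frame(group = rep(letters[1:10], each = 20), x = rnorm(200))
sim$y <- 1 + (0.5 + rep(rnorm(10, sd = 0.3), each = 20)) * sim$x + rnorm(200, sd = 0.5)

# varying intercept only: one common slope for x
m_int <- lmer(y ~ x + (1 | group), data = sim)

# varying intercept AND varying slope of x across groups
m_slope <- lmer(y ~ x + (1 + x | group), data = sim)

The same formula syntax carries over to brm() almost unchanged.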

Your priors seem too diffuse. Very diffuse priors on the logit scale place a lot of prior mass on extreme parameter values and can pull estimates toward +/- infinity. Consider priors closer to the range of your data, perhaps after standardization. More informative priors can also help sampler efficiency.
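One quick way to see how diffuse those defaults are (a rough sketch, not something from the original post) is to draw from the intercept prior and push the draws through the inverse logit:

# what does a student_t(3, 0, 20) prior on the intercept imply about the
# proportion of males when all covariates are at zero?
intercept_draws <- 20 * rt(n = 10000, df = 3)
prop_draws <- plogis(intercept_draws)
hist(prop_draws)  # most of the mass piles up near 0 and 1, i.e. extreme sex ratios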

The warm-up is high, and the number of iterations is really, really high. Models shouldn’t need to run for days. For a well-specified model, something like a warmup of 1,000 and around 10,000 total iterations is fine, and 1,000-2,000 iterations are often enough for debugging. A well-specified model should run in minutes, even with hundreds of parameters. Don’t take my word for it, this is a heuristic based on experience. This doesn’t cover hierarchical models with funnel geometries, which usually call for reparameterization or adjusted sampler settings rather than simply more iterations. A sketch of more typical settings is below.
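For illustration only (these numbers are placeholders, not recommendations for your particular data), the sampler arguments of the original call could be scaled back to something like:

# same formula, priors, and data as the original model; only the sampler settings change
fit <- brm(M_count | trials(total_N) ~ ALT.std + Lat.std + temp.std + (1 | phylo) + (1 | species),
           cov_ranef = list(phylo = A),
           data = dat,
           family = binomial(link = "logit"),
           warmup = 1000,
           iter = 3000,   # 2,000 post-warmup draws per chain, no thinning
           chains = 4,
           cores = 4,
           seed = 123)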

I don’t know what you mean. But there’s code for post-stratification with brms sitting somewhere in a gist, which is a kind of “weighting”.

I can think of infinitely many awful models. I don’t know what dredge is.

Hi - I also came to brms after struggling with complicated phylogenetic models in MCMCglmm, and for newbies like me the process suddenly became much clearer (plus Stan is great in numerous other ways!).

Sorry, I can’t really comment too much on your response variable setup - M_count | trials(total_N) - but from my personal experience, and from advice on here and elsewhere, I do not think you need to add “random” slopes to the phylo group-level parameter in order for it to accomplish what you want it to accomplish. Because you are using a covariance matrix and the cov_ranef argument, this group-level effect is being treated quite differently than a “regular” group-level effect (like species) in your model.

The species group-level parameter, however, may be different. I have only worked with about 1-4 samples per species. Estimating varying intercepts for each of those species, like you have done, has worked well for me as a way to include the interspecific variation in the model. If you have, say, 10+ measurements per species, then perhaps it’s a different story? Adding varying slopes will increase the runtime of the model a lot (but, as stated above, you still only need ~5,000 iterations max, and no thinning!) and may not ultimately make a difference. If you are new to brms syntax, then looking at this link - an introduction to lme4’s syntax - can help, as it is pretty similar to brms for the basic things. Maybe you could do (ALT.std + Lat.std + temp.std | species) for varying slopes for each predictor, but you’ll have to try; see the sketch below.
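A rough sketch of what that formula might look like in your model (whether it is identifiable with so few observations per species is a separate question):

# varying intercepts plus varying slopes of all three predictors across species,
# keeping the phylogenetic varying intercept as before
fit_slopes <- brm(M_count | trials(total_N) ~ ALT.std + Lat.std + temp.std +
                    (1 | phylo) + (1 + ALT.std + Lat.std + temp.std | species),
                  cov_ranef = list(phylo = A),
                  data = dat,
                  family = binomial(link = "logit"))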

Seconding @anon79882417’s comments on priors. In my experience, phylogenetic models in brms require regularizing priors. Every time. I’ve never had any phylo model fit well at all with default priors. The function get_prior() has become my best friend, and you should definitely run it for a model this complicated and make a conscious, thought-out decision for each and every parameter. This link goes over the basics of prior selection, and I read somewhere that folks want to update it soon.
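For reference, a sketch of how get_prior() can be called on this model to list every parameter that can take a prior:

# list all prior slots (classes b, Intercept, sd, ...) for the model at hand
get_prior(M_count | trials(total_N) ~ ALT.std + Lat.std + temp.std +
            (1 | phylo) + (1 | species),
          data = dat,
          family = binomial(link = "logit"))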

I’d recommend running all of this without the group-level effects, then slowly adding them in and closely looking at the posteriors using plot(your.model) and pp_check() plots to see if something blows up, and how; a sketch of that workflow is below.
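A minimal sketch of that build-up, assuming the data and covariance matrix from the original post:

# 1. fixed effects only
fit0 <- brm(M_count | trials(total_N) ~ ALT.std + Lat.std + temp.std,
            data = dat, family = binomial(link = "logit"))
plot(fit0); pp_check(fit0)

# 2. add the species varying intercept
fit1 <- brm(M_count | trials(total_N) ~ ALT.std + Lat.std + temp.std + (1 | species),
            data = dat, family = binomial(link = "logit"))
plot(fit1); pp_check(fit1)

# 3. add the phylogenetic varying intercept with its covariance matrix
fit2 <- brm(M_count | trials(total_N) ~ ALT.std + Lat.std + temp.std +
              (1 | species) + (1 | phylo),
            cov_ranef = list(phylo = A),
            data = dat, family = binomial(link = "logit"))
plot(fit2); pp_check(fit2)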

Not sure about question 3. For question 4, I imagine people on this forum would recommend thinking really hard about your question and what you want to estimate, and avoiding dredge-like, star-gazing thinking. That’s been a big takeaway from the Bayesian glmm world.


Thank you both for your detailed answers and for the great links.
However, a few points are still unclear to me:

  • Is running all the possible models (including full and null) and comparing them advised against in Bayesian glmm?

  • In my case, the data isn’t sampled equally (i.e., I have a different number of observations for each species, AND the “size”/“effort” of each sampling event may be different). Shouldn’t that be accounted for in the model? In the frequentist world this is important and requires adjustments in the model in the form of adding “weights” to each sampling event.

  • A syntax question - must I add “Intercept” to the model syntax to get the fixed-factor intercept modeled as well? So far I haven’t, but its statistics appear in the model summary etc.

Thanks again!

@talimc Do you mean running all possible combinations of the model you wrote out? I typically don’t do that in a Bayesian workflow. I have a model of how I think the world works based on previous experience, research articles, and conversations with experts.

Hi,

I am not exactly sure what you mean by “all models.” Generally you just want to figure out your exact question, like “What is the effect of Lat.std on M_count?” and then write a model that can answer that question. You don’t want to test every combination of variables while blindly looking for “significance.” You’ll probably find it, but it probably won’t mean anything!

I am a novice, but as far as I understand, by using “species” as a group-level effect your model learns from all of the information provided about each group (“species”). The model then uses that information from all groups to improve the estimates for each individual group. So the variance in the “species” with 10 samples helps the model estimate the variance for the “species” with one sample. This is how multilevel/hierarchical models work, and is not a frequentist vs. Bayesian thing.

In case you aren’t familiar, I would highly recommend @richard_mcelreath’s book Statistical Rethinking. The new second edition is great. This, along with the YouTube videos of his courses, can better explain all these wonderful things!! Learning about causality and DAGs will help you with your first question.

No, the intercept is included by default unless you remove it by adding a zero, as in y ~ 0 + x; this is made clear in the brm function documentation. See the small sketch below.
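A tiny illustration of the two formula forms (nothing else about the model changes):

# intercept estimated by default
fit_with_int <- brm(M_count | trials(total_N) ~ ALT.std + Lat.std + temp.std,
                    data = dat, family = binomial(link = "logit"))

# intercept removed explicitly
fit_no_int <- brm(M_count | trials(total_N) ~ 0 + ALT.std + Lat.std + temp.std,
                  data = dat, family = binomial(link = "logit"))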


@talimc

Maybe this modeling analogy will help.

It’s dark, your dog is sleeping, and you need to sew a blanket to put on your dog. You have a vague guess, and you’d like to know where your dog is, and that it’s snug after you’ve laid the blanket down.

So you have thread to sew the blanket with. Thread is expensive, it costs time, and time is more valuable than money. So, we take a first guess as to what dimensions to sew the blanket. I wouldn’t choose a really, really big blanket first because, frankly, I don’t like sewing blankets for dogs - it’s tedious and boring, and longer threads are more expensive. I’m lazy. Remember, time is valuable.

If I sew the blanket too big, I won’t know where my dog is and I could step on it in the dark; to find out where my dog actually is, I’d have to do a bunch of gentle patting from the edges of the blanket in the dark, and I still might not know where it is. Too small, and my dog is cold and I’m a bad pet owner, because I don’t know my dog that well. So I might sew a first, vague blanket to see where my dog lies, pat around the edges to see if I’ve sewn my blanket too big, and re-sew so that I end up with the blanket that fits my dog best.

As others have mentioned, 5,000,000 iterations per chain is overkill, as is thin = 500. If you’re getting a low effective sample size because of strong autocorrelation, then there’s probably something going on in how you specified the model. Try putting a weakly informative prior on your fixed effects (e.g. normal(0, 1)); a sketch is below. If that doesn’t help, try doing the same for the random effects.
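For illustration, this is roughly what that prior specification might look like in brms (the exact scales are assumptions you would want to justify for your own data):

# weakly informative priors on the logit scale
weak_priors <- c(
  prior(normal(0, 1), class = "b"),            # fixed-effect slopes
  prior(normal(0, 1.5), class = "Intercept"),  # overall intercept
  prior(student_t(3, 0, 1), class = "sd")      # group-level standard deviations
)
# then pass these via `prior = weak_priors` in the brm() call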

Given that your likelihood/family is binomial, the student_t(3, 0, 10) and student_t(3, 0, 20) priors on “sd” and “Intercept” are overkill too, because your expectation is in log odds. For example, a log odds of 20 for the intercept (which is within 1 sd of the prior - pretty possible) would mean that you expect the odds of your outcome to be 485165195, i.e. a 485165195/485165196 = 99.999999793% probability that your outcome occurs when all covariates are zero. TL;DR: your wide Student-t priors put a lot of density on extremely rare or extremely frequent outcomes, so constraining them might help. Afaik, if your outcome occurs anywhere near the 50/50 neighbourhood and you don’t expect the covariates to change the probability by more than a few tens of percentage points, then a normal(0, 1) prior should be fine.

Thinking about it, the extremely wide priors on the intercept could be messing up your sampling if there’s a correlation between the Intercept and the random intercepts (I’ve had that happen before). Picking more informed, regularizing priors could help a lot.


If my hunch about the priors is right, then you have a problem with collinearity, which is pretty frequent in mixed-effects models.

Imagine you’re sharing a flat with two of your friends, Bob and Megan. Each one of you has a partner & only spends a few nights a week at the flat. One day, you notice that you’re spending ridiculous amounts of money on power bills each week. You wanna find out who’s wasting all that electricity.

If you have the power bills for a couple of weeks, and records of how many nights each of you spent at the flat, you can run a really basic multiple regression, predicting Powerbill_week ~ Nights_Bob + Nights_Me + Nights_Megan. The predictor slopes in this case amount to how much each person spends per night. With a few weeks’ worth of data, you should be able to figure out just how much each person contributes to the power bill.

Now imagine that Megan and Bob spend the same number of nights at the flat each week. This is a problem. Why? Because the statistical model doesn’t know how to partition their contributions. For example, let’s say that after accounting for your contribution, the leftover joint contribution per week for Bob and Megan is $50, meaning that together, Megan and Bob are estimated to spend $50 on power. Now, how will you split that $50? The model might find it just as plausible that Megan spends $10 and Bob $40, that Bob spends $1 and Megan $49, or even that Bob spends $5,000,000 but Megan preempts that huge bill by paying the company $4,999,950 upfront each week.

Now, you as a human might say that the most likely scenario is that Bob and Megan spend about the same amount of money each week, and that anything more extreme either way is increasingly less likely. However, the model doesn’t know that. For the model, the human-friendly $25/$25 split is just as reasonable as the $Inf/-Inf + $50 split. That’s why maximum likelihood estimation methods commonly used in frequentist analyses go haywire in the presence of strong collinearity.

To explain to the model that the $Inf/-Inf + $50 split is crazy, you need to regularize. In a Bayesian context, that means using sensible priors to restrict the parameter space to sane values. For example, you might take the average power bill per week, divide it by three, and find that it’s $30. You might feel that it’s pretty unlikely that one of you spends less than $10 or more than $50, so you set a normal(30, 10) prior on your regression slopes. And voila, all estimates that stray away from $30 become increasingly unlikely. The model no longer considers someone spending Jeff Bezos’ yearly salary on power each week a sane option, and as a result it becomes better behaved; a small sketch is below.
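To make the analogy concrete, here is a small simulated sketch (mine, not from the posts above) of collinear predictors and a regularizing prior in brms:

library(brms)

# simulate 20 weeks where Bob and Megan always stay the same number of nights
set.seed(42)
weeks <- 20
nights_me    <- rpois(weeks, 3)
nights_bob   <- rpois(weeks, 3)
nights_megan <- nights_bob                      # perfectly collinear with Bob
powerbill <- 30 * nights_me + 25 * nights_bob + 25 * nights_megan + rnorm(weeks, sd = 10)
flat <- data.frame(powerbill, nights_me, nights_bob, nights_megan)

# with flat (default) priors, the Bob/Megan split is barely identified
fit_flat <- brm(powerbill ~ nights_me + nights_bob + nights_megan, data = flat)

# a regularizing prior says per-night costs far from ~$30 are implausible
fit_reg <- brm(powerbill ~ nights_me + nights_bob + nights_megan, data = flat,
               prior = prior(normal(30, 10), class = "b"))

Comparing summary(fit_flat) and summary(fit_reg) should show much tighter, saner estimates for the Bob and Megan slopes in the regularized model.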
