New to brms and categorical modeling - dealing with messy biological data - insights and advice welcome

Hello everyone,

This is my first post here. I am a graduate-level mycologist specializing in fungal mating/nuclear behaviour and a beginner with brms and Bayesian modeling. These topics are far from my area of expertise, and there are limited resources at my institution to meaningfully assist with my work. I’ve been getting my teeth into these concepts for about 3 months.

I wanted to make a post here to begin a discussion about my work. Below, I’ll describe my data, model, and reasoning. I will try to be as brief as possible, but that will be difficult. All insights, critiques, and suggestions are very welcome.

To begin, I’ll paste my base model here:

fit <- brm(
  formula = Nuclear_type ~ Maturity + (1 | Replicate),
  family = categorical(),
  data = Fully_Expanded_Spore_Data,
  chains = 4,
  iter = 4000,
  cores = 4
)

This data consists of 11090 spores that I examined for how many nuclei they contain (single nucleus, two nuclei, three nuclei, or four nuclei). This is termed “Nuclear Type”. I found that over the course of an individual’s life it produces different proportions or ratios of these. This is a new phenomenon I’m describing, so meaningful priors are hard to find, and I’ve stuck with the default priors. I separated the maturity of individuals into three categories termed “Young”, “Middle”, and “Old”. In total, I examined 15 individuals (Replicate), and only 7 of them survived through all maturity categories, while the remainder only existed in 1 or 2 of them. Because of this unevenness, the individuals (replicates) are treated as a random effect.
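For reference, a quick way to look at the raw proportions the model is trying to explain (the column names below follow the model formula above):

# Raw counts and within-maturity proportions of each nuclear type
counts <- table(Fully_Expanded_Spore_Data$Maturity,
                Fully_Expanded_Spore_Data$Nuclear_type)
counts
round(prop.table(counts, margin = 1), 3)  # proportions within each maturity category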

I decided to go with this model because of how brms handles partial pooling: it draws information from the data as a whole rather than isolating individuals that are partially confounded with the maturity categories. It seems to handle unbalanced data with grace. Is my reasoning sound?

I’ve been able to conceptualize the log-odds coefficients and to manipulate them algebraically to examine all pairwise comparisons beyond the base model outputs. Is it worth showing this off, or should I just keep it in my appendix?
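If it helps, brms’ hypothesis() function can give those pairwise contrasts directly from the posterior draws, without doing the algebra by hand. A rough sketch, assuming the coefficient names follow brms’ usual mu<category>_ pattern for categorical models (check variables(fit) for the exact names in your fit before running this):

variables(fit)  # list the parameter names brms actually used

# Example contrast: Middle vs Old effect on the 2N-vs-reference log-odds
# (parameter names here are assumed; substitute the ones from variables(fit))
hypothesis(fit, "mu2N_MaturityMiddle - mu2N_MaturityOld = 0", class = "b")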

I was able to convert log-odds to predicted probabilities using these functions:

new_data <- data.frame(Maturity = c("Young", "Middle_age", "Old"))

preds <- fitted(fit, newdata = new_data, scale = "response", re_formula = NA)

Then I summarized the output in a tibble with point estimates. I’ve seen people use posterior_epred() more commonly, but fitted() seemed appropriate here for marginal estimates.
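For what it’s worth, here is a sketch of how the same marginal estimates could be pulled from posterior_epred() and summarized with intervals rather than only point estimates (the summary code is just one way to do it, not a claim about how your tibble is built):

# Draws on the probability scale, ignoring Replicate (population-level only);
# for a categorical model this is a draws x rows x outcome-categories array
epred <- posterior_epred(fit, newdata = new_data, re_formula = NA)

# Posterior mean and 95% interval for each Maturity x Nuclear_type combination
prob_mean  <- apply(epred, c(2, 3), mean)
prob_lower <- apply(epred, c(2, 3), quantile, probs = 0.025)
prob_upper <- apply(epred, c(2, 3), quantile, probs = 0.975)
rownames(prob_mean) <- new_data$Maturity
round(prob_mean, 3)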

Let me know what you guys think! Again, thank you for taking the time to read over this; I’m in the weeds.

Cheers!

Ben Bohemier

MScF Candidate

Lakehead University

Hi Ben,

Welcome to the community!

Here are a few questions and comments that occurred to me as I read through your post. I don’t know anything about mycology, but hopefully some of this will be useful.

  1. Your model treats Nuclear_type as a categorical outcome. Since this is a count of nuclei, would it make more sense to treat Nuclear_type as an ordinal outcome?

  2. Is Maturity coded as an ordinal variable in the model? If not, should it be? And, if it is ordinal, do you expect its effects to be monotonic?

  3. Also, where do the Maturity categories come from? Are you taking some continuous measure(s) or a more fine-grained set of categories and then collapsing them into the maturity categories, or are there really just three natural categories? If the former, then collapsing down to three categories throws away information (see here, for example).

  4. You said that 7 of 15 individuals survived through all of the maturity categories. Given this, would it make sense to account for the effect of censoring in your model?

  5. Regarding priors: You probably know more than you think about the possible values for your model parameters. Are there some ranges of values for the Maturity parameters that give physically implausible/impossible values for the outcome?

    Just as a hypothetical example for illustration: if the parameters for Young, Middle, or Old are greater than, say, 10, do you end up with an implausible or impossible distribution of Nuclear_type (say, 99% with 4 nuclei)? You can check this with prior predictive simulation and set your priors accordingly. For example, if you did decide that parameters greater than 10 are highly implausible, you could set a prior of, say, normal(0, 5). I’m not sure how informative your data are. You have lots of observations (spores), but only 15 individuals. A set of priors that excludes impossible outcomes might help give you more stable and generalizable inferences.

    In any case, it’s unlikely that the default flat prior for the fixed-effect parameters is optimal. You’ll likely get better inferences if you at least use a weakly informative prior. Here are some recommendations on setting priors. A rough sketch putting points 1, 2, and 5 together follows after this list.
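Purely for illustration, here is what points 1, 2, and 5 could look like combined; the cumulative() family, the mo() term, and the normal(0, 5) prior are all things to evaluate against your own data, not settings to copy as-is:

# Treat the outcome and the predictor as ordered factors (level labels assumed)
Fully_Expanded_Spore_Data$Nuclear_type <-
  factor(Fully_Expanded_Spore_Data$Nuclear_type,
         levels = c("1N", "2N", "3N", "4N"), ordered = TRUE)
Fully_Expanded_Spore_Data$Maturity <-
  factor(Fully_Expanded_Spore_Data$Maturity,
         levels = c("Young", "Middle", "Old"), ordered = TRUE)

fit_ord <- brm(
  formula = Nuclear_type ~ mo(Maturity) + (1 | Replicate),  # mo() = monotonic effect
  family = cumulative(),
  prior = prior(normal(0, 5), class = "b"),  # weakly informative, placeholder scale
  data = Fully_Expanded_Spore_Data,
  chains = 4,
  iter = 4000,
  cores = 4
)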


Hey Joel,

Thanks for getting back so soon and providing insightful feedback and questions. Again, I’m not an expert, so I’ll try to answer as well as I can.

  1. Good catch, and I wasn’t super clear on how I organized my data. What I ended up doing was organizing the data so that each spore I observed occupies a single row. Therefore I ended up with 11090 rows, with columns that describe Individual, Maturity category, and Nuclear type (1N, 2N, 3N, or 4N). This assigns each spore to a nuclear type category, so to speak. Although the nuclear types are numerically ordered (1–4), they don’t occur in a stepwise biological progression; instead, their variability is driven by the orderly development of individuals.

  2. It is not handled as an ordinal predictor. The term “monotonic” is new to me, and after learning what it means, I do expect the maturity effects to be monotonic. There is an orderly progression from young to old that is not evenly spaced: the jump from young to middle is larger than the jump from middle to old.

  3. Maturity categories are in fact collapsed from a finer scale. The reason I did this is that I had initially planned to sample every day, but this was not possible due to lab complications, etc. If I had more time, and if this were a PhD rather than a master’s, I would have sampled more, but unfortunately I needed to move forward with what I had. The maturity categories are ranges of days and are “artificial”: young (1-3 days old), middle (4-8 days old), and old (9-10+ days old). I did this to represent my data more cleanly, as sampling was quite sporadic but could be contained within these categories. I’m not sure how else I could do it. I acknowledge that yes, I may be losing some information, but for this degree I think it will suffice. Unless you have other suggestions!

  4. The individuals that were not sampled in all 3 categories were sometimes sampled only in young, in young and middle, or only in old. This was because either the individual died or it was only sampled once it was quite developed. What I’ve done to account for this is include a random effect termed “Replicate”, i.e. (1 | Replicate). I did this to try to account for the fact that there may be an effect driven by which individuals exist in given maturity categories, if that answers your question at all. From what I understand about censoring from just reading about it, I don’t know if it’s the best fit here.

  5. Thanks for that suggestion. I think that could be a good approach to stabilize the model (a sketch of what that prior predictive check could look like follows this list). There does seem to be a paper from China that describes a similar phenomenon in this species, but with very little detail. Would it help to use the information they provided even if it only discusses proportions of one nuclear type?
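A minimal sketch of the prior predictive check Joel mentioned; the normal(0, 5) prior is just a placeholder, and the dpar names depend on how the nuclear type levels are labelled, so check get_prior() first:

# See exactly which prior slots this model has
# (dpar names come from the Nuclear_type levels)
get_prior(Nuclear_type ~ Maturity + (1 | Replicate),
          family = categorical(), data = Fully_Expanded_Spore_Data)

# Sample from the priors only (no likelihood) to see what outcome
# distributions they imply before the data come in
prior_fit <- brm(
  formula = Nuclear_type ~ Maturity + (1 | Replicate),
  family = categorical(),
  prior = c(prior(normal(0, 5), class = "b", dpar = "mu2N"),  # assumed dpar names
            prior(normal(0, 5), class = "b", dpar = "mu3N"),
            prior(normal(0, 5), class = "b", dpar = "mu4N")),
  data = Fully_Expanded_Spore_Data,
  sample_prior = "only",
  chains = 4,
  iter = 2000,
  cores = 4
)

# Compare prior-implied category frequencies with what is biologically plausible
pp_check(prior_fit, type = "bars")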

Additionally, the total counts of spores varied each day that I counted for a specific individual at a specific maturity, so that drives more unevenness. Overall, this data was collected while I was trying to develop a new protocol under time constraints and also trying to meaningfully describe a complex phenomenon; it got messy. But it appears that there is still valuable information here. Thanks again; I hope you now have a better sense of my model and dataset.