Model specification and approach: distributional/mixture models don't accurately recover the structure of the data

Hi,

I am working with a dataset in which I have measured traits on multiple plant species and populations. For most of the traits, I was able to fit appropriate models, and posterior predictive checks corroborated this; chains mixed well and converged (there were still a few divergent-transition warnings, but I am not too worried about those).

However, there is one trait where I have trouble figuring out the right approach. It is a continuous positive number and technically bounded, as it is the time until the midpoint of growth. Unlike the other traits I measured, however, it turned out to be multimodal. That in itself is fine, as there are biological explanations for it.

My first instinct was to try fitting a distributional model, since upon splitting the data by species, the density plots looked unimodal enough.

library(brms)

prior_norm = c(
  prior(normal(0, 1), class = "b"),
  prior(student_t(3, 3.4, 2.5), class = "Intercept"),
  prior(student_t(3, 0, 2.5), class = "sd"),
  prior(student_t(3, 0, 2.5), class = "Intercept", dpar = "sigma"),
  prior(student_t(3, 0, 2.5), class = "sd", dpar = "sigma"),
  prior(normal(0, 1), class = "b", dpar = "sigma")
)
f1 <- brmsformula(
  xmid ~ elev_Org_m * treatment * delta_elev +
    (1 | gr(phylo, cov = A)) + (1 | species:region),
  sigma ~ elev_Org_m * treatment * delta_elev +
    (1 | gr(phylo, cov = A)) + (1 | species:region),
  family = lognormal
)

# elev_Org_m and delta_elev describe the elevational origin of the species and plants

m <- brm(
  f1,
  data = d,
  data2 = list(A = A),
  prior = prior_norm,
  chains = 4, cores = 4,
  iter = 4000, warmup = 1000
)

However, this is what the pp_check() looks like:

[pp_check() density overlay plot]

While the posterior does display bimodality, it doesn't recover the actual structure of the data all that well. I have tried making the sigma formula simpler (just the species-level random effects), but it doesn't make much of a difference.

So after some searching I found that mixture models could be an alternative approach, which could also make sense biologically, as populations can contain slow and fast growers. As I cannot explicitly model this in a distributional-regression framework, I thought of letting a mixture model do the hard work for me.

f2 <- brmsformula(
  xmid ~ elev_Org_m * treatment * delta_elev +
    (1 | gr(phylo, cov = A)) + (1 | species:region),
  family = mixture(gaussian, gaussian)
)
# You will notice I use gaussian here: I only did it as a starting point, as it may make things easier.
# In general I use the lognormal distribution to model my traits, as all are restricted to be positive.
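For reference, the lognormal variant I would eventually move to might look like this (if I read the docs right, the order argument of mixture() imposes an ordering constraint on the component intercepts, which should help against label switching):

f2_ln <- brmsformula(
  xmid ~ elev_Org_m * treatment * delta_elev +
    (1 | gr(phylo, cov = A)) + (1 | species:region),
  # order = "mu" orders the component intercepts so the components stay identifiable
  family = mixture(lognormal, lognormal, order = "mu")
)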

prior_mix = c(
  prior(normal(0, 5), class = "b", dpar = "mu1"),
  prior(normal(0, 5), class = "b", dpar = "mu2"),
  prior(normal(18, 0.05), class = "Intercept", dpar = "mu1", lb = 5, ub = 70),
  prior(normal(40, 0.05), class = "Intercept", dpar = "mu2", lb = 5, ub = 70),
  prior(student_t(3, 0, 14), class = "sd", dpar = "mu1"),
  prior(student_t(3, 0, 14), class = "sd", dpar = "mu2"),
  prior(student_t(3, 0, 14), class = "sigma1"),
  prior(student_t(3, 0, 14), class = "sigma2"),
  prior(dirichlet(10), class = "theta")
)

m <- brm(
  f2,
  data = d,
  data2 = list(A = A),
  prior = prior_mix,
  chains = 4, cores = 4,
  iter = 4000, warmup = 1000
)

I read some of the other posts on convergence issues in mixture models, which is why my priors for the intercepts have such low standard deviations. At this stage, parts of the model converge well and others don't: I get very few divergent transitions, but high Rhat and low ESS values for some parameters.

[pp_check() plot for the mixture model]

Looks like I'm a bit closer here. I still think lognormal is the appropriate distribution, but since values on the log scale are smaller, further reducing the SDs of the intercept priors would make estimation unnecessarily hard…

I guess this question essentially boils down to mixture models and appropriate model (and prior) specification for them, so maybe I've provided too much needless info here (sorry!). My final question: where do I go from here, and how do I specify the mixture model better to get my models to work?

Thank you in advance, I appreciate any help.

  • Operating System: MacOS (Apple Silicon)
  • brms Version: 2.20.4

@scholz any ideas? Sorry to tag you here, but I'm still lost. Thanks in advance!

Hey, no worries re. the ping. Happy to help if I can.
Just for starters, could you give me a little more info on your data? How many observations, and what is the process you are trying to model?

I think a mixture model is a good approach if you can’t fit the bimodality with some kind of multilevel structure. Is there anything in your data (you mention that there is a biological explanation for this) that you could use to model the bimodality? Maybe using a nonlinear model based on that theoretical background?

You also mention that you have a double-bounded outcome, if my understanding is correct. In that case you could try a beta likelihood after transforming the outcome to (0,1), though only if your upper bound is constant across cases. It won't solve the bimodality problem, but in my experience the beta generally samples better than the lognormal, and you would properly account for both boundaries.
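As a rough sketch (T_max, the duration of your experiment, and the eps offset are placeholders):

eps <- 1e-3                                       # keep values strictly inside (0, 1)
d$y01 <- pmin(pmax(d$xmid / T_max, eps), 1 - eps) # constant upper bound across cases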

What parts of the model are struggling to converge? Depending on how many groups and cases per group you have, it might just be hard to estimate some of the group parameters.

And finally, the priors for the b's look rather wide. Have you run prior predictive checks? (Run the model with sample_prior = "only" and do a pp_check on it.) It might just be that you need stronger regularization for some model parts to tie them down. Conversely, the intercept priors could be too restrictive, not allowing the model to explore the parameter space sufficiently.
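Something like this (untested sketch, reusing your objects from above):

m_prior <- brm(
  f2,
  data = d,
  data2 = list(A = A),
  prior = prior_mix,
  sample_prior = "only",  # draw from the priors only, ignoring the likelihood
  chains = 4, cores = 4
)
pp_check(m_prior, ndraws = 100)  # prior predictive draws vs. the observed data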

Hope this helps a little. Let me know how it goes :)


Thanks for the response!

Some more info on the data: I have about 1220 observations in total, divided among species and further among populations within species. The smallest groups have, on average, 18 observations. The trait I estimated is essentially the time individuals take to reach the "midpoint" of their growth.

model the bimodality

I did try using species as an explanatory variable for a distributional model, since the distribution of the trait within each species looked "unimodal enough". However, on closer probing, some of the distributions are very wide, showing that within a species there is a wide range of phenotypes. Funnily enough, fitdistrplus::descdist() on my response says it is "uniform". I even used all my predictors of the trait as predictors for sigma (shown in the original post), but it doesn't give me a good fit. I think all the variables I controlled as the experimenter already went into the model.

nonlinear

While there could be a nonlinear process going on, my elevation values are in km, so their interactions become far too small, making convergence impossible.

beta likelihood

I actually did try the beta approach already! The upper bound should be the same for all groups, as it cannot exceed the duration of the experiment (they all reach an asymptote in growth). I took the inverse and rescaled it into (0,1), and here's what I got:

f3 <- brmsformula(
  scaled.inv.xmid ~ elev_Org_m * treatment * delta_elev +
    (1 | gr(phylo, cov = A)) + (1 | species:region),
  phi ~ (1 | gr(phylo, cov = A)) + (1 | species:region),
  family = brms::Beta()
)
m <- brm(
  f3,
  data = d,
  data2 = list(A = A),
  prior = prior_norm,
  chains = 4, cores = 4,
  iter = 4000, warmup = 1000
)


I was initially quite happy with this: chains converged well, ESS was high, and Rhat ~ 1. However, loo(m) tells me I have 3 problematic Pareto-k values.

Computed from 12000 by 1224 log-likelihood matrix

         Estimate   SE
elpd_loo   1375.8 42.8
p_loo        38.4  6.8
looic     -2751.7 85.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1221  99.8%   1117      
 (0.5, 0.7]   (ok)          0   0.0%   <NA>      
   (0.7, 1]   (bad)         2   0.2%   92        
   (1, Inf)   (very bad)    1   0.1%   26        

What parts of the model are struggling to converge? Depending on how many groups and cases per group you have, it might just be hard to estimate some of the group parameters.

Quite a lot of parts, actually: only mu1_Intercept has an Rhat of 1.00, mu2_Intercept is at 1.01, and the rest are all greater. As for group sizes, I estimated many other traits with similar numbers of groups and cases per group and didn't encounter any problems there, so I hope that isn't the issue.

priors for the b’s

I did try prior predictive checks, and after reducing the b priors and making the intercept priors less restrictive based on your comments, here is what I get:

prior_mix = c(
  prior(normal(0.5, 0.5), class = "b", dpar = "mu1"),
  prior(normal(0.5, 0.5), class = "b", dpar = "mu2"),
  prior(normal(18, 2.5), class = "Intercept", dpar = "mu1", lb = 5, ub = 70),
  prior(normal(40, 4.0), class = "Intercept", dpar = "mu2", lb = 5, ub = 70),
  prior(student_t(3, 0, 15), class = "sd", dpar = "mu1"),
  prior(student_t(3, 0, 10), class = "sd", dpar = "mu2"),
  prior(student_t(3, 0, 14), class = "sigma1"),
  prior(dirichlet(5), class = "theta")
)


[prior predictive pp_check plot: most yrep densities are flat]

Honestly, when I get these sorts of pp_check plots from the priors, I am kind of lost. The fact that most of the yreps are flat makes me think my priors aren't informative enough (is my interpretation correct?). It even looks the same with more restrictive Intercept priors. In general, these plots stump me because I don't know which part of the prior makes them behave this way.

Thanks again :)


Regarding the distributional regression: I don't have experience with it, but from what I've heard, it makes modeling harder. Do you need to model sigma for your purposes? Otherwise I would default to just leaving it alone.

You can transform your predictors: in this case, either convert them to meters or z-scale them. If your predictors are on very different scales, this might help the model sample better.
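For example (variable names taken from your posts):

d$elev_z  <- as.numeric(scale(d$elev_Org_m))  # z-scale; keep the originals for plotting
d$delta_z <- as.numeric(scale(d$delta_elev))
# then use elev_z * treatment * delta_z in the formula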

Not that it should matter much, but why did you take the inverse?

There are two ways to look at the Pareto-k values. The first is that they just tell you whether the PSIS approximation of the elpd works well; in that case you would use moment matching or reloo to make the approximation more robust. The other view is that they are a sign of (local) misspecification of the model, as your model is "surprised" by some of the observations. You could take a look at those observations and see if there is something special about them that you could include in the model somehow.
Have you looked at the predictions of the model (e.g. via conditional effects with method = predict and points = TRUE) to see if you can learn something useful from them? Overall, I'd be generally happy with that fit.
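In code, roughly (moment matching needs the model fitted with save_pars = save_pars(all = TRUE), if I remember correctly):

loo_mm <- loo(m, moment_match = TRUE)  # moment matching for the high-k observations
loo_re <- reloo(m, loo = loo(m))       # exact refits without each problematic observation
# and for looking at the predictions against the raw data:
plot(conditional_effects(m, method = "posterior_predict"), points = TRUE)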

We want the priors to encompass all reasonable datasets in our pp_check, and I think yours look kind of OK. I would argue there is a bit too much emphasis on very wide datasets spanning the entire length of your experiment (which might not even respect some of the boundaries of your data, but that is due to the likelihood in this case). You could start with an intercept-only model and then add predictors with growing effect sizes until you get what you are looking for. Admittedly, "what you are looking for" is kind of hand-wavy: you want datasets both wider and thinner than yours to be plausible under the priors, but with very little weight on unreasonable ones.
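As a sketch of that build-up (the formula names are just placeholders):

# Step 1: intercept-only mixture, checking the prior predictive at each step
f_step1 <- brmsformula(xmid ~ 1, family = mixture(gaussian, gaussian))
# Step 2: add the group structure
f_step2 <- brmsformula(xmid ~ 1 + (1 | species:region),
                       family = mixture(gaussian, gaussian))
# Step 3+: add population-level predictors one at a time, widening the
# effect-size priors only as long as the prior predictions stay reasonable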
One more trick is to look at histograms rather than densities, as the smoothing in the density estimation can make things appear different than they are.
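E.g.:

pp_check(m_prior, type = "hist", ndraws = 11)  # histograms avoid the density smoothing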

Overall I would stay on the (0,1) scale, z-scale all your predictors, and experiment there a little more. Maybe a logit-normal likelihood can catch the missing part a little better than the beta (there are also Kumaraswamy, simplex, and cauchit- and cloglog-normal likelihoods in the Bayesfam package that behave differently in some respects, so they might be worth a try). And regarding your model: do all the interactions in the model make sense? Parts of the full model may be hard to identify if there is little information for them, so working up from a simpler model can help with understanding.
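If you want to try the logit-normal without extra packages, one shortcut (a sketch, with the caveat that LOO values are then on the transformed scale and not directly comparable) is a gaussian on the logit of your (0,1) outcome:

d$logit_y <- qlogis(d$scaled.inv.xmid)  # logit transform of the (0,1) outcome
f_ln <- brmsformula(logit_y ~ elev_Org_m * treatment * delta_elev +
                      (1 | gr(phylo, cov = A)) + (1 | species:region),
                    family = gaussian())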


z-scale

I did start with that approach; however, after some discussion with my supervisor, we decided not to do this, as we lose a lot of "interpretability": we prefer real elevation values.

We want the priors to encompass all reasonable datasets in our pp_check. I think yours look kinda ok. I would argue that there seems to be a bit much of an emphasis on very wide datasets, spanning the entire length of your experiment

This is good to know; in my head, that's what I thought prior checks should look like. I got thrown off by these seemingly "weird" plots because some of my other prior checks had looked "better".

Anyway, at this stage I've had a bit of time to think, and I'm generally satisfied with the beta model. As for why I took the inverse: it was the first rescaling I thought of. It brings the trait closer to a "rate" and has the bonus of a good biological interpretation: smaller values mean slower plants, higher values mean faster plants. When I plotted the inverse, the histogram and density plots looked like a nice beta distribution.
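Concretely, the transformation was along these lines (the exact rescaling constants here are illustrative):

eps <- 1e-3
inv <- 1 / d$xmid                                              # larger = faster plants
d$scaled.inv.xmid <- (inv - min(inv)) / (max(inv) - min(inv))  # min-max to [0, 1]
d$scaled.inv.xmid <- d$scaled.inv.xmid * (1 - 2 * eps) + eps   # shrink into (0, 1)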

I normally don’t model scale/shape separately if I don’t need to, but for some traits it improves the fit - and it makes biological sense to add it where necessary (I have a multispecific dataset, and species differ in means but also in variances).

I will mark your earlier response as the solution, as I'm going the beta way. I have noted your other suggestions (re: other distributions, scaling, and perhaps building the model up predictor by predictor to catch any issues). I will surely revisit this soon, and if things change on my end I'll make it a point to update this thread: many people in my field might have datasets similar to mine, and I often cannot find highly specific Stan/brms material for it.

Thank you for the detailed discussion!


If you use conditional effects for interpretation, you can just transform the values back before plotting. And at least in the case of the km values, I think you can convert them to meters without any problems.
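A sketch, assuming you fit on a z-scaled elev_z and keep the original column around:

# plot on the z-scale, but relabel the axis in the original units
ce <- conditional_effects(m, effects = "elev_z", method = "posterior_predict")
p  <- plot(ce, points = TRUE, plot = FALSE)[[1]]
p + ggplot2::scale_x_continuous(
  name   = "Elevation of origin (m)",
  labels = function(z) round(z * sd(d$elev_Org_m) + mean(d$elev_Org_m))
)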

I think something that is sometimes overlooked is that you most definitely don't want your prior predictive checks to look like a fitted model, i.e. the prior(-predictive) distribution shouldn't look like your data. You mainly want to show your model what area of the parameter space is reasonable to look at, which yours does, apart from not respecting the boundaries and tending toward wide datasets imho; I wouldn't sweat that too much, as you seem to have plenty of data to move the posterior to where it belongs.

Thanks for the effort on your end :)
