I am an evolutionary biologist trying to fit a model (brm.count) in which:
the response variable is the count of a phenotypic structure (e.g., toes) that a species possesses (trait.count);
the population-level effects are two continuous variables and their interaction (continuous1 and continuous2);
intercepts can vary across the group-level effect species, with species correlated according to the phylogenetic covariance matrix (a sketch of the full call is below).
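Roughly, the call looks like this (object names such as d and phylo_tree are placeholders; the covariance matrix is passed via data2, or cov_ranef in older brms versions):

```r
library(brms)
library(ape)

# Phylogenetic variance-covariance matrix from the tree
A <- vcv.phylo(phylo_tree)

brm.count <- brm(
  trait.count ~ continuous1 * continuous2 + (1 | gr(species, cov = A)),
  data   = d,                 # one row per species
  data2  = list(A = A),       # makes A available to gr(..., cov = A)
  family = poisson()
)
```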
Since the response variable is count data, I opted for a Poisson distribution. There is slight overdispersion in the data (index of dispersion, \sigma^2/\mu = 1.1) and 78% of the observations are zeroes. This is what the distribution of counts (trait.count) looks like:
Besides underestimating the number of ones, the model seems to be overestimating the number of zeroes. This is further evidenced by checking the proportion of zeroes in samples from the posterior predictive distribution, compared to the observed proportion.
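For concreteness, this check amounts to something like the sketch below (d being the data frame holding trait.count):

```r
library(bayesplot)

y    <- d$trait.count                      # observed counts
yrep <- posterior_predict(brm.count)       # matrix: posterior predictive draws x observations

# Observed proportion of zeroes vs. its distribution across predictive draws
ppc_stat(y, yrep, stat = function(x) mean(x == 0))
```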
My question is: how come the dispersion of my data (1.1) is inflated to ~2 by a model that assumes no overdispersion (index = 1)? Shouldn't I expect the opposite: for dispersion to be underestimated? Am I missing something?
(By the way, I also fitted this model with zero-inflated Poisson, negative binomial and Poisson with observation-level random effects (1|obs), and the issue persists.)
Operating System: macOS 10.12.6
brms Version: 2.13.0
Thank you so much for any help!
UPDATE: I ran ppc_stat() with variance and mean to check the source of the overdispersion. Turns out the mean is pretty accurately estimated, while the variance is overestimated (0.3 in the observed data, ~0.6 in the posterior predictive samples).
That's interesting. I can't answer your question, but here are a couple of thoughts.
The model has a few parts, as you explain: your covariates, the varying intercepts, and the correlation structure for the latter. Have you tried removing parts one at a time and re-evaluating the results? I might start with just an intercept, y ~ 1, plot with pp_check(), and add parts back in one at a time to see how the predictive distribution evolves.
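Something along these lines, roughly (assuming your data frame is d):

```r
# Intercept-only model first, then add terms back one at a time
fit0 <- brm(trait.count ~ 1, data = d, family = poisson())
pp_check(fit0, type = "bars")     # bar plots are easy to read for discrete counts

fit1 <- brm(trait.count ~ continuous1, data = d, family = poisson())
pp_check(fit1, type = "bars")
# ...and so on, up to the full model with the varying intercepts and covariance matrix
```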
Another possibility is that the priors are not what you're expecting. I know brms has good (generally "safe") default priors, but that might be worth looking into as well.
I followed your suggestion, fitting the simplest possible model first, and then gradually making it more complex.
In the figure below, you can see the dispersion of posterior predictive samples from increasingly complex models. The simplest model (~ 1) predicts dispersion exactly as I would expect under a Poisson distribution, with the mean roughly equal to the variance (\sigma^2/\mu ≈ 1). Adding one predictor brings it up to 1.1, and adding two predictors brings it up to 1.2 (roughly the dispersion in the observed data). The interaction term doesn't seem to affect dispersion, with or without the additive terms.

The group-level effect + correlation structure, on the other hand, affects it substantially; the dispersion predicted by this model is similar to the one predicted by the full model in the original question (\sigma^2/\mu ≈ 2). Since each observation refers to a single species, I just realized that the group-level effect (1|species) might be acting like an observation-level effect, which is frequently used to model residual variance in count data. Might this explain the inflated overdispersion?
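(The dispersion check itself is just the index of dispersion as a custom ppc_stat() statistic; roughly:)

```r
# Index of dispersion of each posterior predictive draw, compared to the observed value
dispersion <- function(x) var(x) / mean(x)

y    <- d$trait.count
yrep <- posterior_predict(fit)      # fit = whichever model in the sequence above

ppc_stat(y, yrep, stat = "dispersion")
```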
I believe the overdispersion in the models is likely arising from predicted high counts (> 3) that are not present in the observed data; the maximum predicted counts (below) are an indication of that. Zeroes appear to be overestimated as a consequence: they keep the estimated mean accurate despite these high counts (it could also be the other way around, of course).
The model in the original question and the simulations above all take flat (default) priors for effects. I tried using more informative priors (normal(10, 0)) for the full model, but things didn't change.
Hopefully this sheds some light on what might be going on!
I'd be curious to see what happens to the pp_check() with y ~ (1|group), dropping the covariance matrix. I've been wondering if it's the covariance matrix, because correlated data has higher variance than uncorrelated data. Maybe it's a problem with your phylogeny data, or maybe the correlations just aren't as strong as the phylogenetic structure would suggest.
It makes some sense to me that the maximum value from the predictive distribution is always higher than the observed maximum; that may be a good thing, because it would be highly unlikely to observe the highest value out there in the world if you're predicting outside the observed set of species.
But I've never worked with phylogenetic data! @maxbiostat might have some ideas.
Looks like the covariance matrix indeed adds a lot of variance, as you suspected. What you're suggesting is that, for some reason, the phylogeny might not be fully capturing the correlation among the data, thus leaving part of the variance unaccounted for?
(By the way, I don't even know if this is an issue at all! I just thought I'd ask for help because the overdispersion thing was bugging me, but maybe this is expected in Poisson GLMMs with a covariance matrix?)
What I'm wondering is whether your data has less phylogenetic correlation than you're expecting, and that's why the predictive distribution doesn't cover the data once you add in the covariance matrix, though I'm just going on some rough intuition. For now it looks like the uncorrelated structure is much more plausible than the correlated model.
Have you used the phylosignal R package or something similar? They provide a method for measuring the strength of phylogenetic correlation (using Moran's I, from spatial statistics). If it's really low, that might explain what's happening. I'd also still entertain the notion that there's something wrong with the covariance matrix (some bad data or a data-cleaning process gone wrong), if I were finding something similar and thought it was really unexpected.
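If you haven't, a rough sketch with ape (assuming the tree is phylo_tree and the rows of d are in tip order):

```r
library(ape)

# Weights: inverse patristic distances between species, zero on the diagonal
w <- 1 / cophenetic(phylo_tree)
diag(w) <- 0

# Moran's I for the observed counts
Moran.I(d$trait.count, w)
```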
Also (edit): I'm not very familiar with the phylogenetic models implemented here in brms, and I'm trying to pick some of this up, but it looks like the amount of variance inflation induced by the correlation structure has to follow the scale parameter for the varying intercepts. That really could explain the strange behavior. The spatial statistics literature addresses this by splitting this into two separate terms, one that is (spatially) structured and one that is unstructured, each allowed to have its own scale. That's why at first it didn't make sense to me but looked familiar, because I work with spatial models.
So I suppose what this means is that if you have only modest phylogenetic correlation (or none) but still have substantial variation across species (hence substantial variation across the varying intercepts (1|species)), this model with the covariance matrix could cause problems.
So, my data has considerable phylogenetic signal (i.e. data are phylogenetically clumped), as measured by Moran's I or any other metric. I was prepared to agree that there must be something wrong with the covariance structure, as you suggested, even though I can't really imagine where it might have gone wrong. The phylogeny itself comes from a widely used source (birdtree.org) and was transformed into a variance-covariance matrix with ape::vcv(), which is standard practice as far as I'm aware.
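That is, essentially (tree object name assumed):

```r
library(ape)

# Brownian-motion variance-covariance matrix from the birdtree.org tree
A <- vcv(phylo_tree)            # vcv(phylo_tree, corr = TRUE) would give correlations instead

# Sanity check: tree tips should match the species in the data
all(rownames(A) %in% d$species)
```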
But then I decided to test whether phylogenetic signal plays some role in inflating variance, as you imagined, by simulating some data. I simulated data under a Poisson distribution in the presence or absence of phylogenetic signal. The Poisson lambda is based on a subset of the data in the first post. For each data set, I fitted a brms model predicted by (1|species) and by (1|gr(species, cov = A)). Here's what I found:
Turns out that phylogenetic signal does influence variance, but in the opposite direction! The covariance matrix is only inflating variance in the presence of phylogenetic signal (i.e., phylogenetic correlation).
I'm clueless! Could it be that the phylogenetic correlation is so high that the model can't handle it even with a covariance matrix, thereby affecting the model's accuracy? But then how come the model's prediction is spot-on without the covariance matrix?
It's not clear to me how you simulated the two data sets, but one thing I notice is that the histograms for the data with and without signal look the same. If you've managed to simulate an outcome with strong autocorrelation, I would expect it to have more zeroes and some more large numbers, relative to the data without that signal.
Out of curiosity, what's the value of the Moran's I for your data? Then, if you take the posterior mean of the varying intercepts for your model, what's the Moran's I for that vector (with and then without the covariance matrix)?
I feel a bit stumped, but that birdtree website is pretty cool.
I simulated the no-signal data set (I = -0.01) by simply drawing values from a Poisson distribution and randomly assigning them to species.
I didn't really simulate the with-signal data set; I just manually manipulated a few of the original values so that the proportions would be equal to the no-signal data set, but with a higher I (0.14). (This is a manipulated subset of the data; the complete original data had I = 0.31.)
Following your instructions, the "posterior" Moran's I is 0.26 for the with-signal data with the covariance matrix, and 0.14 without the matrix.
A similar pattern is found for the no-signal data: posterior I is 0.31 with the matrix and -0.01 without it. But it doesn't seem to affect the model accuracy as assessed with pp_check() in this case.
It looks like you're able to identify that when you add that covariance matrix into the mix, the autocorrelation in the varying intercepts increases beyond what's present in the data, and by quite a bit. That matches what you found to begin with in the first post, and is close to explaining it, since the autocorrelation would create the variance inflation, pushing the predictions away from the observations. It doesn't seem to me that you're doing something wrong; I think you're finding that the model really isn't right for your data.
> A similar pattern is found for the no-signal data: posterior I is 0.31 with the matrix and -0.01 without it. But it doesn't seem to affect the model accuracy as assessed with pp_check() in this case.
The second observation you make in that quote above still seems strange, maybe more so now. But maybe not, since the variance inflation is going to be a product of both the degree of correlatedness (in this case it's kind of fixed by the matrix you pass in, right?) and the scale of the varying intercepts. Drawing on spatial models, you can have an autocorrelation component with very high autocorrelation yet a small scale, in which case it will have minimal impact on the predictions. That dynamic might be relevant here (because it's being restricted), but it looks like both of the sample data sets you're working with there have the same or similar scale.
I'm surprised this hasn't come up before, though maybe it has somewhere else.
At least relative to what I'm used to seeing in spatial data, those are fairly low values of Moran's I. 0.31 is maybe "moderate" but 0.14 is quite low.
I've been reading and re-reading your answer since yesterday, and I think it makes total sense. But I kept coming back to the fact (which you also found strange) that pp_check() indicated that both models were just as accurate, even though one of them got the autocorrelation so wrong. This made me wonder: might pp_check(), by itself, simply not be a good enough diagnostic in my case?
Here's what I thought: imagine I draw values from a Poisson distribution (with lambda based on the observed data's mean) and randomly assign them to the species in my data set, just like I did to generate the simulated no-signal data set. Now imagine this vector is a posterior predictive sample from a model fitted to my data. This is the histogram of the "predicted" data, compared to the observed data (the equivalent of a pp_check()):
Not too bad, right? But if we look at how accurately each value was "predicted", we see that only 56% of them were accurately "predicted", with the remaining 44% differing from the observed data by 1 or 2. This goes to show that an overall good-looking predictive distribution may mask within-observation inaccuracies.
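In case anyone wants to reproduce this little exercise, it boils down to something like:

```r
set.seed(1)

y   <- d$trait.count
sim <- rpois(length(y), lambda = mean(y))   # one "fake" predictive sample, ignoring all structure

mean(sim == y)          # proportion of observations "predicted" exactly (~56% in my case)
table(abs(sim - y))     # how far off the misses are (mostly by 1 or 2)
```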
So, I decided to look at potential within-observation inaccuracies in the posterior predictive samples of y ~ (1|species) and y ~ (1|gr(species, cov = A)). Here's the result:
Turns out that, even though the overall distribution looks worse when the covariance matrix is added (upper panels in post #7's figure), the model is more accurate at predicting the count of specific observations (species). In the case of the predictive samples used for the figure above (one per model), 71% of the values are accurately predicted with the covariance matrix, versus 57% without it.
Of course, all of the rest stands: it's still strange that the variance is inflated, and your explanation for that sounds likely. But maybe the model with the covariance matrix is not as bad as it seemed at first glance.
Thanks for sharing. I think you have sound logic here, but let's try to specify what that is and follow through with it.
I see a couple of strands of thinking coming together. One is about predictive accuracy/error. Here you've pulled out one sample from the posterior predictive distribution of each model and done a comparison. But that's just a sample of one, so you'll want to repeat that some number of times. So why not put that into ppc_stat() and see what you get?
The other element here is the look of the observations relative to the predictive distribution, in your pp_check() plots. When we see the data falling somewhere outside the range of the model predictions, we get the sense that the model is systematically wrong, so that the data are improbable realizations of the model. The idea is that the probability of these data arising from that model appears to be low.
It seems like there should be some correlation between these two concepts. Here's how the loo package describes the widely applicable information criterion:
> Leave-one-out cross-validation (LOO-CV, or LOO for short) and the widely applicable information criterion (WAIC) are methods for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values.
The reference to the log-likelihood is the probability of the data conditional on the fully specified model and all its estimated parameters: say you have a Poisson model with \mu_i = exp(\alpha + x_i \beta); then the likelihood of an observation is Poisson(y_i | \mu_i). But we have to repeat this for all M posterior samples in order to see results across the joint posterior distribution, so every observation gets all of these likelihoods, Poisson(y_i | exp[\alpha_j + x_i \beta_j]), and then we repeat for all observations too. So I think pp_check() is like a shortcut that visualizes something like this concept for us. If you have predictions sitting far away from observations, as in a density overlay, you're seeing points with low likelihood.
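In brms terms, that pointwise log-likelihood matrix is what waic() and loo() work from; roughly:

```r
# One row per posterior draw, one column per observation
ll <- log_lik(fit)
dim(ll)        # e.g. 4000 x N

# Both criteria are computed from this matrix
waic(fit)
loo(fit)
```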
So if you haven't done so already, I'd also try WAIC on the models you're considering for your data and see what happens.
I will answer/comment on the topics you brought up below, but I want you to feel comfortable just not responding if that's what you prefer. You've already been so helpful and I don't want to impose on your goodwill.
I ran ppc_stat() with the percentage of correctly predicted observations (as per the function pct_correct() below) for each of the models. I wasn't sure what metric would best summarize this "within-observation" accuracy; let me know if you think of a better one. The model that includes the covariance matrix does seem to be consistently more accurate if we look at the % of correctly predicted observations, though:
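Roughly, pct_correct() is a statistic along these lines (sim_data is a placeholder for the data the two models were fitted to; note that the "observed" value of this statistic is trivially 1, so only its distribution across predictive draws is informative):

```r
y <- sim_data$trait.count

# Proportion of observations whose count a predictive draw reproduces exactly
pct_correct <- function(x) mean(x == y)

ppc_stat(y, posterior_predict(brms_sim),    stat = "pct_correct")   # (1|species)
ppc_stat(y, posterior_predict(brms_sim_gr), stat = "pct_correct")   # (1|gr(species, cov = A))
```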
I had come across LOO and WAIC reading brms vignettes, but hadn't quite understood the underlying logic, so thank you! I hadn't tried using these methods with my data because I felt that my issue precedes it. That is, both candidate families for my data (Poisson and negative binomial) produced strange-looking posterior predictive distributions, so I felt that there was no point comparing their (apparently bad) fit to the data before resolving this issue.
But I hadn't thought of looking at what these methods would tell me about the covariance matrix dilemma. I ran loo::waic() with both models (brms_sim and brms_sim_gr from the previous post = the ones used in ppc_stat() above), and then loo::loo() with reloo = TRUE as suggested by WAIC's output.
loo(brms_sim_gr, brms_sim, reloo = TRUE)
Both methods produce very similar outputs that indicate that the model with the covariance matrix is a better fit (as seen in Model comparisons at the bottom of the output):
Output of model 'brms_sim_gr':
Computed from 4000 by 136 log-likelihood matrix
Estimate SE
elpd_loo -80.9 7.6
p_loo 9.1 1.5
looic 161.9 15.3
------
Monte Carlo SE of elpd_loo is 0.1.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 121 89.0% 154
(0.5, 0.7] (ok) 15 11.0% 646
(0.7, 1] (bad) 0 0.0% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
Output of model 'brms_sim':
Computed from 4000 by 136 log-likelihood matrix
Estimate SE
elpd_loo -110.7 8.1
p_loo 5.7 0.8
looic 221.4 16.2
------
Monte Carlo SE of elpd_loo is 0.1.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 127 93.4% 1519
(0.5, 0.7] (ok) 9 6.6% 1952
(0.7, 1] (bad) 0 0.0% <NA>
(1, Inf) (very bad) 0 0.0% <NA>
All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
Model comparisons:
elpd_diff se_diff
brms_sim_gr 0.0 0.0
brms_sim -29.8 2.9
Ah, it dawned on me as soon as I saw your reply that I probably gave an unhelpful suggestion with using ppc_stat() in that case.
But with WAIC, I meant that you could use waic() or loo() not for the simulated data but for model comparison with your real data. As in, you might fit the 1|group model and then refit after adding in the covariance matrix, then apply waic() or loo() to each to compare the two models for your data.
First - this is a great thread. Good questions asked, sensible answers given, lots of work invested in getting things right from everybody. Pleasure to read! I'll try to add some of my thoughts:
In my understanding, LOO (or the XXIC crowd) metrics provide very different information than PP checks, although there definitely is some correlation, at least in that a bad model is likely to fail PP checks AND have bad LOOIC. PP checks are a qualitative measure and can tell you if your model is systematically wrong/overconfident in some parts of the data. The model can have small error between prediction and data but underestimate the error and thus fail a PP check, or you could have a model that has large error, but the error is not systematic and the magnitude of the error is estimated accurately, which will pass a PP check just fine.
LOO, on the other hand, tries to quantitatively estimate predictive performance on yet-unseen data. I understand LOO less well than PP checks, but while it will penalize models for being overly certain (underestimating error), it will also penalize models for being overly uncertain or for being heavily influenced by a few individual observations (that's what the "leaving one out" part is for).
In the case here, I think PP checks are a better tool to understand what is happening, because they let you be specific. I would also potentially look into doing the _grouped version of some of the PP checks, grouping by species or some intermediate taxonomic unit (with some random smaller subset of species to keep the plot legible). Does the pattern repeat at lower levels? Or are there taxonomic groups that are systematically under/overestimated?
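For example, something like this with bayesplot (the order column name is just a guess):

```r
library(bayesplot)

y    <- d$trait.count
yrep <- posterior_predict(fit)

# Grouped bar-plot check by taxonomic order; subset the draws to keep the plot legible
ppc_bars_grouped(y, yrep[1:100, ], group = d$order)
```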
I also think the problem might really be in the response distribution: Poisson, and probably also the negative binomial (not sure here), might be a terrible model for the count of toes or similar. The problem with the Poisson is that if you have a lot of zeroes but also some ones and twos, you end up having to give non-negligible probability to larger counts as well. I think the negative binomial is less restricted in this sense, but once again, it might not be able to handle the sharp drop from the ones and twos to basically non-occurring values > 2 (and no occurrences for > 3).
In other words, AFAIK you basically can't have a Poisson or negative binomial give high probability to 1 or 2 while also not giving noticeable probability to 3 or 4 (because their variance is at least the mean). I think this is what your model struggles with: to fit the absence of 3s and above, it needs to decrease the mean, which leads to overestimating the zeroes.
I am less sure what a good distribution for this case might be... I guess zero inflation is likely, because the causes of not having a trait at all seem likely to differ from the causes driving changes in the number of the trait. If you can define some sort of "opportunity" for a trait, then a (zero-inflated?) binomial distribution might not be terrible - i.e. if I am interested in the number of toes used for walking, then modelling it as binomial with the total number of toes as the number of trials and the number of toes used for walking as successes might be sensible (sorry if this is a bad example, my understanding of biology is quite limited). Alternatively, maybe there is a different source for a defensible strict upper bound on the number of the trait, making the binomial more attractive.
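As a sketch only, with a hypothetical per-species upper bound max_spurs as the number of trials:

```r
fit_zib <- brm(
  trait.count | trials(max_spurs) ~ continuous1 * continuous2 + (1 | gr(species, cov = A)),
  data   = d,
  data2  = list(A = A),
  family = zero_inflated_binomial()
)
```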
I also find it likely there is some literature, potentially quite old, on good distributions for these kinds of quantities. There is a huge literature on "best" distributions for various data. Fun fact: in older papers (I guess 80s and before) they often use the term "law" instead of "distribution" - so this might help your googling :-). You might want to look for some "underdispersed" distributions (my guess is that the low overdispersion in the total data means that, after accounting for covariates, the response is almost certainly underdispersed). Quick googling turned up stuff like this:
Many of the suggestions would, however, require you to implement the distribution yourself, and my experience with trying to implement some _over_dispersed Poisson variants from the literature in Stan is that this can be a challenge of its own (Bayesian inference has somewhat bigger requirements on the distributions behaving "neatly" than many methods for frequentist estimation). Depending on how big a penalty you place on being wrong, it might be the case that the effort required is not worth the (uncertain) improvement in model fit.
You might even want to try an ordinal response (see the cumulative family) - but once again, this forces you to set a strict upper bound. Or you could even fit a continuous positive distribution (gamma/lognormal) to (count + 0.5) - those should allow for this "underdispersion", so you could be getting a less theoretically appealing model but with - in some sense - a better fit (heretical, I know :-)). Once again, your circumstances should determine whether that is a trade-off worth taking.
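Rough sketches of both ideas:

```r
# Ordinal response: treat the counts as ordered categories
d$count_ord <- factor(d$trait.count, ordered = TRUE)
fit_ord <- brm(count_ord ~ continuous1 * continuous2 + (1 | gr(species, cov = A)),
               data = d, data2 = list(A = A), family = cumulative())

# Continuous positive response fitted to count + 0.5
d$count_shift <- d$trait.count + 0.5
fit_ln <- brm(count_shift ~ continuous1 * continuous2 + (1 | gr(species, cov = A)),
              data = d, data2 = list(A = A), family = lognormal())
```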
Finally, you can always fit multiple models, and if the important inferences stay the same, you don't have to worry much about which of them is "most" appropriate (see also: multiverse analysis).
> LOO, on the other hand, tries to quantitatively estimate predictive performance on yet-unseen data. I understand LOO less well than PP checks, but while it will penalize models for being overly certain (underestimating error), it will also penalize models for being overly uncertain or for being heavily influenced by a few individual observations (that's what the "leaving one out" part is for).
The key being the distinction between performance on unseen data and in-sample error, which is where I probably gave poor guidance.
I was automatically going with the simulated data because there are fewer observations, meaning the models are much faster to fit, but you're absolutely right that I should be looking at the real data. From here on in this post, I will always be referring to the real data (i.e., trait.count in the first post).
So, here's the model comparison from loo() for the models with and without the matrix. As with the simulated data, LOO indicates that the model with the covariance matrix (brm_count_cov) is a better fit for the real data:
Model comparisons:
elpd_diff se_diff
brm_count_cov 0.0 0.0
brm_count_gr -231.8 11.1
I'm also enjoying and learning a lot from @cmcd's and now your excellent remarks. Thank you for chiming in, @martinmodrak!
I can't group by species because there is only one observation per species, but here's how the PP check looks for each of the four main orders (order > family > species) that collectively comprise 650 of the 736 species in the data set:
(This is the model that includes the covariance matrix)
Apparently the data are not too Poisson-y for Galliformes, and consequently the PP check looks terrible for this particular order. Is this a problem? I mean: is it a problem if subsets of a data set do not follow a particular distribution, even if the data set as a whole does?
This is interesting because I think what happens when I do not include the covariance matrix is precisely that there is large error between prediction and data, but the error is not systematic. This would explain why this model passes PP check but not LOO.
This is easy to visualize when I plot the data onto the phylogenetic tree. In the figure below, each tip of a branch is an extant species (present day). The ancestor of all these species is the node at the center of the tree (millions of years ago). Extant species are more related (and assumed to be more correlated) the longer they share a branch; this information is the basis of the covariance matrix. The 1st ring around the tree is the observed counts, and the 2nd and 3rd rings are the mean predicted counts across 100 samples without and with the covariance matrix, respectively.
You can see that, on average, the with-matrix model (3rd ring) does a fairly good job of predicting a particular species' count, whereas the no-matrix model does a terrible one. In the latter case, the mean count is almost always 0, which I interpret as evidence that the data are predicted in the right proportions (as evidenced by the PP check) but regardless of the phylogenetic structure. Since the proportion of zeroes is ~80% (figure in post #5), each species would be predicted a count of 0 in ~80% of the simulations (I hope this makes sense).
I absolutely agree with this interpretation, and I think it is somehow related to what I was trying to say here:
But then again, if it is an inherent incompatibility between the distribution and my data, how come the model handles it just fine when I don't include the covariance matrix (e.g., figure in post #5)?
That's on me for giving such a terrible biological example (no footed species has 0 toes!). The actual trait I am investigating is spurs: bony projections present on the feet or wings of some birds (two examples here). It is absolutely plausible that different causes are driving the absence/presence and the number of spurs. The conditions you suggested for using a binomial distribution do not seem to apply, though.
That is a very interesting thought that hadn't crossed my mind. I'm running a model with the COM-Poisson distribution that you mentioned, but it's going to take a while. Fingers crossed!
Thanks for this excellent tip! I just did some googling about multiverse analysis. I wasn't aware of this term but loved it! I think I was already applying the same principle to other methodological aspects in my study, but I was calling it "sensitivity analysis".
A note of caution: those PP plots are not showing the response distribution, but rather a mixture of the responses over all the members of the group, so I wouldn't over-interpret the plots as looking "Poisson-y". Similarly, saying that "the data set as a whole" follows some distribution is a bit problematic. What we see is whether the data are fit well by the whole model (including, but not limited to, the response distribution).
I think it usually is a problem when only a subset of the data is not fit well by the model, and it might be useful to think about why. In this case, I think the cause is related to the response distribution, as orders other than Galliformes basically have a lot of zeroes, much fewer ones, and very few larger numbers. So the Poisson mean for the individual species is likely close to 0. For Galliformes, the mean for at least some species is probably around 1, and that brings up the conflict around fitting the underdispersion (low numbers of both zeroes and twos).
You can also see that Charadriiformes aren't fit that well.
This is because PP checks judge the whole model, not only the response distribution; the response distribution must be judged conditional on all the covariates. I would also expect that if you broke down the figure in post #5 by families or similar, you would see additional discrepancies. Alternatively, it might also be possible that the genetic covariance for some reason doesn't do a good job of modelling the interdependencies and introduces some bias (it could still lead to better predictive performance, as evidenced by LOO).
Also a wild guess: a Poisson hurdle model (hurdle_poisson) might also be a good fit, as it treats zeroes as completely separate from non-zeroes, so it should allow for a low number of zeroes (as in Galliformes) and a Poisson-like tail (unlike the zero-inflated Poisson which, as the name suggests, can't have fewer zeroes than the Poisson).
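I.e., roughly:

```r
# Hurdle model: one process for zero vs. non-zero, a (truncated) Poisson for the positive counts
fit_hurdle <- brm(
  trait.count ~ continuous1 * continuous2 + (1 | gr(species, cov = A)),
  data   = d,
  data2  = list(A = A),
  family = hurdle_poisson()
)
```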