Is beta regression appropriate for this dataset?

I am hoping to model some pairwise similarity scores between samples in brms. Because of the way these scores have been calculated they can’t be <0.5 or >1. Given that the lowest possible value these scores can take is 0.5, would the beta distribution still be appropriate for this data? I’m aware that the beta distribution is for data bounded by 0 and 1. Also is there anything I would have to think about when specifying priors for this sort of data (pretty new to Bayesian analysis). Thanks!

Two concerns come to my mind in your case. The first is that, as you note, the beta distribution presumes that there is a non-zero probability of observing values between 0 and 0.5. In that way, the model doesn’t align with what you know to be true about your data.

The second is that, for most sets of parameter values (\alpha and \beta), the domain of the beta distribution is (0,1) (i.e. excluding exactly 0 and 1). So if some of your observations are exactly 1, then the model may not fit at all.

The simplest option that addresses both concerns from a technical perspective is to rescale your data such that 0.5 corresponds to 0 (e.g. 2 \times (x - 0.5)), maybe nudging in the end points so that the actual observed range is [0.0001, 0.9999] (2 \times (x - 0.5) \times 0.9998 + 0.0001), or something like that. However, you introduce some arbitrary-ness through the nudging based on the size of the nudge, so you would want to see how sensitive results are to that choice.

If your data are highly skewed such that there are a large number of 1s, then you might be fine fitting the model as-is (that is, the fitted parameter values might be for a form of the distribution where 1 has a valid likelihood). This might be the case if your data are truly beta-distributed but are so skewed that values below 0.5 just aren’t observed in your dataset. I assume that’s not the case here, but it could be a reasonable assumption if the data are highly skewed.

2 Likes

Thanks so much for your help. I’ve rescaled the response variable to be bounded by 0 and 1 as you suggested. This worked well for a smaller dataset, but I am now struggling to get this to converge with a much larger dataset.

The similarity scores were originally pairwise Euclidean distances (which I’ve now converted to similarity scores bounded by 0 and 1). Because I’m only analysing a subset of the distance matrix, the lowest score in the dataset is 0.008 and the highest is 0.79 (but values of 0 and 1 are now theoretically possible). I have tried modelling these similarity scores using a beta distribution, but the posterior predictive checks don’t look particularly good. The distribution for the observed values is narrower/possibly more skewed:

Is this likely to because the beta distribution is not appropriate, or something to do with the way my model is parameterised?

Further details about the dataset/model can be found here https://discourse.mc-stan.org/t/multi-membership-random-effect-has-low-ess/32401
huge apologies for cross-posting…I realised afterwards I maybe should have replied here rather than making a new topic but couldn’t find a way to remove my other post.

I’m glad that was helpful. I’ll try and keep my comments in this thread focused on whether the beta distribution is “appropriate” and leave the discussion of parameterization and effective sample size to your new post.

Your posterior predictive check doesn’t look terrible, at least as far as I’m concerned. There are much worse places to start.

With regards to whether it is “appropriate,” I’m not sure if there is a technical definition for that. It is appropriate in the sense that a) it is unimodal, b) the values are bounded at 0 and 1, and c) it is continuous. But it is inappropriate in the sense that the data derived from your distance matrix were probably not generated from an exact beta distribution. So I think it depends on whether the beta distribution represent the features that are most important to you (e.g. the skew) and if you can interpret the results in a way that has substantive meaning for your particular problem. It wouldn’t be surprising to see someone us a normal distribution because it is easy to interpret and fits well-enough, accepting the fact that it is not bounded at (0,1). Whether that is “appropriate” or not depends on the situation.

As a side note, it might be helpful to know what the original data are that you are generating the distance matrix from.

Another, perhaps bigger, concern here is that as you are modelling pairwise similarity scores, your data points are not independent of one another - each similarity score is calculated as some function of two data points (or sets of data points) f(x, y), and each data point is used to calculate multiple similarity scores. This violates the regression assumption of independent observations, in a way that I’m not sure that multiple-membership models can rescue (though I’m open to being proven wrong on this!). In general, this violation will lead to your uncertainty estimates being too precise, and will get worse the more data points you have.

To avoid this problem, an alternative approach would be to model the variables you used to generate the similarity scores directly, as a function of the variables you are interested in making inference about. This allows you to side-step the non-independence problem. Depending on the exact types of variables you are using, you could also allow for correlations among them, either using set_rescor(TRUE), or using correlated random effects across formulas using the (1 + some_var | id | group) format, or both. Then, you can calculate the similarity function for each of your model predictions, which allows you to properly ascertain the uncertainty in each predicted dissimilarity by using the uncertainty in each variable. You can then compare your predicted similarity scores to your actual scores to see how well your model picks up the similarities.

2 Likes

Thanks both for your replies and insight.

Sorry I should have mentioned that the distance matrix was originally generated using microbiome abundance data. There are hundreds of bacterial species which vary in abundance across samples. These species abundances are compared when computing the distance matrix - i.e. this provides a means of comparing the similarity of microbiome composition across samples. I am interested in analysing whether sample similarity is associated with certain variables (e.g. is sample similarity lower in older age groups because the microbiome becomes more unstable?) whilst controlling for others (e.g. time interval between samples).

I included the multi-membership effect to try to control for the non-independence of pairwise comparisons (as you say, the same sample is involved in many pairwise combinations). This is kind of borrowed from the social network literature and has been applied to small microbiome datasets. But I agree, these parameters do seem to be causing issues in my model possibly because my dataset is much larger.

Because there are so many bacterial species, modelling the abundance of each one separately does not not really seem feasible (if I understand the reply from prototaxites correctly).

1 Like

As a fellow microbiome researcher, I share your pain! Do these data come from 16S rRNA amplicon sequencing or similar, or something else?

It’s definitely possible to model the abundance of each bacterial species, but it’s important to define your question carefully - there are additional difficulties involved in modelling sequencing data, for example, as the read counts are actually measurements of relative abundance, summing to some total number of sequencing reads retrieved for each sample. An increase in relative abundance can result from either an increase in the absolute abundance of a species, or the reduction in the absolute abundances of other species. If you’re willing to ask questions about species presence/absence among samples, the question becomes easier because you can move to a zero-inflation/occupancy model framework, to account for uncertainty in species detection among samples, but it really depends on the structure of your data collection and how much of a problem you think non-detection should be.

But in principle, you can specify a model for relative abundance - either using some kind of multinomial framework (e.g. Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis ), or more commonly, using something like a zero-inflated negative binomial/beta-binomial model (e.g. the models described in Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible) - though these will not necesarily predict counts that sum to the total used to predict them. To specify a model like this, you can convert your data to long format - one row per species per sample, and model the count using something like:

brm(bf(Count ~ offset(Total_read_count) + 0 + Intercept + x + (1 + x | Species),
zi ~ (1 | Species),
data = ....., family = zero_inflated_negbinomial())


Which specifies species-specific responses to your variables of interest as a group effect with defined variances/covariances, allowing rare species to borrow information from common species. Note that these types of models can take quite a while to fit on a personal computer/laptop, depending on the size of the dataset - mine commonly take 1+ days, sometimes up to a week for a large model!

Doing that, you could then model similarity either as a function of presence/absence or predicted abundances, and calculate your similarities. I should add that there are conceptual problems with some of the commonly used similarity indices in ecology, so be sure to pick one that’s valid for the question you’re evaluating (see e.g. Compositional similarity and beta diversity or A better index for analysis of co-occurrence and similarity).

2 Likes

Just an additional note that if your goal is to model the (dis)similarities directly, there is some empirical evidence that adequate uncertainty characterization is available via the Bayesian bootstrap. See https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12710

2 Likes

Thanks both for your suggestions- really appreciated!

Yes it’s 16S rRNA amplicon data. The euclidean distance matrix was computed using abundances that were CLR transformed (so technically they’re Aitchison distances) to try to cope with the compositional problem.

The bootstrapping method could work, although i believe GDMs only take continuous predictors as input so I’ll have to think some more about that. I’ll read the papers you suggested prototaxites and see if I could apply your suggested method to my problem.

Essentially I’m trying to do something very similar to the analysis done in Fig 2C of this paper https://doi.org/10.1016/j.cub.2020.10.075. They used a bootstrapped Kruskal-Wallis test stratified by sample id to control for non-independence of pairwise comparisons. I’ve used this method before with just one covariate, but now I want to control for other things (e.g. if sex is the same or different within each pair, time interval between samples, as well as the type of age group comparison for each pair of samples).

1 Like