Multi-membership random effects model for distance matrices

I am working on a model in brms to study ecosystem stability (response variable: cv_functioning_inverse). The data are derived from pairwise distances between sites, and I aim to account for the shared contributions of sites to these pairwise relationships while also correcting for the non-independence of the observations.

Data Context:

  • Response Variable: cv_functioning_inverse (scaled inverse coefficient of variation for ecosystem functioning).
  • Predictors:
    • beta_turnover_abund : Beta diversity.
    • multivariate_dispersion : Environmental distance.
  • Distance Matrix Context:
    • Sites are represented in pairs (site_1, site_2), and the response variable reflects ecosystem stability between these pairs.
    • Because the distance matrix is symmetric, a given site can appear as either site_1 or site_2.
    • To avoid introducing bias, I randomised the order of site_1 and site_2 during sampling.

Here is a simplified mock example of the dataset:

site_1  site_2  beta_turnover_abund  multivariate_dispersion  cv_functioning_inverse
A       B       0.5                  0.8                      0.6
B       E       0.7                  0.9                      0.4
B       A       0.5                  0.8                      0.6
…

(Note: The order of site_1 and site_2 is randomised to avoid bias in directional relationships.)

Model:

model_formula_mm <- bf(
  scale(cv_functioning_inverse) ~
    scale(beta_turnover_abund) *
    scale(multivariate_dispersion) +
    (scale(beta_turnover_abund) *
       scale(multivariate_dispersion) | mm(site_1, site_2))
)

fitted_model_mm <- brm(
  formula = model_formula_mm,
  data = data,
  prior = priors,
  chains = 4,
  iter = 2000,
  cores = 4,
  seed = 123  # Set seed for reproducibility
)

Problem:

When I run posterior predictive checks (pp_check), the predictions look off compared to the observed data.
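
For reference, the checks were of roughly this form (a representative sketch; the exact plot types here are illustrative, using standard brms/bayesplot options):

# Density overlay of observed data vs. posterior predictive draws
pp_check(fitted_model_mm, type = "dens_overlay", ndraws = 100)

# Compare observed vs. replicated summary statistics and per-observation means
pp_check(fitted_model_mm, type = "stat", stat = "sd")
pp_check(fitted_model_mm, type = "scatter_avg")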

Questions:

  1. Model Appropriateness:
  • Does using mm(site_1, site_2) for the random-effects structure appropriately account for the shared contributions of site_1 and site_2?
  • Is there a better way to model pairwise relationships like this?
  • I would like to keep a regression-style model, as most research uses multiple regression or correlation analysis.
  2. Odd Results:
  • What diagnostics should I focus on to troubleshoot poor posterior predictive checks? Could the issue arise from:
    • Mis-specified priors?
    • An over-complex random-effects structure?
    • Scaling of predictors and response?

Additional Details:

  • Number of sites: 43
  • Number of pairwise observations: ~4200 (randomly drawn)

I would greatly appreciate any insights on improving model fit and diagnosing issues with posterior predictive checks. Thank you in advance!

Additional Output:

> loo_results <- loo(fitted_model_mm, reloo = TRUE)
> loo_results

Computed from 4000 by 4200 log-likelihood matrix.

         Estimate    SE
elpd_loo  -4199.1  61.3
p_loo       163.2   7.0
looic      8398.2 122.7
------
MCSE of elpd_loo is 0.3.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.4, 1.6]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

> fitted_model_mm
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: scale(cv_functioning_inverse) ~ scale(beta_turnover_abund) * scale(multivariate_dispersion) + (scale(beta_turnover_abund) * scale(multivariate_dispersion) | mm(site_1, site_2)) 
   Data: data (Number of observations: 4200) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Multilevel Hyperparameters:
~mmsite_1site_2 (Number of levels: 43) 
                                                                                        Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                                                                               0.80      0.10     0.63     1.03 1.00     1083     1492
sd(scalebeta_turnover_abund)                                                                0.84      0.10     0.67     1.07 1.00     1216     1974
sd(scalemultivariate_dispersion)                                                            0.65      0.09     0.50     0.86 1.01     1314     2307
sd(scalebeta_turnover_abund:scalemultivariate_dispersion)                                   0.80      0.11     0.61     1.04 1.00     1047     1857
cor(Intercept,scalebeta_turnover_abund)                                                    -0.03      0.15    -0.34     0.27 1.00     1178     1512
cor(Intercept,scalemultivariate_dispersion)                                                 0.10      0.16    -0.23     0.41 1.00     1392     1961
cor(scalebeta_turnover_abund,scalemultivariate_dispersion)                                  0.01      0.16    -0.30     0.34 1.00     1184     1299
cor(Intercept,scalebeta_turnover_abund:scalemultivariate_dispersion)                       -0.58      0.11    -0.76    -0.34 1.00     1566     2027
cor(scalebeta_turnover_abund,scalebeta_turnover_abund:scalemultivariate_dispersion)         0.28      0.15    -0.05     0.56 1.00     1322     1944
cor(scalemultivariate_dispersion,scalebeta_turnover_abund:scalemultivariate_dispersion)     0.05      0.19    -0.31     0.41 1.00     1213     1846

Regression Coefficients:
                                                      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept                                                -0.13      0.13    -0.38     0.11 1.01      840     1264
scalebeta_turnover_abund                                  0.17      0.13    -0.10     0.42 1.01     1084     1128
scalemultivariate_dispersion                             -0.53      0.11    -0.73    -0.32 1.00     1012     1610
scalebeta_turnover_abund:scalemultivariate_dispersion     0.20      0.13    -0.05     0.47 1.00     1114     1597

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.64      0.01     0.63     0.66 1.00     4705     2961

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).



From your plots, it seems more likely to be an underpowered likelihood. I don’t know brms, so I can’t really help at that level. You need to find a likelihood that has parameters that will let you reproduce the data for observed y.


I don’t fully understand what your response variable is or how it’s computed, so be aware that some of what I say might not make sense in your specific context.

First, I agree with what @Bob_Carpenter says above: it looks like your model is failing to capture some features of the data, and the way forward is often to incorporate more covariates, different functional forms, and/or different response distributions, which is what Bob means by “finding a likelihood that has parameters that will let you reproduce the data”.

Second, as you allude to in your post, problems arise when dealing with certain forms of pairwise data due to partial non-independence in the response. The canonical example in ecology is when the response represents a dissimilarity measure between two communities, and maybe your response variable suffers from some of the same problems. Since I don’t know what your response variable means or how it’s computed, I’ll focus my comments on the case where the response is a dissimilarity measure.

The issue is that the generative models that we might write down for a set of dissimilarities tend to admit configurations of the response data that are logically impossible. For example, if community A is highly similar to both community B and community C, then it is impossible for the dissimilarity between B and C to be large. Even if the proposed generative model doesn’t yield logically impossible configurations, it may concentrate nearly all of the prior mass in an unintentionally limited region of the space of all allowable configurations. When we treat dissimilarity values as responses in a regression, it is challenging to account for this dependency.

I haven’t closely followed the ecological literature on the topic over the last 2-3 years, but as far as I know we still don’t have a good generative model that we can write down to directly model dissimilarity data in a way that respects the dependencies inherent in the data. Statistical alternatives to a generative model have included null models constructed by permuting rows and/or columns of the species-by-site abundance (or presence) matrix (e.g. Mantel tests and related procedures), models fit to sparse re-samples of the full set of pairwise comparisons (e.g. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1461-0248.2012.01815.x), and models with uncertainty quantified via the Bayesian bootstrap over sites (e.g. https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12710).

Above I say that we “don’t have a good generative model that we can write down to DIRECTLY model dissimilarity data” because, ideally, it could be possible to fit a generative model to the underlying compositional data and recover the pairwise dissimilarities, and their expected relationships to differences in covariate values, post hoc as generated quantities.

Third, I notice that in your example data you happen to have reproduced two rows that are identical except for the order of the sites. If these rows are just identical by happenstance then there might not be any problem, but if they are identical by construction (i.e. due to nonindependence) then this is almost certainly a problem and you’ll want to remove one or the other from the model.

I might be misunderstanding what you mean by 4200 randomly drawn observations. For 43 sites, the number of possible pairwise combinations is 903. Aren’t you overdrawing?
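
In R, the count of unordered pairs can be verified directly:

choose(43, 2)
#> [1] 903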

Also have a look at Distance matrix regression. I believe your issue is discussed and solved there.

Thanks for all the rapid responses.

@Bob_Carpenter I tried using beta regression (by transforming the response to a 0-1 scale), and it seems to fit much better, while the results do not change substantially.

@jsocolar, to clarify: the response and the predictors are all between-site variables.

I acknowledge that spatial autocorrelation can be an issue in such a dataset; however, the experimental setup was specifically designed to avoid it. I will nevertheless try out the options you mentioned in your reply!

To your third point, no, this was not a mistake but intentional by construction. I wanted to ensure that there wouldn’t be any directionality issues when transforming a distance matrix into a data frame. Here’s an example:

> dist(mtcars$qsec[1:5])
     1    2    3    4
2 0.56               
3 2.15 1.59          
4 2.98 2.42 0.83     
5 0.56 0.00 1.59 2.42

If I were to convert this distance matrix into a data frame with identifiers for rows and columns (e.g., row_id and col_id), the structure would become asymmetric. For example:

  • The first element (corresponding to 1) would never appear as a row_id because it’s only in the first row of the matrix.
  • The last element (corresponding to 5) would never appear as a col_id because it’s only in the last row of the matrix.
  • Other elements, like 2 and 3, may appear as both row_id and col_id, but only in specific contexts.

This imbalance arises because the distance matrix inherently stores only the lower (or upper) triangle, which means that certain identifiers won’t be represented symmetrically in the data frame. To ensure symmetry and avoid directionality issues, I constructed the transformation in this way deliberately.

As @amynang pointed out, I am oversampling; if my assumption that directionality affects the random effect is correct, I should just use both the upper and the lower triangle instead.

Do you think I worry too much about this?

Additionally, I tried it out with some dummy data, and it does not seem to make a difference
→ duplicate cases are handled
→ using one half does not seem to be an issue.

Results dummy data

> summary(fit_lower)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: response ~ predictor + (predictor | mm(row_id, col_id)) 
   Data: dist_lower (Number of observations: 4950) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Multilevel Hyperparameters:
~mmrow_idcol_id (Number of levels: 100) 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                0.03      0.02     0.00     0.07 1.00     1096     1199
sd(predictor)                0.03      0.02     0.00     0.06 1.00      494      985
cor(Intercept,predictor)    -0.11      0.57    -0.96     0.94 1.00      619      822

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -0.00      0.01    -0.03     0.02 1.00     2742     1257
predictor     2.00      0.01     1.98     2.02 1.00     2568     1560

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.50      0.00     0.49     0.51 1.01     3909     1267

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

> summary(fit_upper)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: response ~ predictor + (predictor | mm(row_id, col_id)) 
   Data: dist_upper (Number of observations: 4950) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Multilevel Hyperparameters:
~mmrow_idcol_id (Number of levels: 100) 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                0.04      0.02     0.00     0.09 1.00      458      607
sd(predictor)                0.02      0.01     0.00     0.05 1.00      565      864
cor(Intercept,predictor)    -0.28      0.57    -0.98     0.89 1.00      990     1088

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -0.01      0.01    -0.04     0.01 1.00     3322     1480
predictor     2.00      0.01     1.98     2.02 1.00     3316     1346

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.50      0.01     0.49     0.51 1.00     3108     1344

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning message:
There were 3 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 

> summary(fit_both)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: response ~ predictor + (predictor | mm(row_id, col_id)) 
   Data: dist_both (Number of observations: 9900) 
  Draws: 4 chains, each with iter = 1000; warmup = 500; thin = 1;
         total post-warmup draws = 2000

Multilevel Hyperparameters:
~mmrow_idcol_id (Number of levels: 100) 
                         Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sd(Intercept)                0.01      0.01     0.00     0.04 1.01      740      821
sd(predictor)                0.01      0.01     0.00     0.03 1.01      557      805
cor(Intercept,predictor)    -0.19      0.58    -0.98     0.91 1.01      694      826

Regression Coefficients:
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept     0.00      0.01    -0.01     0.02 1.00     3595     1175
predictor     1.99      0.01     1.98     2.01 1.01     2766     1286

Further Distributional Parameters:
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.50      0.00     0.49     0.50 1.00     3083     1411

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Warning message:
There were 3 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup 

Code


# Load libraries
library(brms)
library(dplyr)

# Set seed for reproducibility
set.seed(123)

# Simulate data: Distance matrix for 100 sites
n_sites <- 100
dist_matrix <- as.matrix(dist(rnorm(n_sites)))  # Random distances
rownames(dist_matrix) <- colnames(dist_matrix) <- 1:n_sites

### Extract lower triangle
lower <- which(lower.tri(dist_matrix), arr.ind = TRUE)
dist_lower <- data.frame(
  row_id = lower[, 1],
  col_id = lower[, 2],
  distance = dist_matrix[lower]
)

# Simulate response and predictor for lower triangle
dist_lower <- dist_lower %>%
  mutate(
    predictor = distance + rnorm(n(), mean = 0, sd = 0.2),  # Predictor correlated with distance
    response = 2 * predictor + rnorm(n(), mean = 0, sd = 0.5)  # Response correlated with predictor
  )

### Extract upper triangle
upper <- which(upper.tri(dist_matrix), arr.ind = TRUE)
dist_upper <- data.frame(
  row_id = upper[, 1],
  col_id = upper[, 2],
  distance = dist_matrix[upper]
)

# Simulate response and predictor for upper triangle
dist_upper <- dist_upper %>%
  mutate(
    predictor = distance + rnorm(n(), mean = 0, sd = 0.2),
    response = 2 * predictor + rnorm(n(), mean = 0, sd = 0.5)
  )

### Combine both triangles
dist_both <- rbind(
  dist_lower,
  dist_upper
) %>%
  mutate(
    predictor = distance + rnorm(n(), mean = 0, sd = 0.2),
    response = 2 * predictor + rnorm(n(), mean = 0, sd = 0.5)
  )

### Inspect the data
print(dist_lower)
print(dist_upper)
print(dist_both)

table(dist_lower$row_id)
table(dist_upper$row_id)
table(dist_both$row_id)

### Fit models for each case

# 1. Lower triangle
fit_lower <- brm(
  response ~ predictor + (predictor | mm(row_id, col_id)),
  data = dist_lower,
  family = gaussian(),
  cores = 4,
  iter = 1000
)

# 2. Upper triangle
fit_upper <- brm(
  response ~ predictor + (predictor | mm(row_id, col_id)),
  data = dist_upper,
  family = gaussian(),
  cores = 4,
  iter = 1000
)

# 3. Both triangles
fit_both <- brm(
  response ~ predictor + (predictor | mm(row_id, col_id)),
  data = dist_both,
  family = gaussian(),
  cores = 4,
  iter = 1000
)

### Summarize and plot results for each case
summary(fit_lower)
pp_check(fit_lower)
loo(fit_lower)

summary(fit_upper)
pp_check(fit_upper)
loo(fit_upper)

summary(fit_both)
pp_check(fit_both)
loo(fit_both)

The way you generate your dummy data, the information in the upper and lower triangles is not identical. If it were, fitting the same model with the same seed should produce identical output. Compare ranef(fit_lower) with ranef(fit_upper) for your real data. The model does not take into account whether a site comes first or second in a multi-membership term. There is no directionality to worry about.


All of the issues on this thread have to do with the non-independence of the data input rows.

  1. You cannot simply duplicate rows, by resampling or otherwise, any more than you can duplicate or upsample the data in an ordinary linear regression and expect to still get well-calibrated inference. This is a big no-no. Fortunately, as mentioned above, there is no directionality in your model to worry about. Just use one row per entry in the lower triangle of the matrix.
  2. Even without duplicating rows, dissimilarity matrices have a complicated dependency structure. Unfortunately, most solutions mentioned on this thread and in other linked threads, including multi-membership models, do not adequately account for it. In the event that the matrix is a distance matrix, the key structure is that the response is guaranteed to satisfy the triangle inequality. In the event that the matrix is a dissimilarity matrix, the structure is similar but much more complicated, depending on the dissimilarity metric. If it’s a point of contention, I am happy to expound on why multi-membership models are seriously inadequate to capture the dependency structures inherent when the response is a non-sparse distance/dissimilarity matrix.
  3. When the response data in a regression have a complicated dependency structure, it can be important to capture this, and failing to do so can lead to bad inference. @bschrobru, I think you (rightly) care about this issue, since you mention accounting for the non-independence of the pairwise observations in the second sentence of your original post.
  4. However, it is not clear (to me at least) what the response data represent, whether they have a difficult dependency structure, and if so, what form this dependency takes and whether it would be adequately modeled by a multi-membership term. Note that it does not matter if the covariate data have a complicated dependency structure; what matters is the response. To say anything further, I would want to understand how the response data are computed for a given pair. You mention that it has to do with a coefficient of variation, but AFAIK the coefficient of variation is meaningful for univariate data with more than 2 entries, and so I don’t understand what this means in a pairwise context.

I’d certainly be grateful if you could say more about this.

I am also a bit out of touch with the literature but to me the multimembership approach seems similar to the “BetaBayes” approach described here.

Thanks again for your detailed comment. I would be very interested if you could expand on the topics mentioned above.

Explanation of the variables:

The response cv_functioning_inverse is the CV of the ecosystem functioning (EF) values of the two sites, so it is calculated from just two values. It is therefore the relative difference in ecosystem functioning, corrected for the magnitude of ecosystem functioning.
If two sites have EF of 4 g and 7 g, their CV would be ~0.27; if two sites have 40 g and 70 g, it would also be ~0.27.
The CV or its inverse is commonly used as a measure of ecosystem stability across time points or places. Alternatively, I could also calculate the relative difference.
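
A quick numeric check of this scale invariance (assuming a population standard deviation, i.e. dividing by n, which the example figures imply):

# CV computed with the population (divide-by-n) standard deviation
cv_pop <- function(x) sqrt(mean((x - mean(x))^2)) / mean(x)

cv_pop(c(4, 7))    # 0.273
cv_pop(c(40, 70))  # 0.273 -- identical: the CV is invariant to scale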

Does this violate the triangle inequality?

I am not sure if I grasp what you mean by the dependence structure of the response variable.

beta_turnover_abund → similar to Bray-Curtis dissimilarity
multivariate_dispersion → Euclidean distance between the environments

Non-independence of observations

When the response data in a regression have a complicated dependency structure, it can be important to capture this, and failing to do so can lead to bad inference. @bschrobru , I think you (rightly) care about this issue since you mention accounting for the non-independence of the pairwise observations in the second sentence of your original post.

This is correct. I was looking for a way to estimate the interaction effect of environmental distance and beta diversity on relative functioning differences, which is what I am especially interested in. As I can calculate these quantities for any pair of communities, I end up with three distance matrices.

There seemed to be no “standard” method of doing this, so I resorted to a mixed regression model with random intercept and random slope.

bf(
  scale(cv_functioning_inverse) ~
    scale(beta_turnover_abund) *
    scale(multivariate_dispersion) +
    (scale(beta_turnover_abund) *
       scale(multivariate_dispersion) | mm(site_1, site_2))
)
Please let me know if this is feasible or if I should take another route.

I’m skeptical of whether, and how often, the BetaBayes approach handles these issues well enough to yield well-calibrated inference. It’s easiest to see the problem by considering scenarios where sites are perfectly similar and/or dissimilar, which is a little bit awkward because the data are on the boundary of the response space. If you’re willing to bear with me and ignore the boundary issues for a moment, I think we can build the key intuition with a few examples.

First, consider the case where we observe a sample of 10 sites from some population of sites, we want to know what the average population dissimilarity is, and we find that the 10 sites are all identical. If we naively treat all the pairwise dissimilarities as independent, then we would suppose that we have a sample size of 45 dissimilarities of zero, and we would tightly constrain our estimate of the mean dissimilarity to be very close to zero. The problem is that we only have 9 independent estimates of zero dissimilarity, not 45. Once we have observed that each of sites 2 through 10 is identical to site 1 (9 observations), we gain no new information from observing that any of sites 2 through 10 are identical to one another.

Ok, but BetaBayes can potentially handle this: it just estimates a set of ten very low random intercepts, one for each of the 10 sites, which together are sufficient to account for all of the very low dissimilarities. But now consider a second example:

Suppose we have 20 sites, and denote sites 1 through 10 as “set A” and sites 11 through 20 as “set B”. Suppose that all sites within set A are identical to one another, all sites within set B are identical to one another, and set A and set B are quite distinct. Let’s also suppose that there’s a categorical covariate at play, with all sites in set A sharing a value of a and all sites in set B sharing a value of b. We want to estimate the average dissimilarity between pairs of sites with covariate values a, a, the average dissimilarity between pairs of sites with covariate values b, b, and the average dissimilarity between pairs of sites with covariate values a, b. The problem is: every site is similar to 9 other sites and dissimilar to 10 other sites, so there is no variation here that we can attribute to some sites being generically “more dissimilar” than other sites. The multimembership intercepts in a model like BetaBayes can do nothing other than say that all sites are, on average, moderately dissimilar from one another, and they don’t capture any of the dependency structure. For the same reasons as in the first example above, we will end up overconfident in our estimates, and in this case the multimembership intercepts don’t ameliorate the problem.
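
A scaled-down numeric check of this example, with 3 sites per set and ordinary least squares standing in for the additive multimembership intercepts:

# Sets A = sites 1-3 and B = sites 4-6; dissimilarity 0 within a set, 1 between
sites <- 1:6
pairs <- t(combn(sites, 2))
d <- as.numeric((pairs[, 1] <= 3) != (pairs[, 2] <= 3))

# Additive design: the row for pair (i, j) has a 1 in columns i and j
X <- matrix(0, nrow(pairs), length(sites))
X[cbind(seq_len(nrow(pairs)), pairs[, 1])] <- 1
X[cbind(seq_len(nrow(pairs)), pairs[, 2])] <- 1

fit <- lm.fit(X, d)
round(drop(X %*% fit$coefficients), 2)
# All 15 fitted values are 0.6: the additive intercepts cannot separate
# within-set pairs (true d = 0) from between-set pairs (true d = 1)

# A one-dimensional "difference" parameterisation, by contrast, fits exactly
alpha <- c(0, 0, 0, 1, 1, 1)
(alpha[pairs[, 1]] - alpha[pairs[, 2]])^2  # reproduces d perfectly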

Ok, but the above example is specially constructed to ensure that the multimembership intercepts cannot help. What about a less contrived construction? Unfortunately the problems persist. Just to build intuition, consider a situation like the above, but where set A contains 37 sites and set B contains 3. Now the multimembership intercepts get to do something useful. They assign low values to all of the set A sites and high values to all of the set B sites, capturing that set B sites are, on average, more dissimilar to other sites than set A sites are. The problem is that there’s no way to do this that doesn’t also imply that set B sites are extremely dissimilar to one another. Each set B site needs a very high intercept to capture its high dissimilarity from other sites. When we come to a comparison of two set B sites, we add that intercept twice.

One irony of BetaBayes is that it is explicitly based on the Bradley-Terry model, in which the random intercepts are differenced (i.e. subtracted), with the modification that the random intercepts are instead added. This is ironic to me because subtracting the intercepts can be thought of as representing the dissimilarities in a one-dimensional ordination (and introducing a second set of intercepts could provide a two-dimensional ordination, and so forth). I haven’t deeply investigated the predictive performance or calibration of BetaBayes versus the actual Bradley-Terry model on real data, but I find it hard to believe that the original Bradley-Terry model isn’t a superior model for these sorts of data, though even the Bradley-Terry model fails to capture the dependency structure and likely yields overprecise parameter estimates whenever a one-dimensional ordination fails to capture most of the information contained in higher-order ordinations of the sites.


The first thing to understand is how your measure of the difference between two communities (the CV) behaves. Based on your numeric example, it seems that you are using an estimator for the standard deviation that divides by n rather than n - 1. But in this case, with only 2 elements, the coefficient of variation is simply equal to half the absolute difference divided by the mean. The factor of a half can be dropped without loss of information, and you can simply present this metric as the absolute difference divided by the mean. Or alternatively, you can keep the factor of a half and refer to this as the absolute difference divided by the sum. Without loss of generality, you can order the data so that the first site in a pair is always the larger. Then your response is simply the difference divided by the sum (or the mean)–i.e. we can drop the absolute value in the numerator.
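
A numeric check of this identity (again with the divide-by-n standard deviation):

# With two values, the CV equals |a - b| / (a + b)
a <- 7; b <- 4
sqrt(mean((c(a, b) - mean(c(a, b)))^2)) / mean(c(a, b))  # 0.273
abs(a - b) / (a + b)                                     # 0.273 -- the same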

It seems to me that in general there should be a pairwise non-independence problem here comparable to the one described above. In case it’s of interest, consider what happens if we replace division by the mean with division by the larger value. Then if we have larger value a and smaller value b, we get 1 - \frac{b}{a}, which without loss of information we can represent as simply \frac{b}{a}. If we work on the log scale, with \alpha = \log(a) and \beta = \log(b), we can represent this as \log(\frac{b}{a}) = \beta - \alpha, which is a straight-up distance that necessarily respects the triangle inequality. So you can see that your response is closely related to other responses with known nontrivial dependency structures.
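
A small check that the log-scale version behaves like a distance; the EF values here are made up for illustration:

# Pairwise |log(x_i) - log(x_j)| is a true metric on positive values
ef <- c(4, 7, 40, 70)              # hypothetical ecosystem-functioning values
d_log <- as.matrix(dist(log(ef)))  # entries are |log(x_i) - log(x_j)|

# Verify the triangle inequality for every triple (i, j, k)
ok <- TRUE
n <- nrow(d_log)
for (i in 1:n) for (j in 1:n) for (k in 1:n)
  ok <- ok && d_log[i, j] <= d_log[i, k] + d_log[k, j]
ok  # TRUE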

Edit: I want to end this with some positive advice about how you might proceed. Here’s one suggestion. Your response variable, ecosystem functioning, is fundamentally a straightforward, univariate quantity, which you transform into a dissimilarity of sorts. Reading between the lines, I think the reason you transform the response to a dissimilarity is so that you can regress it against pairwise dissimilarities between sites. I’m guessing (just a guess) that your whole analysis will become cleaner, both conceptually and computationally, if you instead keep the straightforward response variable (ecosystem functioning) and transform your dissimilarity-based predictors into site-specific covariate measures. That is, instead of regressing pairwise differences in ecosystem functioning on pairwise dissimilarity metrics and environmental distances, regress actual ecosystem functioning on the first one to several axes of an ordination of the sites based on the desired dissimilarity metric (i.e. NMDS) and the first one to several principal components of the multivariate environmental data. Then your results will give you a direct way to estimate how different ecosystem functioning is predicted to be at sites of a given distance in ordination space or environmental space, while also enabling you to predict what the actual values should be at both sites and avoiding all of the non-independence issues discussed above.
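
A rough sketch of this suggestion. Everything here is hypothetical: comm stands for a site-by-species abundance matrix, env for a site-by-variable environmental matrix, and ef for a vector of per-site ecosystem functioning; the ordination uses metaMDS() from the vegan package:

library(vegan)  # for metaMDS()
library(brms)

# Ordinate the sites from the community data using the desired dissimilarity
ord <- metaMDS(comm, distance = "bray", k = 2)  # two-axis NMDS
nmds_scores <- scores(ord, display = "sites")

# Summarise the multivariate environment by its leading principal components
env_pca <- prcomp(env, scale. = TRUE)

site_data <- data.frame(
  ef    = ef,
  nmds1 = nmds_scores[, 1],
  nmds2 = nmds_scores[, 2],
  pc1   = env_pca$x[, 1],
  pc2   = env_pca$x[, 2]
)

# Model ecosystem functioning directly at the site level (one row per site)
fit_ord <- brm(ef ~ nmds1 + nmds2 + pc1 + pc2, data = site_data)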


Thank you, this was really helpful. I think this is the key issue I was missing: when we are modelling dissimilarity, it’s not just the common influence of B in the pairings A-B and B-C that we want to account for. I believe the multi-membership model can deal with that just fine. The problem is that the underlying response at A, B, C… is baked into each pairwise difference.

This is quite distinct from a situation where the magnitude of a response variable is under the influence of two out of N “elements” but the response is not a comparison of the two elements.

Am I moving in the right direction?


One way to say it is this: if A is dissimilar to B, then there is no set of multimembership intercepts for A, B, C that will capture the fact that if C is similar to A, it must also be dissimilar to B.

Again, thank you for your insightful comments and the time you spent on this!

A key question I am focussing on is how beta diversity stabilizes communities; since that requires working with at least two sites, I need to stick with the dissimilarity plan.

However, since the ratio approach (or its log), dividing the smaller value by the bigger one, gives me a response equivalent to the CV, I might follow this path.

I know that the analysis is quite sensitive to spatial autocorrelation, but I tried to address this with my sampling strategy.

This is quite distinct from a situation where the magnitude of a response variable is under the influence of two out of N “elements” but the response is not a comparison of the two elements.
Am I moving in the right direction?

I think this is not the case for my analysis, since the response, too, is a comparison of the functioning values of two sites.


Just to reiterate, and then I’ll shut up :), if you have a low-stress ordination of the sites in 2 or 3 dimensions, and you can model how ecosystem function varies along the ordination axes, then you can readily predict the expected difference in ecosystem functioning for a pair of sites with any given pairwise dissimilarity.
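
Continuing the hypothetical sketch from my earlier reply (the fit_ord model), the expected difference in functioning for any pair of positions in ordination/environmental space is then a derived quantity:

# Posterior of expected EF at two hypothetical positions
newdat <- data.frame(
  nmds1 = c(-0.5, 0.8), nmds2 = c(0.1, -0.3),
  pc1   = c(1.0, -1.2), pc2   = c(0.0, 0.4)
)
epred <- posterior_epred(fit_ord, newdata = newdat)  # draws x 2 matrix

# Posterior of the pairwise difference in expected functioning
diff_draws <- epred[, 1] - epred[, 2]
quantile(diff_draws, c(0.025, 0.5, 0.975))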

I’ve been thinking about this for a while, because I am trying to determine how the genetic distance between pairs of populations is related to the geographical distance and the ecological distance between them. I’ve been following the approach White et al. describe for Bray-Curtis distances between ecological communities. It’s similar in spirit to the mm() approach in brms, but instead of adding the individual site effects, it uses the squared difference (or absolute difference) between them. Specifically, if I understand things correctly, brms uses something like this for the mean model (as does BetaBayes):

\mu_{ij} = \beta_0 + \beta_1x + \alpha_{i,s} + \alpha_{j,s} ,

where \alpha_{i,s} is the individual random effect. I see how this approach fails to capture the structure you describe.

White et al. use this approach

\mu_{ij} = \beta_0 + \beta_1x + (\alpha_{i,s} - \alpha_{j,s})^2.

I think that addresses most of your concerns, or at least the concerns raised by the example you proposed. You have to code this in Stan, but it seems to work. One problem is that in the data I’m working with, the posteriors of some \alpha_{i,s} are bimodal, presumably reflecting a non-identifiability inherent in only the squared difference (\alpha_{i,s} - \alpha_{j,s})^2 affecting the posterior, not the \alpha_{i,s} individually.

Any thoughts?


This is, in effect, an ordination of the sites along a single axis (\alpha), with the assumption that the pairwise dissimilarity between sites scales with the square of the difference between the sites along the ordination axis. In the event that we use an absolute value instead of squaring the difference, I think this model (without covariates) would more-or-less literally be a one-dimensional ordination such as might be arrived at via NMDS with one axis.

We expect this model to be bimodal due to the fact that we can negate (i.e. flip) \alpha and arrive at the identical model. This can be dealt with by imposing a sign constraint (i.e. a positivity constraint) on one element of \alpha, preferably an element that sits far to one side of the ordination (i.e. one with an extreme value of \alpha).

It’s worth bearing in mind that sometimes a one-dimensional ordination is badly inadequate to capture the structure of pairwise dissimilarities (i.e. ordinations can have high stress in one dimension). It would always be possible to add a second (or third or fourth…) axis with a second set of Bradley-Terry-style intercepts \beta. On the other hand, it’s also worth bearing in mind that the Bradley-Terry-style intercepts need account only for the variation that is not well explained by the covariates in the model, and do not necessarily need to account for variation that is readily explained by the covariates.

I don’t have a strong intuition for under what conditions this model yields well-calibrated inference on the fixed effects. I imagine some tuning of the priors is likely necessary to appropriately control how trigger-happy the model is to attribute variation to covariates versus to the random effects. It’s very hard to develop intuition about these processes, because when we think about community assembly (or genetic variation) in nature, we don’t imagine that nature first rolls her dice to sample the dissimilarities and then populates the community (or genome) with species (or haplotypes, alleles, base pairs); instead, we imagine that nature rolls her dice to sample the communities via some generative process, and the dissimilarities are then derived quantities. My personal opinion has drifted more and more towards: model the actual community (or genetic) data based on covariates (and, if necessary, latent factors, following Warton et al., which is known to be a principled model-based ordination technique for the residual variation in the communities) and treat the dissimilarities as derived quantities whenever possible. (With that said, I do still use dissimilarity modeling sometimes, but I kind of hold my nose when doing so.)


Thank you for your very detailed and thoughtful comments. You are absolutely right that nature isn’t generating between-population differences, per se. It’s producing populations that have certain characteristics, and when we estimate differences between those populations, we’re throwing away some of the information. On the other hand, we treat many types of genetic data as selectively neutral, meaning that we can’t predict a priori what the genetic composition of a population should be, only how different two populations (or a set of populations) might be from one another, so analysis of genetic distances isn’t an unreasonable thing to do. A more satisfactory approach might be to estimate a surface across a region at every locus for which you have data, to imagine all of those surfaces as samples from the same underlying surface, and to make inferences on the underlying surface. I’ll have to think more about that, but I’m afraid it might be computationally infeasible. I’m not familiar with Warton et al., but I will definitely take a look at it. In particular, I want to understand what’s different about the latent-variable approach. At first blush, the site random effects (before taking the squared or absolute difference) seem like latent variables to me.

In the meantime, I agree that the squared difference captures only one dimension of the between-site differences and that we expect the posterior for individual site effects to be multimodal. After further exploration and model checking, however, I feel pretty comfortable with the inferences I’m getting. Here’s why:

The data consist of 171 pairwise Fst values between 19 populations, the squared differences in 8 environmental covariates, and a 0/1 indicator (goldblatt) for whether the populations are in the same or a different vegetation community. Here’s what the summary statistics look like:

With the exception of the summer covariate, the bulk and tail ESS for all of the parameters look pretty good (4 chains, 1000 warmup iterations, 1000 sample iterations). The posterior predictions also look pretty good.

There is a small bump in the density for the observed data around a genetic distance of 0.2 that isn’t picked up in the posterior predictions, but the discrepancy seems pretty small. Maybe that’s where the squared difference in the random effects isn’t fully accounting for the dependency in pairwise distances. Also, for what it’s worth, the Bayes R^2 is about 58%.

I’m most interested in whether any associations between pairwise Fst and the environmental covariates are well supported (in the sense of having a clear positive or negative sign). The positive association with geographical distance is expected and clear. The negative association with summer looks clear, but the low bulk and tail ESS estimates give me pause. The negative association with map is suggestive. Conditional-effects plots (hand-coded: since I’m using cmdstanr, I couldn’t use conditional_effects from brms) reinforce the impression of a strong positive association with geographical distance, weakly supported (at best) negative associations with summer and map, and no clear associations with any of the other covariates.