R2 for Bayesian models v. Nakagawa delta method

I am looking to sensibly partition the variance from a mixed-effects logit model fit via Stan. I understand the typical Bayesian R^2 (Bayesian R2 and LOO-R2) is:

R^2 = \frac{\sigma^2_{\mu}}{\sigma^2_{\mu} + \sigma^2_{\epsilon}}.

I am, however, interested in obtaining estimates of the share of variance explained by each model component. So suppose a set of fixed effects with variance \sigma^2_f and a random effect with variance \sigma^2_a; then I’d like to calculate:

R_{f}^2 = \frac{\sigma^2_{f}}{\sigma^2_{f} + \sigma^2_{a} + \sigma^2_{\epsilon}}, \mbox{ and } R_{a}^2 = \frac{\sigma^2_{a}}{\sigma^2_{f} + \sigma^2_{a} + \sigma^2_{\epsilon}}.
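For concreteness, here is a minimal sketch (in Python) of how such a partition can be computed draw-by-draw. The arrays sigma2_f, sigma2_a and sigma2_eps are made-up placeholders standing in for real posterior samples of the variance components, not output from any particular package:

```python
import numpy as np

# Hypothetical posterior draws (S draws) of the component-specific variances:
# sigma2_f = per-draw variance of the fixed-effect part of the linear predictor,
# sigma2_a = per-draw variance of the random-effect part,
# sigma2_eps = per-draw residual variance (on whatever scale is settled on).
rng = np.random.default_rng(1)
S = 4000
sigma2_f = rng.gamma(2.0, 0.5, S)    # placeholders standing in for real draws
sigma2_a = rng.gamma(2.0, 0.3, S)
sigma2_eps = rng.gamma(2.0, 1.0, S)

total = sigma2_f + sigma2_a + sigma2_eps
R2_f = sigma2_f / total
R2_a = sigma2_a / total

print("posterior median R2_f:", np.median(R2_f))
print("posterior median R2_a:", np.median(R2_a))
# By construction R2_f + R2_a + sigma2_eps/total = 1 for every draw,
# so the partition sums to one as long as all pieces share the same scale.
```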

Classic Bayesian approaches for logistic regression calculate \sigma^2_{\epsilon} at the predicted-probability level, as opposed to the latent logit level. I try to stick to this and calculate the component-specific variance with an approach which uses the predicted probabilities at each level (similarly to what is suggested in the old Gelman (2006) paper in the context of linear models). For every posterior simulation, I calculate:

\widehat{\sigma}^2_a = Var[\mbox{ilogit}(\theta_a)],

But this has the very undesirable property of producing R^2 values that do not sum to 1 (and are at times larger than 1 on their own), regardless of which method for calculating the residual variance is chosen.
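As a toy illustration of the problem (not the actual model; the normal "components" f and a below are made up purely for demonstration), here is a quick check that component variances taken on the probability scale need not add up to the variance of the combined predictor:

```python
import numpy as np

def ilogit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
n = 100_000
f = rng.normal(0.0, 2.0, n)          # toy "fixed-effect" contribution on the logit scale
a = rng.normal(0.0, 2.0, n)          # toy "random-effect" contribution on the logit scale

var_f_prob = np.var(ilogit(f))       # component variance taken on the probability scale
var_a_prob = np.var(ilogit(a))
var_total_prob = np.var(ilogit(f + a))

print(var_f_prob + var_a_prob, "vs", var_total_prob)
# The component variances on the probability scale do not add up to the
# variance of the combined predictor, so ratios built from them need not sum to 1.
```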

The Nakagawa et al. (2017) paper offers an interesting solution, using a corrected delta method to estimate the residual variance at the latent (logit) level. Based on other posts in this forum (e.g. here), I can see people are familiar with the paper. However, it is formulated from a frequentist perspective, and I'm having trouble applying it to a Bayesian model.

The formula they derive for the latent-scale variance is:

\sigma^2_\epsilon \approx \frac{1}{\hat{p}(1-\hat{p})};\\ \hat{p} \approx \mbox{ilogit}(\beta_0) + 0.5\,\sigma^2_\tau \frac{\exp(\beta_0)\left(1-\exp(\beta_0)\right)}{(1+\exp(\beta_0))^3};

where \beta_0 is the model intercept, \sigma^2_\tau is the sum of all random-effect variance components in the model, and \hat{p} is the best guess for the expected value of our Bernoulli-distributed outcome variable y.
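A minimal sketch of how this formula could be evaluated over posterior draws; beta0_draws and sigma2_tau_draws are hypothetical arrays standing in for the actual posterior samples of the intercept and of the summed random-effect variances (or Var(\mu), as discussed below):

```python
import numpy as np

def nakagawa_latent_residual_var(beta0, sigma2_tau):
    """Latent-scale (logit) residual variance from the Nakagawa et al. (2017)
    delta-method formula, evaluated element-wise over posterior draws."""
    e = np.exp(beta0)
    p_hat = e / (1.0 + e) + 0.5 * sigma2_tau * e * (1.0 - e) / (1.0 + e) ** 3
    return 1.0 / (p_hat * (1.0 - p_hat))

# Hypothetical posterior draws of the intercept and of the summed
# random-effect variances:
rng = np.random.default_rng(3)
beta0_draws = rng.normal(-0.5, 0.1, 4000)
sigma2_tau_draws = rng.gamma(2.0, 0.25, 4000)

sigma2_eps_draws = nakagawa_latent_residual_var(beta0_draws, sigma2_tau_draws)
print(np.percentile(sigma2_eps_draws, [5, 50, 95]))
```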

If I calculate the R^2 replacing \sigma^2_\tau with the variance of my linear predictor, Var(\mu), I get the following:

This would seem to suggest it is a sensible approach, which both gives an R^2 consistent with other R^2 measures and allows me to partition the variance as I want to.

I have the following questions which I hope folks on the forum may help with:

  1. The Nakagawa formula is very insistent on the distinction between random and fixed effects - namely \sigma^2_\tau is meant to be the sum of the variances of random-effect components (not including the fixed-effects variance) - but from a Bayesian perspective I do not think there is any actual difference – everything is a ‘random’ effect (or better, a ‘modeled’ effect) – so I thought it would be sensible to also include the variance of the fixed-effect components in \sigma^2_\tau – hence my choice to replace it with Var(\mu).
    Is this a sensible approach, in your estimation?

  2. Relatedly, the paper suggests it is possible to replace \beta_0 with the fixed effects – namely \beta_0 + \sum^K_{k=1}\beta_k x_{i,k}. Here I do not understand why I could not (or should not) include the random effects in this predictor – again, I’m not sure how the conceptual difference between frequentist and Bayesian random effects interacts with the Nakagawa formula;

  3. Is it somehow possible to obtain the variance partitioning I want - namely a set of R^2 values, one for each model component of interest, which also sum to 1 - within the loo framework?

Thanks in advance for your help and time.

I don’t have answers to your specific questions, but if you haven’t seen it, this paper proposes a full framework for variance partitioning in a Bayesian context that you may find useful, and which addresses corner cases found in a lot of more complex models.

Doors and corners of variance partitioning in statistical ecology
Torsti Schulz, Marjo Saastamoinen, Jarno Vanhatalo
bioRxiv 2021.10.17.464682; doi: 10.1101/2021.10.17.464682


Thanks for sharing that paper - it is very interesting and relevant, especially the methods to account for covariance amongst linear terms. In the terms of this paper, my example above would be a strictly ‘diagonal variance partition’.

Still, if I understand the paper correctly, the authors do not deal with a full variance-partitioning exercise, but rather stick to partitioning the linear predictor. This is cool and very useful, but we still can't calculate R^2 if the residuals are not linearisable.

The authors say:

The approach is not limited to partitioning the variation in the linear predictor, but can also be used for the more traditional partitioning of explained variation, R^2, when working with a linear model or a model with linearizable residuals.

I guess one could combine Nakagawa's formula for linearising the variance of a Bernoulli-distributed variable with the linear-predictor partitioning proposed in this paper to obtain a fully partitionable measure of R^2, but my questions about how exactly to do that in a Bayesian context remain, unfortunately, unanswered.

Hi folks,

I’ve come back to this a few years later and I think things are a bit clearer for me – I’ll put down the reasoning in case it’s helpful for anyone.


In a binomial logit GLMM, I have

x_i \mid \beta_{j[i]} \sim Binomial(n_i,\pi_i);
\mu_i = \log\left(\frac{\pi_i}{1-\pi_i}\right) = \alpha + \beta_{j[i]};
\beta_j \sim N(0,{\sigma_\beta}^2).

E[x_i \mid \beta_{j[i]} ] = n_i\pi_i;
Var[x_i \mid \beta_{j[i]} ] = n_i\pi_i(1-\pi_i).

We need to estimate the residual variance to calculate R^2. By the law of total variance, we can decompose the total variation in the binomial counts into observation-level (residual) variance plus random-effect variance:

Var[x_i] = E_\beta[Var(x_i\mid \beta_{j[i]})] + Var_\beta[E[x_i \mid \beta_{j[i]}]]

Here E_\beta[Var(x_i\mid \beta_{j[i]})] describes how much the counts fluctuate after we fix the random (and fixed) effects to their realised values, whilst Var_\beta[E[x_i \mid \beta_{j[i]}]] describes the variation in expected counts across the j groups.
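A quick Monte Carlo check of this decomposition under the model above (the parameter values \alpha = -0.5, \sigma_\beta = 1 and n_i = 20 are arbitrary choices for illustration):

```python
import numpy as np

def ilogit(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(4)
alpha, sigma_beta, n_trials = -0.5, 1.0, 20
J = 200_000                              # many simulated groups, one observation each

beta = rng.normal(0.0, sigma_beta, J)    # group-level effects
pi = ilogit(alpha + beta)                # conditional success probabilities
x = rng.binomial(n_trials, pi)           # binomial counts

lhs = np.var(x)                                   # Var[x_i]
rhs = np.mean(n_trials * pi * (1 - pi)) \
    + np.var(n_trials * pi)                       # E_beta[Var(x|beta)] + Var_beta[E(x|beta)]
print(lhs, "vs", rhs)
```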

The residual variance across counts is computable, though not immediately useful – we wish our R^2 metric to be computed on the scale of the linear predictor, where different sources of variance can be compared on the same scale.

Given the number of Bernoulli trials n_i is constant, we can focus exclusively on variation arising from conditional probabilities \pi_i. We do so on the logit scale for the reasons mentioned above:

Var[\mu_i] = Var[logit(\pi_i)] =\\ E_\beta[Var(logit(\pi_i)\mid \beta_{j[i]})] + Var_\beta[E[logit(\pi_i)\mid \beta_{j[i]}]] =\\ \sigma^2_\epsilon + \sigma^2_\beta

Hence the quantity we want to estimate is the residual variance on the logit scale: \sigma^2_\epsilon = E_\beta[Var(logit(\pi_i)\mid \beta_{j[i]})].

One issue we face is that we never observe \pi_i – we do not sample this quantity directly, so it’s challenging to describe its sampling variability. To overcome this, define the following sample estimator for the conditional probability \pi_i:

\hat{\pi}_i = \frac{x_i}{n_i};
E[\hat{\pi}_i \mid \beta_{j[i]}] = \pi_i;
Var[\hat{\pi}_i\mid \beta_{j[i]}] =\frac{\pi_i(1-\pi_i)}{n_i}.

Now that we have this, we can get Var(logit(\hat{\pi}_i)\mid \beta_{j[i]}) via the delta method:

g(x) \approx g(E[x]) + g'(E[x])(x - E[x]);
Var\left[ g(x) \right] \approx \left( g'(E[x]) \right)^2 Var[x].

Var[ g(\hat{\pi}_i) \mid \beta_{j[i]}] = Var\left[ \log\left( \frac{\hat{\pi}_i}{1-\hat{\pi}_i} \right) \,\middle|\, \beta_{j[i]} \right] \approx \left( \frac{1}{\pi_i(1-\pi_i)} \right)^2 Var[\hat{\pi}_i \mid \beta_{j[i]}] = \frac{1}{n_i\pi_i(1-\pi_i)}.

This is the conditional residual variance.
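A small simulation check of this delta-method approximation (\pi = 0.3 and n_i = 50 are arbitrary values chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
pi, n = 0.3, 50
reps = 200_000

x = rng.binomial(n, pi, reps)
pi_hat = np.clip(x / n, 1e-6, 1 - 1e-6)   # clip to keep the logit finite
logit_hat = np.log(pi_hat / (1 - pi_hat))

print("simulated Var[logit(pi_hat) | beta]:", np.var(logit_hat))
print("delta-method approximation        :", 1.0 / (n * pi * (1 - pi)))
```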


To get the marginal residual variance, we need to compute the expected value of Var[ g(\hat{\pi}_i) \mid \beta_{j[i]}] with respect to the distribution of \beta.

This is non-trivial. As best I can tell, Nakagawa et al. propose multiple approaches:

  1. approximate \pi_i with the sample estimator \hat{\pi}_i, or better with its predicted value conditional on simulated \beta_{j[i]}, and take the average across observations (a sketch of this follows the list below):
    \hat{\sigma}^2_{\epsilon} = \frac{1}{N} \sum_i \frac{1}{n_i\hat{\pi}_i(1-\hat{\pi}_i)}

  2. replace \pi_i with \bar{\pi}, the expectation of \pi_i with respect to \beta:
    \bar{\pi} = E_\beta[\pi_i]

With this second simplification, the objective then becomes to get an unbiased estimate of \bar{\pi}.
– One option is the sample average, though Nakagawa et al. caution against this because taking the average of the observed conditional probabilities is not equivalent to marginalising over the distribution of \beta.
– Another intuitive option is logit^{-1}(\alpha), given \alpha is the global mean on the logit scale. But unfortunately this is equivalent to logit^{-1}(E[\mu_i]), not E[logit^{-1}(\mu_i)], and due to Jensen’s inequality, those two quantities are not the same^1.
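Here is a minimal sketch of option 1, computed separately for each posterior draw; pi_draws and n_trials are hypothetical placeholders for the fitted conditional probabilities and the per-observation trial counts:

```python
import numpy as np

# Hypothetical inputs:
#   pi_draws : (S, N) posterior draws of the fitted conditional probabilities pi_i
#   n_trials : (N,) number of Bernoulli trials per observation
rng = np.random.default_rng(6)
S, N = 4000, 150
pi_draws = rng.beta(2.0, 3.0, (S, N))          # placeholder for real posterior draws
n_trials = np.full(N, 25)

# Option 1: average the observation-level delta-method variances across i,
# separately for every posterior draw s.
sigma2_eps_draws = np.mean(1.0 / (n_trials * pi_draws * (1.0 - pi_draws)), axis=1)
print(np.percentile(sigma2_eps_draws, [5, 50, 95]))
```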

Nakagawa et al. propose a second-order Taylor expansion of logit^{-1}(\mu) around the expected value of \mu, which can be used to correct for this bias:

g(x) \approx g(E[x]) + g'(E[x])(x-E[x]) + \frac{1}{2} g''(E[x])(x-E[x])^2
E[g(x)] \approx g(E[x]) + \frac{1}{2} Var[x]\, g''(E[x])

Taking here the expectation with respect to \beta, we have:

E_{\beta}[\pi_i] = E_{\beta}[logit^{-1}(\mu_i)] \approx \\ \frac{\exp(E[\mu_i])}{1+\exp(E[\mu_i])} + \frac{1}{2} Var[\mu_i] \frac{\exp(E[\mu_i])\left(1-\exp(E[\mu_i])\right)}{(1+\exp(E[\mu_i]))^3} = \\ \frac{\exp(\alpha)}{1+\exp(\alpha)} + \frac{1}{2} \sigma_\beta^2 \frac{\exp(\alpha)\left(1-\exp(\alpha)\right)}{(1+\exp(\alpha))^3}

Here we can then plug in the estimates of \alpha and \sigma_\beta to get an approximately unbiased estimate of \bar{\pi}, and in turn plug that into the variance formula obtained from the first delta method to get the observation-level, logit-scale residual variance.
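A sketch of this correction, with a quick Monte Carlo comparison; the values of \alpha, \sigma_\beta and n_i are arbitrary, and the function name is my own:

```python
import numpy as np

def ilogit(x):
    return 1.0 / (1.0 + np.exp(-x))

def pi_bar_second_order(alpha, sigma2_beta):
    """Second-order Taylor approximation to E_beta[ilogit(alpha + beta)],
    following the expansion above (alpha, sigma2_beta may be posterior draws)."""
    e = np.exp(alpha)
    return e / (1 + e) + 0.5 * sigma2_beta * e * (1 - e) / (1 + e) ** 3

# Quick numerical comparison against a direct Monte Carlo expectation
# (the two will agree only approximately, since this is a truncated expansion):
rng = np.random.default_rng(7)
alpha, sigma_beta = -0.5, 1.0
mc = np.mean(ilogit(alpha + rng.normal(0, sigma_beta, 1_000_000)))
approx = pi_bar_second_order(alpha, sigma_beta**2)
print("Monte Carlo E[pi]:", mc, " second-order approx:", approx)

# Plugging pi_bar into the first delta-method formula gives the
# observation-level, logit-scale residual variance:
n_i = 25
print("sigma2_eps:", 1.0 / (n_i * approx * (1 - approx)))
```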


So what should a Bayesian do? The second delta-method formula seems fairly inconsequential from a Bayesian perspective: we have the luxury of sampling from the posterior of \pi_i, conditional on the posterior draws of \beta, so the first delta-method formula is where I would stop, and I would simply simulate the posterior of \sigma^2_{\epsilon} = \frac{1}{N} \sum_i s^2_{\epsilon,i}, where s^2_{\epsilon,i} is the conditional variance described above.

Even better, in an MrP setting one could get the weighted posterior predictive distribution of \sigma^2_{\epsilon} by taking a weighted average according to a stratification frame.
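Putting the last two paragraphs together, a minimal sketch of the per-draw posterior computation of \sigma^2_{\epsilon}, with optional stratification-frame weights; all array names are hypothetical placeholders, and uniform weights reduce to the plain average over observations:

```python
import numpy as np

# Hypothetical inputs:
#   pi_draws : (S, N) posterior draws of the conditional probabilities pi_i
#   n_trials : (N,) Bernoulli trials per observation (or per poststratification cell)
#   weights  : (N,) stratification-frame weights summing to 1 (MrP-style)
rng = np.random.default_rng(8)
S, N = 4000, 150
pi_draws = rng.beta(2.0, 3.0, (S, N))           # placeholder for real posterior draws
n_trials = np.full(N, 25)
weights = np.full(N, 1.0 / N)                   # uniform weights reduce to the plain mean

cond_var = 1.0 / (n_trials * pi_draws * (1.0 - pi_draws))   # s^2_{eps,i} per draw
sigma2_eps_draws = cond_var @ weights                       # weighted average over i, per draw
print(np.percentile(sigma2_eps_draws, [5, 50, 95]))
```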

Apologies for any typos / mistakes etc. , if I got anything wrong would love to hear.


^1 It does seem strange to worry about Jensen’s inequality only here, when E_{\beta}[Var(logit(\hat{\pi}_i) \mid \beta_{j[i]})] = E_{\beta}\left[\frac{1}{n_i\pi_i(1-\pi_i)}\right] suffers from a similar pathology if we just substitute E_\beta[\pi_i] in for \pi_i.
