Modeled residual variance for R-squared

What is the correct way to calculate residual variance for use in the R-squared value suggested in Gelman et al. 2019?

If I understand correctly, the modeled residual variance, i.e. the variance measure suggested by Gelman et al., is sigma^2 for a simple linear regression with a normal likelihood:

model {
  y ~ normal(mu, sigma);
}

I am aware of this helpful appendix, but still had some uncertainty about the correct calculation because the document uses the stan_glm function and does not show the full model code used under the hood.

Part of my uncertainty comes from not fully understanding why the variance of the residuals calculated from the predictions and the modeled residual variance are not the same.

I guess this is a question for @andrewgelman. I’d suggest first reading the Wikipedia page:

The key formula for the general form is

R^2 = 1 - \frac{\displaystyle \textrm{VAR}^{\textrm{residual}}}{\displaystyle \textrm{VAR}^{\textrm{total}}}.

The residual variance is the conditional variance given the model, whereas the total variance is just the variance of the underlying random variable. These are usually both defined as sample statistics.
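As a toy numeric illustration of that formula (made-up numbers and plain Python rather than R, just to show the arithmetic):

```python
# Toy sketch of the classical R^2 = 1 - VAR_residual / VAR_total.
# The data and predictions below are made up for illustration only.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.1]

def var(xs):
    # sample variance: sum of squared deviations from the mean, divided by N-1
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

residuals = [yi - pi for yi, pi in zip(y, y_pred)]
r_squared = 1 - var(residuals) / var(y)
print(r_squared)  # close to 1 because the predictions track y well
```

Note that the residual variance here is a sample statistic computed from realized residuals, which is exactly the quantity the Bayesian formulation replaces with a model-based term.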

I posted some code to do this just yesterday in response to one of Andrew's blog posts on this topic, with some R code using glm() for the calculations.


The mathematical formula is given in equation (3) of that paper. In this expression, s represents one of the S posterior simulation draws. So expression (3) is calculated S times to yield a posterior distribution of R-squared.

That is, from our perspective, R-squared is a function of the true parameters, and so it has a posterior distribution. To say it again, we compute R-squared separately for each posterior simulation draw; it is a function of the data and the model parameters.

Equation (3) in the paper involves two terms:
(a) V_{n=1}^N y_n^{pred s}
(b) var_{res}^s.
To explain each piece here:

  • V_{n=1}^N is the sample variance function (the sum of the squared differences from the sample mean, divided by N-1).
  • The subscript n represents the data index (n=1,…,N, with N data points).
  • As discussed above, the superscript s represents the posterior simulation draw (s=1,…,S, with S draws).
  • y_n^{pred s} is defined just above equation (3); it is the posterior prediction, that is, the expected value of a new data point with predictors X_n given the parameter vector theta^s.
  • var_{res}^s is defined by taking the formula for var_{res} below equation (2) in the paper and inserting theta^s for theta in that expression. Also, \tilde{y}_n represents the predicted value of y given X_n and theta.
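Putting those pieces together, here is a minimal sketch of equation (3) for a normal linear regression, where var_{res}^s reduces to (sigma^s)^2. The posterior draws below are made up purely for illustration (plain Python rather than R):

```python
# Sketch of Bayesian R^2, equation (3) of Gelman et al. 2019:
#   R^2_s = V_{n=1}^N y_n^{pred s} / (V_{n=1}^N y_n^{pred s} + var_{res}^s),
# computed once per posterior draw s. For a normal likelihood,
# var_{res}^s = (sigma^s)^2. All numbers below are hypothetical.

def var(xs):
    # sample variance with the N-1 denominator
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# hypothetical posterior draws: each entry is (y^pred for N=4 data points, sigma)
draws = [
    ([1.0, 2.0, 3.0, 4.0], 0.5),
    ([1.2, 1.8, 3.1, 3.9], 0.6),
]

r2 = []
for y_pred, sigma in draws:
    var_fit = var(y_pred)   # V_{n=1}^N y_n^{pred s}
    var_res = sigma ** 2    # var_{res}^s for a normal likelihood
    r2.append(var_fit / (var_fit + var_res))

print(r2)  # posterior distribution of R^2: one value per draw
```

With S real draws instead of two made-up ones, r2 is the posterior distribution of R-squared, which you can summarize with a median and an uncertainty interval.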

A nice feature of this way of thinking is that you can extend it to partial R^2-like quantities which are indexes of variable importance. See this for a relative explained variation measure that admits uncertainty intervals to display the difficulty of the task of picking ‘winners’.


Thanks everyone for the replies.

To go back to my original question: is var_res the same as sigma^2 for a simple linear regression?

model {
  y ~ normal(mu, sigma);
}

