Modeled residual variance for R-squared

What is the correct way to calculate residual variance for use in the R-squared value suggested in Gelman et al. 2019?

If I understand correctly, the modeled residual variance, i.e. the variance measure suggested by Gelman et al., is sigma^2 for a simple linear regression with a normal likelihood:

model {
  y ~ normal(mu, sigma);
}

I am aware of this helpful appendix, but still had some uncertainty about the correct calculation because the document uses the stan_glm function and does not show the full model code used under the hood.

Part of my uncertainty comes from not fully understanding why the variance of the residuals calculated from the predictions and the modeled residual variance are not the same.

I guess this is a question for @andrewgelman. I’d suggest first reading the Wikipedia page:

The key formula for the general form is

R^2 = 1 - \frac{\displaystyle \textrm{VAR}^{\textrm{residual}}}{\displaystyle \textrm{VAR}^{\textrm{total}}}.

The residual variance is the conditional variance given the model, whereas the total variance is just the variance of the underlying random variable. These are usually both defined as sample statistics.
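As a toy numeric illustration of that formula (made-up numbers and plain Python rather than R, just to show the arithmetic):

```python
# Toy sketch of the classical R^2 = 1 - VAR_residual / VAR_total.
# The data and predictions below are made up for illustration only.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.1]

def var(xs):
    # sample variance: sum of squared deviations from the mean, divided by N-1
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

residuals = [yi - pi for yi, pi in zip(y, y_pred)]
r_squared = 1 - var(residuals) / var(y)
print(r_squared)  # close to 1 because the predictions track y well
```

Note that the residual variance here is a sample statistic computed from realized residuals, which is exactly the quantity the Bayesian formulation replaces with a model-based term.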

I posted some code to do this just yesterday in response to one of Andrew's blog posts on this topic, with some R code using glm() for the calculations.


The mathematical formula is given in equation (3) of that paper. In this expression, s represents one of the S posterior simulation draws. So expression (3) is calculated S times to yield a posterior distribution of R-squared.

That is, from our perspective, R-squared is a function of the true parameters, and so it has a posterior distribution. To say it again, we compute R-squared separately for each posterior simulation draw; it is a function of the data and the model parameters.

Equation (3) in the paper involves two terms:
(a) V_{n=1}^N y_n^{pred s}
(b) var_{res}^s.
To explain each piece here:

  • V_{n=1}^N is the sample variance function (the sum of the squared differences from the sample mean, divided by N-1).
  • The subscript n represents the data index (n=1,…,N, with N data points).
  • As discussed above, the superscript s represents the posterior simulation draw (s=1,…,S, with S draws).
  • y_n^{pred s} is defined just above equation (3); it is the posterior prediction, that is, the expected value of a new data point with predictors X_n given the parameter vector theta^s.
  • var_{res}^s is defined by taking the formula for var_{res} below equation (2) in the paper and inserting theta^s for theta in that expression. Also, \tilde{y}_n represents the predicted value of y given X_n and theta.
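Putting those pieces together, here is a minimal sketch of equation (3) for a normal linear regression, where var_{res}^s reduces to (sigma^s)^2. The posterior draws below are made up purely for illustration (plain Python rather than R):

```python
# Sketch of Bayesian R^2, equation (3) of Gelman et al. 2019:
#   R^2_s = V_{n=1}^N y_n^{pred s} / (V_{n=1}^N y_n^{pred s} + var_{res}^s),
# computed once per posterior draw s. For a normal likelihood,
# var_{res}^s = (sigma^s)^2. All numbers below are hypothetical.

def var(xs):
    # sample variance with the N-1 denominator
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# hypothetical posterior draws: each entry is (y^pred for N=4 data points, sigma)
draws = [
    ([1.0, 2.0, 3.0, 4.0], 0.5),
    ([1.2, 1.8, 3.1, 3.9], 0.6),
]

r2 = []
for y_pred, sigma in draws:
    var_fit = var(y_pred)   # V_{n=1}^N y_n^{pred s}
    var_res = sigma ** 2    # var_{res}^s for a normal likelihood
    r2.append(var_fit / (var_fit + var_res))

print(r2)  # posterior distribution of R^2: one value per draw
```

With S real draws instead of two made-up ones, r2 is the posterior distribution of R-squared, which you can summarize with a median and an uncertainty interval.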

A nice feature of this way of thinking is that you can extend it to partial R^2-like quantities which are indexes of variable importance. See this for a relative explained variation measure that admits uncertainty intervals to display the difficulty of the task of picking ‘winners’.


Thanks everyone for the replies.

To go back to my original question: is var_res the same as sigma^2 for a simple linear regression?

model {
  y ~ normal(mu, sigma);
}

