# Calculating R-squared for predicted values

I need to compare the performance of my Bayesian model against others built by different methods. One of the metrics I hope to use is R-squared. Libraries like sklearn let you calculate R-squared from the actual and predicted values of a hold-out set (see the sklearn R-squared docs).

My model was fit using brms. I know brms has the bayes_R2() function to calculate the Bayesian R2 from the fitted residuals, but is there an equivalent that takes the actual and predicted values of a hold-out set as input?

I found Vehtari et al.'s notebook on calculating Bayesian R2 (https://avehtari.github.io/bayes_R2/bayes_R2.html), which was helpful. I adapted their method to use actual and predicted values from a hold-out set…

```r
bayes_R2_res <- function(mod) {
  # method adapted from https://avehtari.github.io/bayes_R2/bayes_R2.html
  y <- mod$datahold$avg_sales  # the target in the hold-out set
  # posterior predictions for the hold-out set: one row per draw, one column per observation
  ypred <- predict(mod, mod$datahold, allow_new_levels = TRUE, summary = FALSE)
  e <- -1 * sweep(ypred, 2, y)           # residuals y - ypred, draw by draw
  var_ypred <- apply(ypred, 1, var)      # predictive variance per draw
  var_e <- apply(e, 1, var)              # residual variance per draw
  r2 <- var_ypred / (var_ypred + var_e)  # one R2 per posterior draw
  return(as.data.frame(r2))
}
```
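To make the arithmetic concrete: for each posterior draw, this computes R2 = Var(ypred) / (Var(ypred) + Var(e)), so you get one R2 value per draw. Here is the same per-draw calculation on a made-up draws-by-observations matrix (plain Python for illustration only; the numbers are invented and nothing here is brms-specific):

```python
from statistics import variance  # sample variance, matching R's var()

y = [2.0, 4.0, 6.0, 8.0]   # hypothetical hold-out targets
ypred = [                  # three hypothetical posterior draws of predictions
    [2.1, 3.9, 6.2, 7.8],
    [1.8, 4.3, 5.9, 8.1],
    [2.4, 3.7, 6.1, 8.3],
]

r2_draws = []
for draw in ypred:
    e = [yi - pi for yi, pi in zip(y, draw)]  # residuals for this draw
    var_pred = variance(draw)                 # predictive variance
    var_e = variance(e)                       # residual variance
    r2_draws.append(var_pred / (var_pred + var_e))  # one R2 per draw
```

By construction each value lies in (0, 1), which is why summarising the draws (e.g. by the median) always yields a number in that range, unlike the traditional out-of-sample R2.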

For reference, this produces a median R2 of ~0.38.

However, when I compare this to the more typical R2, I get a vastly different result:

```r
y.predicted <- predict(mod, mod$datahold, allow_new_levels = TRUE)[, 1]  # take the summary ("Estimate") column for simplicity's sake
y.actual <- mod$datahold$avg_sales
errors <- y.actual - y.predicted
1 - sum(errors^2) / sum((y.actual - mean(y.actual))^2)
```
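As a sanity check on that formula: 1 - SSE/SST is exactly zero when you always predict the hold-out mean, and it goes strongly negative as soon as the predictions are far off in location or scale, which is what a value like -79 implies. A quick plain-Python illustration (made-up numbers, nothing brms-specific):

```python
from statistics import mean

def r_squared(actual, predicted):
    # the "traditional" out-of-sample R2: 1 - SSE/SST
    ybar = mean(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - ybar) ** 2 for a in actual)
    return 1 - sse / sst

y = [2.0, 4.0, 6.0, 8.0]                  # hypothetical hold-out targets
print(r_squared(y, [mean(y)] * len(y)))   # predicting the mean -> 0.0
print(r_squared(y, [v + 50 for v in y]))  # badly offset predictions -> -499.0
```

So, unlike the per-draw Bayesian version, this quantity is unbounded below, and a large negative value means the predictions are doing far worse than a constant mean prediction would.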

This returns a value of -79.26 (?!)

Taking the square of the correlation coefficient also gives a very different estimate:

```r
cor(y.actual, y.predicted)^2  # returns 0.0004
```
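Worth noting: squared correlation coincides with the 1 - SSE/SST definition only for an OLS fit with an intercept evaluated in-sample; on hold-out data the two can disagree arbitrarily, because correlation ignores location and scale while SSE does not. A small plain-Python sketch (made-up numbers): predictions perfectly correlated with the actuals but on the wrong scale give cor^2 = 1 yet a strongly negative 1 - SSE/SST.

```python
from statistics import mean

def r_squared(actual, predicted):
    # the same 1 - SSE/SST definition as in the snippet above
    ybar = mean(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - ybar) ** 2 for a in actual)
    return 1 - sse / sst

def corr(x, y):
    # Pearson correlation
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

y = [2.0, 4.0, 6.0, 8.0]
pred = [v * 10 for v in y]  # perfectly correlated, but on the wrong scale
print(corr(y, pred) ** 2)   # 1.0 ...
print(r_squared(y, pred))   # ... yet -485.0
```

A cor^2 near zero combined with a hugely negative 1 - SSE/SST suggests the predicted values are essentially unrelated to the actuals, not just miscalibrated.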

I thought I understood R-squared as a basic metric. I must be making a simple error somewhere, but I feel like I'm taking crazy pills!

I'd appreciate it if anyone can either a) tell me where I'm going wrong, or b) point me to a function that calculates R-squared from actual and predicted values, or from a brms model object and a new data set.
Thanks


I donāt use R or brms, but one thing to keep in mind that may be causing an issue is that R^2 is computed for your least squares point estimate (or any point
estimate, even if that is not your fit criterion). In a bayesian setting you could compute that calculation for each MCMC sample and youād get a distribution of R^2 values; not sure getting the expected value is the meaningful metric here and itās equivalent to what Vehtari et al. propose, or if their metric is an analogous but sounder version of it.

The main point is: you may compute R^2 for a MAP estimate, but that disregards the rest of the distribution, so it may not be possible to get an exact equivalent using the entire posterior.

Itās still not clear to me how youād get this large negative value, but it may be from doing calculations that average over squared errors from different samples that shouldnāt go together (but Iām not familiar with the R/brms objects, so I canāt be sure about whether that is actually happening).

Thanks for your reply. I had thought to take a couple of thousand posterior predictions of the new data and calculate R-squared from those, but ultimately that doesn't address why the R-squared calculated the "traditional" way for a single sample is so negative when the others are not. I would still have expected the results to be within the same ballpark.

It should, and unless there is a gross lack-of-fit issue, it will. The number you are getting has to be a mistake; I'm just not sure what kind.
You should look at the fit for the best point estimate (MAP), which must have an R^2 close to whatever frequentist version you are comparing it to (except for the effect of the priors, if they're not flat). For other individual samples you should get something that is still close to your "reference" value, but less so the lower the likelihood of the sample (if the likelihood is Gaussian it's actually the same thing, as you probably know).

With a couple of thousand samples you may still be squaring differences across samples that shouldn't go together; I haven't looked closer, but that's my best guess for why you are getting some nonsensical numbers.