Calculating R-squared for predicted values

I need to compare the performance of my Bayesian model against some others built by different methods. One of the metrics I hope to use is R-squared. Libraries like sklearn allow you to calculate the R-squared from the actual and predicted values of a hold-out set (see the Sklearn R-Squared Docs).

My model was fit using brms. I know brms has the bayes_R2() function to calculate the Bayesian R2 using the fitted values and residuals, but is there an equivalent that takes the actual and predicted values of a separate hold-out set as input?

I found Vehtari et al.'s notebook on calculating Bayesian R2 (https://avehtari.github.io/bayes_R2/bayes_R2.html) helpful, and I adapted their method to use the actual and predicted values from a hold-out set:

bayes_R2_res <- function(mod) {
  # method adapted from https://avehtari.github.io/bayes_R2/bayes_R2.html
  y <- mod$datahold$avg_sales  # the target in the hold-out set
  # posterior predictive draws for the hold-out set (one row per draw, one column per observation)
  ypred <- predict(mod, newdata = mod$datahold, allow_new_levels = TRUE, summary = FALSE)
  
  e <- -1 * sweep(ypred, 2, y)       # residuals y - ypred, per draw
  var_ypred <- apply(ypred, 1, var)  # variance of the predictions, per draw
  var_e <- apply(e, 1, var)          # variance of the residuals, per draw
  r2 <- var_ypred / (var_ypred + var_e)
  return(as.data.frame(r2))
}
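As a rough usage sketch (assuming mod is the brmsfit and the hold-out data frame is stored in mod$datahold, as in the function above), the per-draw R2 values can then be summarised like this:

# sketch: summarise the distribution of per-draw R2 values
r2_draws <- bayes_R2_res(mod)[[1]]                # one R2 value per posterior draw
quantile(r2_draws, probs = c(0.025, 0.5, 0.975))  # median and 95% interval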

For reference, this produces a median R2 of ~0.38.

However, when I compare this to the more typical R2, I get a vastly different result:

y.predicted <- predict(mod, newdata = mod$datahold, allow_new_levels = TRUE)[, 1]  # the "Estimate" summary column, for simplicity's sake
y.actual <- mod$datahold$avg_sales
errors <- y.actual - y.predicted
1 - sum(errors^2) / sum((y.actual - mean(y.actual))^2)

This returns a value of -79.26 (?!)

Taking the square of the correlation coefficient also gives a very different estimate:

cor(y.actual, y.predicted)^2  # returns 0.0004

I thought I understood R-squared as a basic metric. I must be making a simple error somewhere, but I feel like I'm taking crazy pills!

I'd appreciate it if anyone can either a) tell me where I'm going wrong, or b) point me to a function that will allow me to calculate R-squared from actual and predicted values, or from a brms model object and a new data set.
Thanks


I don't use R or brms, but one thing to keep in mind that may be causing an issue is that R^2 is usually computed for a least-squares point estimate (or for any point estimate, even if that is not your fit criterion). In a Bayesian setting you could do that calculation for each MCMC sample and you'd get a distribution of R^2 values; I'm not sure whether taking the expected value of that distribution is the meaningful metric here and whether it's equivalent to what Vehtari et al. propose, or whether their metric is an analogous but sounder version of it.

The main point is: you may compute R^2 for a MAP estimate, but that disregards the rest of the distribution, so it may not be possible to get an actual equivalent using the entire posterior.

It's still not clear to me how you'd get this large negative value, but it may come from calculations that average squared errors across samples that shouldn't go together (but I'm not familiar with the R/brms objects, so I can't be sure whether that is actually happening).

Thanks for your reply. I had thought to take a couple thousand posterior predictions of the new data and calculate R-squared from each of those, but ultimately that doesn't address why the R-squared calculated the 'traditional' way for a single sample is so negative when the others are not. I would still have expected the results to be within the same ballpark.
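(For what it's worth, this is roughly the per-draw calculation I had in mind; just a sketch that reuses the object names from my earlier code, not something I'm claiming is the standard way to do it.)

# sketch: 'traditional' R2 computed separately for each posterior predictive draw
y <- mod$datahold$avg_sales
ypred <- predict(mod, newdata = mod$datahold, allow_new_levels = TRUE, summary = FALSE)
ss_tot <- sum((y - mean(y))^2)
r2_per_draw <- apply(ypred, 1, function(yp) 1 - sum((y - yp)^2) / ss_tot)
median(r2_per_draw)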

It should, and unless there is a gross lack-of-fit issue, it will. That number you are getting has to be a mistake; I'm just not sure what kind.
You should look at the fit for the best point estimate (MAP), which must have an R^2 close to whatever frequentist version you are comparing it to (except for the effect of the priors, if they're not flat). For other individual samples you should get something that is still close to your "reference" value, but less so the lower the likelihood of the sample (if the likelihood is Gaussian it's actually the same thing, as you probably know).

With a couple of thousand samples you may still be squaring differences across samples that shouldn't be combined; I haven't looked more closely, but that's my best guess for why you are getting some nonsensical numbers.