 # Calculating R-squared for predicted values

I need to compare the performance of my Bayesian model against some others built by different methods. One of the metrics I hope to use is R-squared. Libraries like sklearn allow you to calculate R-squared from the actual and predicted values of a hold-out set (see the sklearn R-squared docs).

My model was fit using brms. I know brms has the bayes_R2() function to calculate the Bayesian R2 using the fitted residuals, but is there an equivalent that accepts the actual and predicted values of a second, hold-out set as input?

I found Vehtari et al.'s notebook on calculating Bayesian R2 (https://avehtari.github.io/bayes_R2/bayes_R2.html) helpful, and adapted their method to use actual and predicted values from a hold-out set:

```r
bayes_R2_res <- function(mod) {
  # method adapted from https://avehtari.github.io/bayes_R2/bayes_R2.html
  y <- mod$datahold$avg_sales  # the target in the hold-out set
  # posterior predictions for the hold-out set: one row per draw, one column per observation
  ypred <- predict(mod, mod$datahold, allow_new_levels = TRUE, summary = FALSE)

  e <- -1 * sweep(ypred, 2, y)       # residuals y - ypred, within each draw
  var_ypred <- apply(ypred, 1, var)  # predictive variance per draw
  var_e <- apply(e, 1, var)          # residual variance per draw
  r2 <- var_ypred / (var_ypred + var_e)
  return(as.data.frame(r2))
}
```
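As a sanity check on the `sweep()` step (easy to get transposed in R), here is a tiny sketch with made-up numbers, verifying that `-1 * sweep(ypred, 2, y)` really is `y - ypred` applied to each row (each posterior draw):

```r
ypred <- matrix(1:6, nrow = 2)  # 2 draws x 3 observations
y <- c(10, 20, 30)              # 3 hold-out targets

e <- -1 * sweep(ypred, 2, y)    # subtract y from each row, then negate

# row j of e equals y - ypred[j, ]
e[1, ]  # y - ypred[1, ]
```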

For reference, this produces a median R2 of ~0.38.

However, when I compare this to the more typical R2, I get a vastly different result:

```r
# take the posterior-summary (Estimate) column for simplicity's sake
y.predicted <- predict(mod, mod$datahold, allow_new_levels = TRUE)[, 1]
y.actual <- mod$datahold$avg_sales
errors <- y.actual - y.predicted
1 - sum(errors^2) / sum((y.actual - mean(y.actual))^2)
```

This returns a value of -79.26 (?!)

Taking the square of the correlation coefficient also gives a very different estimate:

```r
cor(y.actual, y.predicted)^2  # returns 0.0004
```
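For what it's worth, the combination of a large negative classical R2 and a near-zero squared correlation is a recognizable signature: it's what you get when the predictions are both roughly uncorrelated with the target and sitting at the wrong location or scale. A minimal base-R sketch with simulated numbers (not your data):

```r
set.seed(1)
y      <- rnorm(100)            # "actual" values
y_pred <- rnorm(100, mean = 5)  # predictions: no correlation, wrong location

r2_classic <- 1 - sum((y - y_pred)^2) / sum((y - mean(y))^2)
r2_cor     <- cor(y, y_pred)^2

r2_classic  # strongly negative: squared errors dwarf the total sum of squares
r2_cor      # near zero
```

So a first thing worth checking is whether `y.predicted` and `y.actual` are actually on the same scale (e.g. neither is inadvertently transformed or standardized).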

I thought I understood R-squared as a basic metric. I must be making a simple error somewhere, but I feel like I'm taking crazy pills!

I'd appreciate it if anyone can either a) tell me where I'm going wrong, or b) point me to a function that calculates R-squared from actual and predicted values, or from a brms model object and a new data set.
Thanks


I don't use R or brms, but one thing to keep in mind that may be causing an issue: R^2 is normally computed for a point estimate (the least-squares fit, or any other point estimate, even if that wasn't your fit criterion). In a Bayesian setting you could instead perform that calculation for each MCMC sample and get a distribution of R^2 values. I'm not sure whether the expected value of that distribution is a meaningful metric equivalent to what Vehtari et al. propose, or whether their metric is an analogous but sounder version of it.

The main point is: you may compute R^2 for a MAP estimate, but that disregards the rest of the distribution, so there may be no exact equivalent that uses the entire posterior.
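The per-sample idea above can be sketched in base R. This assumes a draws-by-observations prediction matrix like the one `predict(..., summary = FALSE)` returns in brms; the data here are simulated for illustration:

```r
set.seed(2)
n_draws <- 1000
n_obs   <- 50
y <- rnorm(n_obs)  # hold-out targets

# each row is one posterior draw's predictions for the hold-out set
ypred <- matrix(rep(y, each = n_draws) + rnorm(n_draws * n_obs, sd = 0.5),
                nrow = n_draws)

sst          <- sum((y - mean(y))^2)
sse_per_draw <- rowSums(sweep(ypred, 2, y)^2)  # squared errors within each draw
r2_per_draw  <- 1 - sse_per_draw / sst         # one classical R^2 per draw

quantile(r2_per_draw, c(0.05, 0.5, 0.95))      # a distribution, not a single number
```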

It’s still not clear to me how you’d get this large negative value, but it may be from doing calculations that average over squared errors from different samples that shouldn’t go together (but I’m not familiar with the R/brms objects, so I can’t be sure about whether that is actually happening).

Thanks for your reply. I had thought to take a couple thousand posterior predictions of the new data and calculate R-squared from those, but ultimately that doesn't address why the R-squared calculated the 'traditional' way for a single sample is so negative when the others are not. I would still have expected the results to be within the same ballpark.

It should, and unless there is a gross lack-of-fit issue, it will. That number you are getting has to be a mistake; I'm just not sure what kind.
You should look at the fit for the best point estimate (MAP): that must have an R^2 close to whatever frequentist version you are comparing it to (except for the effect of the priors, if they're not flat). For other individual samples you should get something that is still close to your "reference" value, but less so the lower the likelihood of the sample (if the likelihood is Gaussian it's actually the same thing, as you probably know).

With a couple of thousand samples you still may be squaring differences across samples that shouldn't be combined. I haven't looked closer, but that's my best guess for why you are getting some nonsensical numbers.
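One concrete reason the two numbers can diverge so wildly, independent of any indexing bug: the residual-variance definition `var(ypred) / (var(ypred) + var(e))` is bounded in (0, 1) by construction and, because it uses the *variance* of the residuals, it is blind to a constant bias in the predictions, while `1 - SSE/SST` is unbounded below and punishes that bias heavily. A base-R sketch on simulated values (not the original data):

```r
set.seed(3)
y <- rnorm(100)

# predictions that track the shape of y but sit at the wrong location
ypred <- y + rnorm(100, sd = 1) + 10

e <- y - ypred
r2_variance <- var(ypred) / (var(ypred) + var(e))   # Vehtari-style: always in (0, 1)
r2_classic  <- 1 - sum(e^2) / sum((y - mean(y))^2)  # classical: unbounded below

r2_variance  # moderate, the +10 offset cancels out of the variances
r2_classic   # hugely negative, dominated by the offset
```

So a moderate median from the variance-based function alongside a hugely negative classical R^2 on the same hold-out set is consistent with systematically mislocated predictions.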