BRMS - null model prediction for hold out sample

I did some control analyses, after getting some good results on my method of interest: I’ve fit a control model (only intercept) of the format (rows ~ 600)

fit.null = brm(y ~ 1,data=df,chains=4,iter=1000,

When I then apply this model to unseen data:

predicted=,newdata=, summary=TRUE)
my.prediction = predicted$Estimate

To test how good the prediction is, my plan was (for my model of interest) to correlate the predicted and the actual values in the new data.


What I’m noticing now is that for some models, sometimes this is pretty significant (probably because n=300 in the second sample) - (I wasn’t going to apply any corrections to those p-values). 1) That shouldn’t be the case (if I had fit a non-Bayesian model, the prediction should have been EXACTLY the same for everyone). 2) That makes me worried about what I’m concluding for my models of interest.

I’m wondering what causes this and how I can solve this? I’m considering whether getting more samples would help? For now, I’ve compared my model of interest’s predictions to the prediction of the null model (with out-of-sample-R2) and a permutation test on that difference, but I was hoping to ‘simplify’ and thought just a correlation would be the easiest for others to understand.
Edit: I forgot that I have a prior on the intercept (added now). My regressors are also z-score normalised. I guess I could make the prior on the intercept tighter?

I am not sure about this method of looking at predictions by correlation tests and test statistics like p-values, but to answer your question, I think sometimes you get low p-values simply due to the size of the sample you are predicting and the random nature of generating predictions using posterior samples and a random number generator.
If I follow correctly what you are doing, it is something like the below code. The simulations show that you get slightly different mean point predictions each time you predict, and for a sample this size, sometimes you get p < 0.05 and sometimes not, for some small correlation.
I personally would not evaluate model predictions in this way (certainly not using p-values), but perhaps someone else will chime in and correct me if I am mistaken.

#training data
x <- rnorm(600, 0, 1)
mu <- 5 + 3*x
y <- rnorm(600, mu, 1)
d <-,x)

#fit model
fit.null <- brm(y ~ 1, data=d, chains=4, iter=1000, cores=4)

#new data
x <- rnorm(300, 0, 1)
mu <- 5 + 3*x
y <- rnorm(300, mu, 1)
nd <-,x)

#using predict()
p_value <- rep(0,20)
for(i in 1:20){
predicted <- data.frame(predict(fit.null, newdata=nd, summary=TRUE))
my.prediction <- predicted$Estimate
ct <- cor.test(my.prediction, nd$y)
p_value[i] <- ct$p.value
print(p_value[p_value < 0.05])

#manual using samples and rnorm
s1 <- as_draws_df(fit.null)
p_value <- rep(0,100)
for(i in 1:100){
mu.m <- s1$b_Intercept 
sd.m <- s1$sigma
preds <- rnorm(300, mu.m, sd.m)
ct <- cor.test(preds, nd$y)
p_value[i] <- ct$p.value
print(p_value[p_value < 0.05])


Just a note that brms comes directly integrated with state-of-the-art tools for evaluating the predictive performance of a model on held-out data (approximate leave-one-out cross validation via loo.brmsfit), as well as for evaluating something like the strength of correlation captured by the model (bayes_R2.brmsfit), which in turn should be related to the difference in the size of the residual standard deviation in your model versus the null model.

1 Like

Thanks both!! … sigh, I had just removed the code [without even properly looking at the results because I hadn’t noticed the correlation problem yet…] that used loo because when showing it to colleagues, we thought the correlations were easier to understand (because a very familiar tool)… but you are right, I need to at least also report this given that it’s more appropriate (and then highlight when the two methods don’t agree).

@jd_c yes that’s what I did and the problem exactly as you’re saying.

@jsocolar I’m looking at the documentation for bayes_R2 - I think it is not possible to do ‘out-of-sample R2’ (i.e. on my hold-out data)? I’ve previously computed this as 1- (mse.full/mse.simple) using the point-wise predictions errors (mean squared error) from the full and simple/control models on the hold-out data.

I’ve tried implementing loo now, but I think I must be doing something wrong.

I’ve made some data frames to show (ok, some problems uploading, I’ll try upload on google drive and link to it).

This is for a binary model. I get 66% prediction accuracy for my ‘model of interest’ and 51% for my control model - in hold out data. Yet, when I try to get an estimate of whether my model of interest is better using loo, I find that the simpler model is preferred - surely that shouldn’t be the case given the difference in prediction accuracy?

# fit the models
model.full = brm(formula= 'y~1+x1+x2+x3+x4+x5+x6+x7+x8+x9+x10+x11+x12+x13+x14+x15+x16+x17+x18',
model.simple = brm(formula='y~1',

# get out of sample predictions
predict.full   =,newdata=df.holdout))$Estimate
predict.simple =,newdata=df.holdout))$Estimate

# print accuracy
mean(round(predict.full)== df.holdout$y)
mean(round(predict.simple)== df.holdout$y)

# using brms and loo
loo.full = loo(model.full,newdata=df.holdout)
model.weights= loo_model_weights(model.full,model.simple,newdata=df.holdout) 

Edit: I’ll paste the print outs from the code above:

> mean(round(predict.full)== df.holdout$y)
[1] 0.6567863
> mean(round(predict.simple)== df.holdout$y)
[1] 0.5070203

> loo_compare(loo.full,loo.simple)
             elpd_diff se_diff
model.simple   0.0       0.0  
model.full   -39.1      20.0  

> model.weights
Method: stacking
model.full   0.471 
model.simple 0.529 

Here is the link to some example data:

1 Like

Couple of things–sorry for a somewhat scattered post.

There’s nothing wrong conceptually with evaluating r-squared on new data, and brms supports this, though it’s not immediately obvious from looking at the doc for bayes_R2. The newdata argument is understood via the ... which is passed to posterior_epred. For example

f1 <- brm(Sepal.Length ~ Petal.Length, data = iris, cores = 4, backend = "cmdstanr")
nd <- data.frame(Sepal.Length = rnorm(20), Petal.Length = rnorm(20))
bayes_R2(f1, newdata = nd)

I’m not sure it’s meaningful to talk about disagreement between the two methods because I’m not sure the correlation approach is answering a relevant question. We know by construction that the actual population level correlation between the null model predictions and the response is zero because the predictions are iid. Anything else you see is sampling variation.

This isn’t impossible or even necessarily surprising. Your approach discretizes the predictions and asks whether they match the data in a yes/no sense. When loo talks about the predictive accuracy, it is talking about the match between the entire posterior predictive distribution and the data. Suppose, for example, that the held-out data are 0, 1, 1. And based on the fitted model I make predictions 0, 0, 1 based on underlying fitted probabilities of 0.3, 0.0001, 0.9. The check-if-predictions-match approach says “two-out-of-three; not bad!”. The loo approach says “gah! we observed a 1 when the fitted probability was 0.0001. Big problem!” The loo approach would much prefer to see predictions of 1, 0, 0 based on fitted probabilities of 0.6, 0.4, 0.4 (zero out of three correct).

1 Like

In addition to what @jsocolar mentioned about LOO, I wanted to add some things, and maybe he can comment if he has used LOO in the way you describe. I haven’t used PSIS-LOO like this on the holdout data. I have used PSIS-LOO as a tool for model evaluation on the data to which the model was fit, not holdout data. To quote Aki Vehtari in his super helfpul CV FAQ, “Cross-validation is a family of techniques that try to estimate how well a model would predict previously unseen data by using fits of the model to a subset of the data to predict the rest of the data.” So, LOO provides a way to estimate how well the model makes predictions, when all you have is the data that you fit your model to. When you have holdout data, then you could use holdout validation as described here
Of course there are other metrics, of which accuracy is one. You might be interested in this vignette Bayesian Logistic Regression with rstanarm

Here is something that you may want to think about when considering your accuracy estimate - you are rounding your prediction, when your prediction from predict() is actually the mean of a collection of samples that are either 0 or 1. The predict() function returns 0 or 1 for each of the number of samples from your model, with probability of inverse logit of the linear predictors part of the model. The Estimate in brms predict() is the mean of this collection of 0 or 1 samples. But take a look at the Est.Error and you will see how sure the model really is of this Estimate. It is easier to think about this when using fitted() which returns the point estimate (mean) of the probability distribution for that prediction. In the case of logistic regression, your full expectation for a single row of new data is technically a collection of samples that are sampled from the probability distribution of the expectation, where the expectation is the probability that y=1. For a concrete example, imagine that your y is binary 0 or 1, where 0=susceptible and 1=resistant, for a model where you have some genes as predictors and you are trying to predict antibiotic resistance. You fit your model with 4000 total post-warmup samples. Now you have a single row of new data that you feed into the model and use fitted() for y. What comes out is 4000 samples that are probabilities for the new data being y=1. Let’s say you summarize those and get a mean for those samples of 0.55 with 95% quantiles of 0.25 and 0.85. The 0.55 number is the Estimate that brms gives when summary=TRUE (the default) and is close to the same number from Estimate from predict() that you are rounding in your prediction in your code above (if you want to see all the samples yourself, just use fitted(fit, summary=FALSE) ). The 0.55 point prediction would round up to 1, so you would predict 1, or resistant, but in reality, the model really doesn’t know much - it’s more or less a coin toss, but it doesn’t even know if it is a coin toss either, as the probability is so spread out! It just doesn’t know. If you got something like 0.9 (0.85 - 0.95), then the model is reasonably sure that it is resistant. The point is, that the point estimates may not really be a good reflection of what the model thinks the prediction really is. Depending on the application, it could be more helpful to use fitted to get the expectation and the variability in that. See the example code below to see examples of all of this.

x <- rnorm(600, 0, 1)
a <- -1
b <- 0.1
p <- plogis(a + b*x)
y <- rbinom(600, 1, p)
d <-,x)

m <- brm(y ~ 1 + x, family=bernoulli, data=d, cores=4, iter=2000, warmup=1000)

x <- rnorm(1, 0, 1)
a <- -1
b <- 0.1
p <- plogis(a + b*x)
y <- rbinom(1, 1, p)
nd <-,x)

f <- fitted(m, newdata=nd, summary=FALSE)
quantile(f, c(0.025, 0.975))
fitted(m, newdata=nd, summary=T)
hist(f, breaks=100)

pr <- predict(m, newdata=nd, summary=FALSE)
quantile(pr, c(0.025, 0.975))
predict(m, newdata=nd, summary=T)
hist(pr, breaks=100)

I point all this out, because depending on your application for making predictions for new data, it might be more helpful to not discard the uncertainty in your predictions. Continuing with my example of predicting antibiotic resistance, if your model is ‘very sure’, say maybe something like 0.90 (0.85 - 0.95) that the new row of data should be y=1 resistant, but it turns out in the lab that the sample is actually susceptible, then the observation that your model is surprisingly wrong can inform your model building (maybe your model is just really bad at prediction, or maybe you are missing some genes or something else in the model) or inform you as the scientist of something potentially unusual (if your model is generally and historically pretty good).

Thanks for pointing me this direction, I think it turns out that my loo call on the BRNS object with new data does something different than I thought.

  1. If I compute the absolute error or log likelihood (instead of % accuracy), the pattern (full model being better) is still as I think it should be:
> mean(abs(predict.full-df.holdout$y))
[1] 0.4013292
> mean(abs(predict.simple - df.holdout$y))
[1] 0.4998912


> log.lik.full

Computed from 1 by 641 log-likelihood matrix using the generic elpd function
     Estimate   SE
elpd   -438.6 18.4
ic      877.2 36.8

> log.lik.simple

Computed from 1 by 641 log-likelihood matrix using the generic elpd function

     Estimate  SE
elpd   -444.8 1.4
ic      889.6 2.8

And then if I compare them I get:

> loo_compare(log.lik.full,log.lik.simple)
       elpd_diff se_diff
model1  0.0       0.0   
model2 -6.2      18.3   

So, somehow, this is different rom using ‘loo’ together with ‘newdata’.

loo.full = loo(model.full,newdata=df.holdout)


> loo.full

Computed from 4000 by 641 log-likelihood matrix

         Estimate   SE
elpd_loo   -486.9 20.1
p_loo        56.3  4.0
looic       973.8 40.2
Monte Carlo SE of elpd_loo is 0.1.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     640   99.8%   315       
 (0.5, 0.7]   (ok)         1    0.2%   5594      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
> loo.simple

Computed from 4000 by 641 log-likelihood matrix

         Estimate  SE
elpd_loo   -446.6 1.4
p_loo         1.8 0.0
looic       893.2 2.8

I’m not sure now what loo combined with newdata is doing…?

The former is holdout validation and the latter is LOO-CV on data the model hasn’t seen. I think you want to use LOO for your training data and holdout validation for you test data.

1 Like

that’s what I thought, but shouldn’t then the values I get from loo(my.model,
be the same?

I think loo combined with newdata is trying to to imagine that it had somehow already seen the new data, and then to use PSIS to rejigger to posterior to remove this (in actuality nonexistent) influence of each point of the new data. If that’s right, then it does not seem like something that is useful in this context. Maybe @avehtari can chime in about what loo is doing when it sees a newdata argument.

When using loo() or log_lik() function with brms object, what happens is defined in brms package, so @paul.buerkner can confirm, but I guess that with newdata log likelihoods are computed using new data, and loo() is then computing what would happen if given each new data observation we would remove the corresponding amount of likelihood contribution from the original total log posterior.


yes exactly.


thanks everyone!