Workflow: Root Mean Squared Error


I am curious about how other stan user’s workflow as it relates to predictive analysis. As a simple example lets say that you want to calculate the root mean squared error of predicted values vs actual from a multiple regression equation.

For reproducibility lets say that I have the following fake data in R:


n <- 1000
x1 <- rbinom(n,1,.7)
x2 <- runif(n,1,50)
x3 <- rnorm(n, 20, 10)

a <- 10
b1 <- .5
b2 <- .25
b3 <- 2

e <- rnorm(n, 0, .5^2)
y <- a + b1*x1 + b2*x2 + b3*x3 + e
x <-,x2=x2,x3=x3)) %>% as.matrix()

train_x <- x[1:700,]
test_x <- x[700:1000,]
train_y <- y[1:700]
test_y <- y[700:1000]

# multiple regression
fit <- stan(file = "lm_pred.stan", data = list(N=length(train_y),
                                               N_new = length(test_y),
                                               X_new = test_x,
                                               Y_new = test_y))


my lm_pred.stan code is :

  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  vector[N] Y;
  int<lower=0> N_new;
  matrix[N_new, K] X_new;

  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
  vector[N_new] Y_new;

  Y ~ normal(alpha + X * beta, sigma);
  Y_new ~ normal(alpha + X_new * beta, sigma);

This leaves me with predictions in the console. I am used to using either the predict() function or tidyverse tools. I am just curious how other stan users would calculate RMSE.


Both rstanarm and brms have predict functions that take a newdata argument. For things as simple as your example, you should be using that. If you were to write this in Stan code, Y_new should not be declared in parameters, but rather in generated quantities, like this

generated quantities {
  vector[N_new] Y_new;
    vector[N_new] mu_new = alpha + X_new * beta;
    for (n in 1:N_new) Y_new[n] = normal_rng(mu_new[n], sigma);


Thanks, I like using brms and I understand that this would normally be a model that I would fit with brms, but I am practicing stan so I can fit more realistic models in the future.

Is it possible to fit a model in stan then use predict from brms on new data? If so, how would I get a stanfit object from my code above with the purpose of passing it to brms::predict?




Okay, so for models written in stan how do most people calculated root mean squared error for predicted values?


Skipping sections of the Stan model you already have, something like this:

data {
  // Observed Y values at the X_new values, not used in model estimation
  vector [N_new] Y_test; 
generated quantities {
  vector[N_new] Y_new;
  vector[N_new] indiv_squared_errors;
  real <lower = 0> sum_of_squares;
  real <lower = 0> root_mean_squared_error;

    vector[N_new] mu_new = alpha + X_new * beta;
    for (n in 1:N_new)  {
      Y_new[n] = normal_rng(mu_new[n], sigma);
      indiv_squared_errors[n] = (Y_test[n] - Y_new[n])^2;
  // sum of squares, data set size and scale specific
  sum_of_squares = sum(indiv_squared_errors);

  // divide by number of new / test points and sqrt for RMSE
  root_mean_squared_error = sqrt(sum_of_squares / N_new);


The usually condition on all the data and calculate the ELPD with loo::loo.