Workflow: Root Mean Squared Error


#1

I am curious about how other stan user’s workflow as it relates to predictive analysis. As a simple example lets say that you want to calculate the root mean squared error of predicted values vs actual from a multiple regression equation.

For reproducibility lets say that I have the following fake data in R:

#library(tidyverse)
library(rstan)
#library(bayesplot)
#library(brms)

n <- 1000
x1 <- rbinom(n,1,.7)
x2 <- runif(n,1,50)
x3 <- rnorm(n, 20, 10)

a <- 10
b1 <- .5
b2 <- .25
b3 <- 2

e <- rnorm(n, 0, .5^2)
y <- a + b1*x1 + b2*x2 + b3*x3 + e
x <- as.data.frame(list(x1=x1,x2=x2,x3=x3)) %>% as.matrix()

train_x <- x[1:700,]
test_x <- x[700:1000,]
train_y <- y[1:700]
test_y <- y[700:1000]

# multiple regression
fit <- stan(file = "lm_pred.stan", data = list(N=length(train_y),
                                               X=train_x,
                                               Y=train_y,
                                               K=ncol(train_x),
                                               N_new = length(test_y),
                                               X_new = test_x,
                                               Y_new = test_y))

fit

my lm_pred.stan code is :

data{
  int<lower=0> N;
  int<lower=0> K;
  matrix[N, K] X;
  vector[N] Y;
  
  int<lower=0> N_new;
  matrix[N_new, K] X_new;
}

parameters{
  real alpha;
  vector[K] beta;
  real<lower=0> sigma;
  
  vector[N_new] Y_new;
}

model{
  Y ~ normal(alpha + X * beta, sigma);
  Y_new ~ normal(alpha + X_new * beta, sigma);
}

This leaves me with predictions in the console. I am used to using either the predict() function or tidyverse tools. I am just curious how other stan users would calculate RMSE.


#2

Both rstanarm and brms have predict functions that take a newdata argument. For things as simple as your example, you should be using that. If you were to write this in Stan code, Y_new should not be declared in parameters, but rather in generated quantities, like this

generated quantities {
  vector[N_new] Y_new;
  {
    vector[N_new] mu_new = alpha + X_new * beta;
    for (n in 1:N_new) Y_new[n] = normal_rng(mu_new[n], sigma);
  }
}

#3

Thanks, I like using brms and I understand that this would normally be a model that I would fit with brms, but I am practicing stan so I can fit more realistic models in the future.

Is it possible to fit a model in stan then use predict from brms on new data? If so, how would I get a stanfit object from my code above with the purpose of passing it to brms::predict?


#4

No


#5

Okay, so for models written in stan how do most people calculated root mean squared error for predicted values?


#6

Skipping sections of the Stan model you already have, something like this:

data {
  .
  .
  .
  // Observed Y values at the X_new values, not used in model estimation
  vector [N_new] Y_test; 
}
.
.
.
generated quantities {
  vector[N_new] Y_new;
  vector[N_new] indiv_squared_errors;
  real <lower = 0> sum_of_squares;
  real <lower = 0> root_mean_squared_error;

  {
    vector[N_new] mu_new = alpha + X_new * beta;
    for (n in 1:N_new)  {
      Y_new[n] = normal_rng(mu_new[n], sigma);
      indiv_squared_errors[n] = (Y_test[n] - Y_new[n])^2;
    }
  }
  // sum of squares, data set size and scale specific
  sum_of_squares = sum(indiv_squared_errors);

  // divide by number of new / test points and sqrt for RMSE
  root_mean_squared_error = sqrt(sum_of_squares / N_new);
  
}

#7

The usually condition on all the data and calculate the ELPD with loo::loo.