In-Sample Predictions in R with Stan

Hi. I have a two-stage HLM script for Stan and want to get the point estimates for the in-sample data. The goal is to have a vector of in-sample predictions that I can compare with the actual data to compute the mean squared error, which seems the most intuitive way to compare the in-sample accuracy of my models.

Q1: How can I get the point estimates from a Stan model, or with an R function?
Q2: This is a frequentist way to do it. Is there an alternative way to measure the predictive power of a model in a Bayesian manner? If yes, what are the methods?

Here’s a minimal working example:

R script:

## Load Data

data("radon", package = "rstanarm")
data <- radon

# County Categorical Variable

county_name <- as.vector(data$county) # create vector of county indications for each obs
county_list <- unique(county_name) # create list of counties, drop duplicates
J <- length(county_list) # number of counties in Minnesota

county <- rep(NA, length(county_name)) # county index for each obs
for (i in 1:J) {
  county[county_name == county_list[i]] <- i
}

# Seperate variables
logradon <- data$log_radon
loguranium <- data$log_uranium[!duplicated(data$county)] # one uranium value per county
floorind <- data$floor
N <- length(logradon)
county_i <- county
county_j <- 1:J

# Data list for Stan
datalist <- list(Y = logradon,
                 X = floorind,
                 U = loguranium,
                 N = N,
                 J = J,
                 county_i = county_i,
                 county_j = county_j)

Stan script:

// HLM 2

data {
	int<lower=1> N; // number of obs
	int<lower=1> J; // Number of counties
	vector[N] Y; // outcome data (log radon)
	vector[N] X; // input data, layer 1 (floor indication)
	vector[J] U; // input data, layer 2 (uranium)
	int<lower=1> county_i[N]; //county ID variable for x
	int<lower=1> county_j[J]; //county ID variable for u
}

parameters {
  real beta; // fixed slope
	real gamma_zero; // 2nd level fixed intercept 
	real gamma_one; // 2nd level fixed slope
	real<lower=0> sigma_alpha; // prior on noise of 2nd layer model
	real<lower=0> sigma_y; // prior on noise of 1st layer model
}

model { // Model specifications
  vector[N] mu_y;
  vector[J] alpha; // county j varying intercept 
   // 2nd Layer
	for(j in 1:J)
	alpha[county_j[j]] ~ normal(gamma_zero + gamma_one * U[county_j[j]], sigma_alpha);
  // 1st Layer
	for(i in 1:N)
	mu_y[i] = alpha[county_i[i]] + beta * X[i]; 
  // Likelihood
  Y ~ normal(mu_y, sigma_y);
  // Priors and hyperparameters
  sigma_y ~ cauchy(0, 2.5);
  sigma_alpha ~ cauchy(0, 2.5);
  beta ~ normal(0, 5);
  gamma_zero ~ normal(0, 5);
  gamma_one ~ normal(0, 5);
}

Q1: In our Bayesian workflows we avoid point estimates like crazy. We typically report the 50% uncertainty interval instead.
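To make that concrete: once the model produces posterior predictive draws (e.g. a Y_pred in generated quantities), both the point estimate and the interval are one-liners over the draws matrix. A minimal base-R sketch, where the simulated `draws` matrix is a stand-in for what `rstan::extract(fit)$Y_pred` would return (one row per draw, one column per observation):

```r
# Stand-in for posterior predictive draws: (draws x N) matrix
set.seed(1)
S <- 4000; N <- 6
draws <- matrix(rnorm(S * N, mean = 1.5, sd = 0.7), nrow = S, ncol = N)

# Point estimate per observation: posterior mean (median also common)
y_hat <- colMeans(draws)

# 50% uncertainty interval per observation (rows: 25% and 75% quantiles)
ci50 <- apply(draws, 2, quantile, probs = c(0.25, 0.75))

round(y_hat, 2)
round(ci50, 2)
```

With the real fit, you would simply swap the simulated matrix for the extracted `Y_pred` draws; everything downstream is identical.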

Q2: I first run simulated data through the model to make sure I can recover my true/known parameters. Then, when modelling, I toss in a generated quantities block like this:

generated quantities {
  real Y_pred[N];
  for (i in 1:N) {
    Y_pred[i] = normal_rng(alpha * (1 - exp(-(beta * (x[i] - x0)))), sigma); // posterior predictive draw
  }
}

for my model ;) You will need something else in there.
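And once Y_pred draws exist, the in-sample MSE from the original question falls out directly. A sketch in base R, with `y` and `draws` simulated as stand-ins (in practice `y = datalist$Y` and `draws = rstan::extract(fit)$Y_pred`):

```r
# Simulated stand-ins for observed data and posterior predictive draws
set.seed(42)
S <- 4000; N <- 100
y <- rnorm(N)
draws <- matrix(rnorm(S * N, mean = rep(y, each = S), sd = 0.5), nrow = S)

# Frequentist-style single number: MSE of the posterior-mean predictions
y_hat <- colMeans(draws)
mse <- mean((y - y_hat)^2)

# More Bayesian: keep the uncertainty and compute the MSE draw by draw,
# giving a whole distribution of the error rather than one number
mse_draws <- rowMeans(sweep(draws, 2, y)^2)
quantile(mse_draws, c(0.25, 0.5, 0.75))
```

Reporting the quantiles of `mse_draws` instead of the single `mse` is one way to keep the error measure in the same uncertainty-aware spirit as the rest of the workflow.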


A couple of resources that help to address this:
Summary of Betancourt’s talk.

The full Betancourt treatment.

A paper on such things.

A super compact version.
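On Q2 concretely, the resources above converge on scoring models by their log predictive density rather than squared error. If the Stan program also writes a pointwise log_lik in generated quantities (a hypothetical addition to the model above, e.g. `log_lik[i] = normal_lpdf(Y[i] | mu_y[i], sigma_y);`), the in-sample version (lppd) is a few lines of base R, and the same matrix feeds `loo::loo()` for the leave-one-out estimate. A sketch with a simulated matrix standing in for `rstan::extract(fit)$log_lik`:

```r
# Stand-in for the (draws x N) pointwise log-likelihood matrix
set.seed(7)
S <- 4000; N <- 50
log_lik <- matrix(rnorm(S * N, mean = -1, sd = 0.3), nrow = S)

# lppd: for each observation, average the likelihood over posterior draws,
# then sum the logs across observations
lppd <- sum(log(colMeans(exp(log_lik))))
lppd
# loo::loo(log_lik) would give the PSIS-LOO estimate; loo::loo_compare()
# then ranks competing models on expected log predictive density
```

Unlike MSE, this scoring respects the whole predictive distribution, not just its centre, which is why it is the usual Bayesian answer to "which model predicts better".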


I tried this, but it only works for my pooled, unpooled and single-level HLM models. For the code above I get very weird posterior predictive distributions that are worse than those of the other three models (it should be the other way around according to this paper). Really strange…

If I can get some time today I’ll run your model on my end and start taking a look to see what’s up.


That would be nice. I get a posterior predictive distribution, but it doesn’t seem to be better than the first model’s, and it is even about as bad as the unpooled and pooled models’, which I don’t understand. Then again, when calculating the RMSE, the HLM 2 model is only slightly worse than the HLM 1 model and better than the unpooled and completely pooled models. So my takeaway is that they are different performance measures, but I cannot really interpret the posterior predictive distribution, i.e. when it is good and when it is bad (given that the distributions of all models already look fairly good and the differences are tiny…).