Model comparison, Log-Likelihood and WAIC

Hi there.

I have four different models and would like to compare them.
I am working with Hierarchical Linear Models, so I have:

  1. Pooled Model
  2. Unpooled Model

as reference models (which should be outperformed).

and then

  1. Hierarchical Model with 1 Layer
  2. Hierarchical Model with 2 Layers

as models that should “probably” be better.

I have calculated the WAIC of the first three models and they’re exactly the same.
Now I am not sure whether I did something wrong or whether WAIC is simply the wrong measure for comparing these models. Basically, I want to show with some measure (RMSE, an information criterion, etc.) that models 3 and 4 are better, but I don’t know what the standard approach is in Bayesian data analysis.

The 2nd question is: I have the proper log-likelihood for the first three Stan models but don’t know how to properly write it out for the fourth one. I’d be happy if someone could help me out.

3rd question: I would like to perform out-of-sample predictions with LOO, or even leave out entire counties and predict them, but this seems very tedious.
For one, I don’t know how to do it using the Stan model, but also, certain counties have only one observation. Leaving such an observation out would mean an entire county is missing, which would defeat the purpose of what I wanted to do. Is there a way to limit which observations are chosen for the LOO?

Here’s the minimal R code:

## Load Data

data("radon", package = "rstanarm")
data <- radon


# County Categorical Variable

county_name <- as.vector(data$county) # create vector of county indications for each obs
county_list <- unique(county_name) # create list of counties, drop duplicates
J <- length(county_list) # number of counties in Minnesota

county <- rep(NA, length(county_name)) # one county index per observation (length N, not J)
for (i in 1:J){
  county[county_name==county_list[i]] <- i
}


# Separate variables
logradon <- data$log_radon
loguranium <- data$log_uranium
loguranium <- loguranium[!duplicated(data$county)] # one uranium value per county (dedup by county, not by value)
floorind <- data$floor
N <- length(logradon)
county_i <- county 
county_j <- 1:J # county-level index


# Data lists for Stan
cplist <- list(Y = logradon,
               X = floorind,
               N = N)

unplist <- list(Y = logradon,
                 X = floorind,
                 N = N,
                 J = J,
                 county = county)

datalist <- list(Y = logradon,
                 X = floorind,
                 U = loguranium,
                 N = N,
                 J = J,
                 county_i = county_i,
                 county_j = county_j)

Here’s the working Stan code for the 1st HLM:

// HLM 1 

data {
	int<lower=1> N; // number of obs
	int<lower=1> J; // Number of counties
	vector[N] Y; // outcome data (log radon)
	vector[N] X; // input data (floor)
	int<lower=1> county[N]; //county ID variable
}

parameters {
	vector[J] alpha; // county j intercept 
  real beta; 
	real mu_alpha; // hyperprior mean of alpha
	real<lower=0> sigma_alpha; // hyperprior sd of alpha
	real<lower=0> sigma_y; // residual sd of y
}

model { // Model specifications
  vector[N] mu_y;
  
  // Full Model
	for(i in 1:N)
	mu_y[i] = alpha[county[i]] + beta * X[i]; 
	
  // Likelihood
  Y ~ normal(mu_y, sigma_y);

  // Priors and hyperpriors
  sigma_y ~ cauchy(0, 5);
  
  alpha ~ normal(mu_alpha, sigma_alpha);
  mu_alpha ~ normal(0, 10);
	sigma_alpha ~ cauchy(0, 5);
	
  beta ~ normal(0, 10);
}

generated quantities {
  vector[N] log_lik;
  for (i in 1:N) log_lik[i] = normal_lpdf(Y[i]| alpha[county[i]] + beta * X[i], sigma_y);
}

Here’s the 2nd Stan model, where the log-likelihood is not yet correct.
(The model sigma here is a composite term made up of sigma_y^2 and sigma_alpha^2.
How can I write that down?)
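To spell out what I mean in equations (my own notation; indices as in the Stan model below, with j[i] the county of observation i): integrating the county intercepts out of

```latex
\alpha_j \sim \mathcal{N}\left(\gamma_0 + \gamma_1 u_j,\ \sigma_\alpha\right), \qquad
y_i \mid \alpha_{j[i]} \sim \mathcal{N}\left(\alpha_{j[i]} + \beta x_i,\ \sigma_y\right)
```

gives, by standard normal–normal marginalization, the composite scale I am referring to:

```latex
y_i \sim \mathcal{N}\left(\gamma_0 + \gamma_1 u_{j[i]} + \beta x_i,\ \sqrt{\sigma_\alpha^2 + \sigma_y^2}\right)
```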

// HLM 2

data {
	int<lower=1> N; // number of obs
	int<lower=1> J; // Number of counties
	vector[N] Y; // outcome data (log radon)
	vector[N] X; // input data, layer 1 (floor indicator)
	vector[J] U; // Input data Layer 2 (uranium)
	int<lower=1> county_i[N]; //county ID variable for x
	int<lower=1> county_j[J]; //county ID variable for u
}

parameters {
  real beta; // fixed slope
  vector[J] alpha; // county j varying intercept 
	real gamma_zero; // 2nd level fixed intercept 
	real gamma_one; // 2nd level fixed slope
	real<lower=0> sigma_alpha; // prior on noise of 2nd layer model
	real<lower=0> sigma_y;// prior on noise of 1st layer model
}

model { // Model specifications
  vector[N] mu_y;
  
  // 2nd Layer: county-level regression of intercepts on uranium
  // (vectorized; equivalent to the loop because county_j is simply 1:J)
	alpha ~ normal(gamma_zero + gamma_one * U, sigma_alpha);
  
  // 1st Layer
	for(i in 1:N)
	mu_y[i] = alpha[county_i[i]] + beta * X[i]; 
	
  // Likelihood
  Y ~ normal(mu_y, sigma_y);

  // Priors and hyperpriors
  sigma_y ~ cauchy(0, 2.5);
  
  sigma_alpha ~ cauchy(0,2.5);
	
  beta ~ normal(0, 5);
  
  gamma_zero ~ normal(0, 5);
  gamma_one ~ normal(0, 5);
}

generated quantities {
  vector[N] log_lik;
  for (i in 1:N)
  for (j in 1:J)
  log_lik[i] = normal_lpdf(Y[i]| gamma_zero + gamma_one * U[county_j[j]] + beta * X[i], sigma_y);
}

Hi,
sorry for taking so long to respond. The method of first choice for model comparison is usually the LOOIC from the loo package; WAIC can be problematic. You seem to be aware of the loo package already, so is there a reason you are not using the LOOIC criterion?
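In case it helps, a comparison with the loo package might look like the sketch below (fit1, fit2, fit3 are placeholder names for your fitted stanfit objects, each with a log_lik vector in generated quantities as in your models):

```r
library(loo)

# pointwise log-likelihood draws, kept per chain for the diagnostics
ll1 <- extract_log_lik(fit1, merge_chains = FALSE)
ll2 <- extract_log_lik(fit2, merge_chains = FALSE)
ll3 <- extract_log_lik(fit3, merge_chains = FALSE)

# relative effective sample sizes improve the PSIS-LOO reliability estimates
loo1 <- loo(ll1, r_eff = relative_eff(exp(ll1)))
loo2 <- loo(ll2, r_eff = relative_eff(exp(ll2)))
loo3 <- loo(ll3, r_eff = relative_eff(exp(ll3)))

# elpd differences with standard errors
loo_compare(loo1, loo2, loo3)
```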
It is also usually good to ask yourself whether model comparison is really what you are after. I’ve personally found that it rarely answers the questions I have. What is the scientific problem you are trying to solve? Or is this just an exercise for you?

Some minor things:

  • Regarding log_lik[i] = normal_lpdf(Y[i] | alpha[county[i]] + beta * X[i], sigma_y); — you could probably move mu_y to the transformed parameters block and reuse the computation as log_lik[i] = normal_lpdf(Y[i] | mu_y[i], sigma_y);
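For your HLM 1, that refactor might look roughly like this (a sketch showing only the changed blocks, not the full model):

```stan
transformed parameters {
  vector[N] mu_y;
  for (i in 1:N)
    mu_y[i] = alpha[county[i]] + beta * X[i];
}

model {
  Y ~ normal(mu_y, sigma_y); // mu_y computed once, above
  // ... priors unchanged ...
}

generated quantities {
  vector[N] log_lik;
  for (i in 1:N)
    log_lik[i] = normal_lpdf(Y[i] | mu_y[i], sigma_y);
}
```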

I don’t think I understand what you are trying to ask, could you elaborate?

I think it should be the same as in the first case: you compute mu_y[i] for each observation, and then the pointwise likelihood is just normal_lpdf(Y[i] | mu_y[i], sigma_y);. Or am I missing something?
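Concretely, for your HLM 2 that would mean dropping the inner loop over counties (which overwrites log_lik[i] on every pass and leaves out alpha) and conditioning on the sampled intercepts instead; a sketch, assuming the variable names from your model:

```stan
generated quantities {
  vector[N] log_lik;
  for (i in 1:N)
    log_lik[i] = normal_lpdf(Y[i] | alpha[county_i[i]] + beta * X[i], sigma_y);
}
```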

Hope that helps!