Help with a very simple example of k-fold cross-valiation with loo

I’m trying to learn how to do k-fold cross-validation with loo following the example from appendix A.4. Suppose my data looks like this:

library(dplyr)
set.seed(9782)
N <- 1000
df <- data.frame(x = rnorm(n = N, mean = 10, sd = 2), 
                 e = rnorm(n=N, mean = 0, sd=2)) %>% 
  mutate(y = 10 + 0.5*x + e) %>% select(-e)

train <- sample_frac(df, 0.7)
sid <- as.numeric(rownames(train)) 
test <- df[-sid,]

this is my stan code:


data {
  int<lower=1> N_t ;                            // (training) number of observations
  vector[N_t] x_t ;                             // (training) x
  vector[N_t] y_t ;                             // (training) outcome
  int<lower=1> N_h ;                            // (houldout) number of observations
  vector[N_h] x_h ;                             // (houldout) x
  vector[N_h] y_h ;                             // (houldout) outcome
}

parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}

model {
  alpha ~ normal(0,10);
  beta ~ normal(0,10);
  sigma ~ normal(0,10);
  y_t ~ normal(alpha + beta*x_t, sigma);
}

generated quantities {
  vector[N_h] y_sim_h;
  vector[N_h] log_lik_h;
  vector[N_t] log_lik_t;
  for (n in 1:(N_h))  {
    y_sim_h[n] = normal_rng(alpha + beta*x_h[n], sigma);
    log_lik_h[n] = normal_lcdf(y_h[n] | alpha + beta*x_h[n], sigma);
  }
  
  for (n in 1:(N_t))  {
    log_lik_t[n] = normal_lcdf(y_t[n] | alpha + beta*x_t[n], sigma);
  }
  
}

this is how i fit the model

library(rstan)
stan_list <- list(N_t = nrow(train),
                  x_t = train$x,
                  y_t = train$y,
                  N_h = nrow(test),
                  x_h = test$x,
                  y_h = test$y)

fit <- sampling(model, data = stan_list)
print(fit, pars = c("alpha", "beta", "sigma"))

this the density overlay with bayesplot

library(bayesplot)

y_sim_h <- as.data.frame(fit, pars="y_sim_h") 

ppc_dens_overlay(y = test$y,
                 yrep = as.matrix(y_sim_h[sample(x = 1:nrow(y_sim_h), size = 50),]))

and this is how I use the loo package:

library(loo)
# Extract pointwise log-likelihood and compute LOO
log_lik_1 <- extract_log_lik(fit, parameter_name = "log_lik_h")
loo_1 <- loo(log_lik_1)
print(loo_1)

# 
# marginal predictive check using LOO probability integral transform
psis <- psislw(-log_lik_1)
ppc_loo_pit_overlay(test$y, as.matrix(y_sim_h),
                    lw = psis$lw_smooth)

I have a couple of questions:

  1. Is what i did for my basic example right or am I missing something? For example, should I be running the model again but fliping what is training and test?

  2. Can you point me to an example in which the data is partitioned in 3 instead of 2?

Thanks!

The beginning seems fine.

log_lik_h is already for the hold-out data, so no need to run loo.
Instead compute log(colMeans(exp(log_lik_h))) and there should be also better function
log_colMeans_exp(log_lik_h) or something like that which computes with better numerical accuracy.

Instead of ppc_loo_pit_overlay, ppc_pit_overlay should work like ppc_dens_overlay

Yes, and even better divide data in 10 parts.

New loo 2.0 has kfold-helper functions for creating data divisions, and then it’s easy to make demo.
loo 2.0 is in github now and in CRAN maybe next week?

1 Like

Just to clarify about the plotting functions: ppc_loo_pit_overlay() is the right function name in the new bayesplot v1.5 for the plot comparing to overlaid uniform densities (it calls ppc_dens_overlay() internally).