LOO Model Comparison Alternative

I ran a simulation study comparing two models, and the fits come out equivalent in terms of LOO. However, when I run both models on real data, things don't line up as cleanly and I get some high Pareto k values.

I think this is because there are some outliers in the data. These data points are not mistakes, though, so I should not remove them. My understanding is that with this kind of LOO result I should not be comparing models with it. I then looked into K-fold CV, since I saw it is an option that can handle more extreme data points, but I am struggling to set it up for my data.

In short, my model is a count model with theta[j] + beta[i] + gamma[k], where theta is a person parameter, beta are individual item effects, and gamma are item-type effects. For instance, the real data has 1140 people and 16 items, with 3-5 items per item type (3 types in total).
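Concretely, the log rate in the code below is

$$
\log \lambda_n = \nu_{i[n]}\left(\alpha_{i[n]}\,\theta_{j[n]} + \beta_{i[n]} + \gamma_{k[n]}\right),
$$

where $i[n]$, $j[n]$, and $k[n]$ index the item, person, and item type of observation $n$.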

My cmdstan code for the more complex model is as follows:


mod_ecmp <- write_stan_file('
functions{

  real approximation(real log_lambda, real nu){
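      // asymptotic approximation to the log normalizing constant log Z(lambda, nu);
      // the residual series below is in powers of 1 / (nu * lambda^(1/nu))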
  
      real log_mu = log_lambda / nu;
      real nu_mu = nu * exp(log_mu);
      real nu2 = nu^2;
      // first 4 terms of the residual series
      real log_sum_resid = log1p(
        nu_mu^(-1) * (nu2 - 1) / 24 +
        nu_mu^(-2) * (nu2 - 1) / 1152 * (nu2 + 23) +
        nu_mu^(-3) * (nu2 - 1) / 414720 * (5 * nu2^2 - 298 * nu2 + 11237)+
        nu_mu^(-4) * (nu2 - 1) / 39813120 * (5 * nu2^3 - 1887 * nu2^2 - 241041 * nu2 + 2482411)

      );
      return nu_mu + log_sum_resid  -
            ((log(2 * pi()) + log_mu) * (nu - 1) / 2 + log(nu) / 2);  
  
  }
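  // log normalizing constant log Z(lambda, nu) by direct summation on the
  // log scale, stopping once successive partial sums change by < 1e-5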
  real summation(real log_lambda, real nu){

    real z = negative_infinity();
    real z_last = 0;

    for (j in 0:1000000) {
      z_last = z;
      z = log_sum_exp(z, j * log_lambda - nu * lgamma(j + 1));

      if (abs(z - z_last) < 1e-05) {
        break;
      }
      if (j >= 999998) {
        reject("max terms hit, prob bad value");
      }
    }
    return z;
  }
  
  
  real partial_sum(array[] int slice_n, int start, int end,
                   array[] int y_slice,
                   array[] int jj_slice, array[] int ii_slice, vector theta,
                   vector beta, vector nu, vector alpha, array[] int kk_slice, vector gamma) {
    real partial_target = 0.0;
    
    for (n in start : end) {
      real log_lambda = (alpha[ii_slice[n]] * theta[jj_slice[n]] + beta[ii_slice[n]] + gamma[kk_slice[n]]) * nu[ii_slice[n]];
      real log_prob = 0;
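      // use the asymptotic approximation of log Z only in the region where
      // it is accurate; otherwise fall back to direct summation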

    if((log_lambda > 8.25 && nu[ii_slice[n]] > 4 && log_lambda < 40) ||
        (log_lambda > 1 && log_lambda < 11 && nu[ii_slice[n]] > 1.5 && nu[ii_slice[n]] < 3.1))
    {

      

       log_prob = y_slice[n] * log_lambda -
                 nu[ii_slice[n]] * lgamma(y_slice[n] + 1) -
                  approximation(log_lambda, nu[ii_slice[n]]);
      } else {
       log_prob = y_slice[n] * log_lambda -
                   nu[ii_slice[n]] * lgamma(y_slice[n] + 1) -
                   summation(log_lambda, nu[ii_slice[n]]);

      }
                   
      partial_target += log_prob;
    }
    return partial_target;
   }
  
  
}
data {
  int<lower=0> I;
  int<lower=0> J;
  int<lower=1> N;
  int<lower=1> K;
  array[N] int<lower=1, upper=I> ii;
  array[N] int<lower=1, upper=J> jj;
  array[N] int<lower=1, upper=K> kk;
  array[I] int<lower=1,upper=K> item_type_for_beta;
  array[N] int<lower=0> y;
  array[N] int seq_N;
  int<lower=0> grainsize;
}

parameters {

  vector[J] theta;
  vector[K] gamma;
  vector[I] beta;
  
  vector<lower=0>[I] alpha;
  real<lower=0, upper = 3> sigma_alpha;
  
  vector<lower=0.0001, upper = 4>[I] nu;
  real<lower=0, upper = 5> sigma_nu;
  vector<lower=0, upper = 3>[K] sigma_beta_k;
  
  real<lower=0, upper = 5> sigma_gamma;
  real<lower=0, upper = 5> mu_gamma;


}
model {
  
  sigma_nu ~ cauchy(0,5);
  nu ~ lognormal(0, sigma_nu);
  
  theta ~ normal(0, .3); 
  
  alpha ~ lognormal(0,sigma_alpha);
  sigma_alpha ~ cauchy(0,5);
  
  gamma ~ normal(mu_gamma,sigma_gamma);
  mu_gamma ~ normal(0,5);
  sigma_gamma ~ cauchy(0,5);

  sigma_beta_k ~ cauchy(0,5);
  
  for (i in 1:I) { 
    beta[i] ~  normal(gamma[item_type_for_beta[i]], sigma_beta_k[item_type_for_beta[i]]);
  }
  
    target += reduce_sum(partial_sum, seq_N, grainsize, y, jj, ii, theta,
                       beta, nu, alpha, kk, gamma);
}
generated quantities {
  array[N] real log_lik;

  for (n in 1:N) {
      real log_lambda = nu[ii[n]] * (alpha[ii[n]] * theta[jj[n]] + beta[ii[n]] + gamma[kk[n]]);
          if((log_lambda > 8.25 && nu[ii[n]] > 4 && log_lambda < 40) ||
        (log_lambda > 1 && log_lambda < 11 && nu[ii[n]] > 1.5 && nu[ii[n]] < 3.1))
    {
          log_lik[n] = y[n] * log_lambda - nu[ii[n]] * lgamma(y[n] + 1) - approximation(log_lambda, nu[ii[n]]);
      } else {
          log_lik[n] = y[n] * log_lambda - nu[ii[n]] * lgamma(y[n] + 1) - summation(log_lambda, nu[ii[n]]);
      }
  }
}


')

where the likelihood is a Conway–Maxwell–Poisson pmf,

$$
P(Y = y \mid \lambda, \nu) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda, \nu)},
\qquad
Z(\lambda, \nu) = \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\nu}},
$$

whose log normalizing constant $\log Z$ is what the summation and approximation functions compute.

How can I do K-fold analysis with this data? Is it as simple as assigning observations to folds, or does the structure of my data change how I should go about this? I would also like to explore this type of mixture LOO, described in the loo vignette "Mixture IS leave-one-out cross-validation for high-dimensional Bayesian models". However, I am not sure I can do that while using reduce_sum, which may cause issues with runtime (I could be wrong about the implementation here, though).

Thanks!

Looking at the effective number of parameters p_loo, which is about 1482, and knowing you have, if I count correctly, fewer parameters (1140 + 16 + 3 + 3 + 3 + 16 = 1181), this would indicate your data model has thinner tails than the true data distribution (see the p_loo and Pareto k entries in the loo package glossary: loo-glossary • loo).

Before trying to work out how to compute more reliable cross-validation, I recommend doing some posterior predictive checking and considering a thicker-tailed model. I can't quickly see from your partial_sum code what the actual model is.
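As a minimal sketch of such a check (assuming you add a posterior predictive `yrep` array to the generated quantities block; e.g. for the simpler Poisson model, `yrep[n] = poisson_log_rng(theta[jj[n]] + beta[ii[n]]);`):

```r
# Sketch only: `fit` is an rstan fit whose generated quantities block
# also produces `yrep`, posterior predictive draws for each observation.
library(bayesplot)

yrep <- as.matrix(fit, pars = "yrep")                   # draws x N matrix
ppc_dens_overlay(y = stan_data$y, yrep = yrep[1:50, ])  # overall shape
ppc_stat(y = stan_data$y, yrep = yrep, stat = "max")    # tail / outlier behaviour
```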

It seems like you could do K-fold-CV, as I don't see any non-factorizable dependency structure.
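For constructing the folds, the helpers in the loo package are a reasonable starting point; a sketch (whether to use random or grouped folds depends on whether you want to predict new responses from known persons or whole new persons):

```r
library(loo)

# random assignment of individual observations to folds
fold_ids <- kfold_split_random(K = 10, N = N)

# or leave out whole persons at a time, if the target is
# prediction for new persons rather than new responses
fold_ids_grouped <- kfold_split_grouped(K = 10, x = jj)
```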

You can't do it with one reduce_sum. You might be able to do it with two reduce_sums, but there might be numerical issues. However, you should first check that your model's predictive distributions are reasonably calibrated before trying to improve the CV computation.

I tried implementing K-fold, but it does not seem to work well. Do you see any issues in my code here? I'm trying to compare these two models.

mod_rpcm <- stan_model(model_code = 'data {
  int<lower=0> I;
  int<lower=0> J;
  int<lower=1> N;
  int<lower=1,upper=I> ii[N];
  int<lower=1,upper=J> jj[N];
  int<lower=0> y[N];

}
parameters {
  vector[J] theta;
  vector[I] beta;
  real<lower=0> sigma_beta;
  real mu_beta;
}

model {
  theta ~ normal(0,0.3);

  sigma_beta ~ cauchy(0, 5);
  mu_beta ~ normal(0,5);
  beta ~ normal(mu_beta, sigma_beta);

  y ~ poisson_log( theta[jj] + beta[ii]);
}
generated quantities {
  vector[N] log_lik;
    for (n in 1:N) {
    log_lik[n] = poisson_log_lpmf(y[n] | theta[jj[n]] + beta[ii[n]]);
  }
}
')

mod_arpcm <- stan_model(model_code = 'data {
  int<lower=0> I;
  int<lower=0> J;
  int<lower=1> N;
  int<lower=1,upper=I> ii[N];
  int<lower=1,upper=J> jj[N];
  int<lower=0> y[N];
}
parameters {
  vector[J] theta;
  vector[I] beta;
  
  real<lower=0> sigma_beta;
  real mu_beta;
  
  vector<lower=0>[I] alpha;
  real<lower=0> sigma_alpha;
}

model {

  theta ~ normal(0,0.3);
  
  sigma_beta ~ cauchy(0, 5);
  mu_beta ~ normal(0,5);
  beta ~ normal(mu_beta, sigma_beta);
    
  sigma_alpha ~ cauchy(0,2);
  alpha ~ lognormal(0,sigma_alpha);

  y ~ poisson_log(alpha[ii].*theta[jj] + beta[ii]);

}
generated quantities {
  vector[N] log_lik;
    for (n in 1:N) {
    log_lik[n] = poisson_log_lpmf(y[n] | alpha[ii[n]]*theta[jj[n]] + beta[ii[n]]);
  }
}
')

data <- dat_calib$Action$Science$S12
data <- data[,-5]
data <- data[complete.cases(data), ]
temp <- data[,-1]

I <- ncol(temp)
J <- nrow(temp)
temp <- as.vector(t(temp))
N <- I*J
ii <- rep(1:I, times = J)
jj <- rep(1:J, each = I)


stan_data <- list(I = I,
                  J = J,
                  N = N,
                  ii = ii,
                  jj = jj,
                  y = temp,
                  grainsize =1,
                  seq_N = 1:N)

# Full data list (same for all folds)
stan_data_full <- list(
  I = I,
  J = J,
  N = N,
  ii = ii,
  jj = jj,
  y = temp
)

#####
# Helper function to subset data
subset_stan_data <- function(data, indices) {
  list(
    I = data$I,
    J = data$J,
    N = length(indices),
    ii = data$ii[indices],
    jj = data$jj[indices],
    y = data$y[indices]
  )
}
n_folds <- 5

# Initialize storage for expected log predictive density (ELPD) and pointwise log-lik
elpd_rpcm <- numeric(n_folds)
elpd_arpcm <- numeric(n_folds)

# Randomly assign each observation (y[n]) to a fold
fold_ids <- sample(rep(1:n_folds, length.out = N))  # N = total # of responses

#--- Main k-fold loop ---
for (fold in seq_len(n_folds)) {
  cat("\nProcessing fold", fold, "\n")
  
  test_idx <- which(fold_ids == fold)
  train_idx <- which(fold_ids != fold)
  
  train_data <- subset_stan_data(stan_data_full, train_idx)
  test_data  <- subset_stan_data(stan_data_full, test_idx)
  
  # Fit on training data
  fit_model <- sampling(
    mod_arpcm,
    data = train_data,
    chains = 4,
    iter = 4000,
    warmup = 2000,
    thin = 1,
    control = list(adapt_delta = 0.85),
    seed = 123
  )
  
  #-----------------------------------------------------------------
  # Step: Manually compute the log-likelihood for the test set
  #       using the posterior draws from the training fit.
  #-----------------------------------------------------------------
  
  # Extract the posterior draws we need
  # Each is an array [iteration, ?].
  # alpha:  [iterations, I]
  # beta:   [iterations, I]
  # theta:  [iterations, J]
  pars <- rstan::extract(
    fit_model, 
    pars = c("alpha","beta","theta")
  )
  
  # We'll compute log_lik for each test point, for each posterior draw
  # Then average (mean) across draws to get pointwise predictions.
  
  S <- length(pars$alpha[,1])  # number of posterior draws
  
  # Container for each test data point's average log-likelihood
  test_log_lik_mean <- numeric(length(test_idx))
  
  # For each test observation n
  for(i_obs in seq_along(test_idx)) {
    # "n" is the row in the original (full) data
    n <- test_idx[i_obs]
    
    # Indices for this observation
    i_item  <- stan_data_full$ii[n]  # item ID
    j_person <- stan_data_full$jj[n] # person ID
    y_obs    <- stan_data_full$y[n]  # the observed count
    
    # For each draw s, compute log-lambda -> log-lik
    # alpha[i_item, s] * theta[j_person, s] + beta[i_item, s] 
    # Then dpois(..., log=TRUE)
    ll_draws <- numeric(S)
    for(s in seq_len(S)) {
      log_lambda <- pars$alpha[s, i_item] * pars$theta[s, j_person] +
        pars$beta[s,  i_item] 
      # The log-likelihood at draw s
      ll_draws[s] <- dpois(y_obs, exp(log_lambda), log = TRUE)
    }
    
    # Mean over the S draws
    test_log_lik_mean[i_obs] <- mean(ll_draws)
  }
  
  # Summation of average pointwise log-likelihood for test points
  elpd_arpcm[fold] <- sum(test_log_lik_mean)
  

  cat("\nFitting RPCM", fold, "\n")
  # Fit on training data
  fit_model <- sampling(
    mod_rpcm,
    data = train_data,
    chains = 1,
    iter = 4000,
    thin = 1,
    warmup = 2000,
    control = list(adapt_delta = 0.85),
    seed = 123
  )
  
  #-----------------------------------------------------------------
  # Step: Manually compute the log-likelihood for the test set
  #       using the posterior draws from the training fit.
  #-----------------------------------------------------------------
  
  # Extract the posterior draws we need
  # Each is an array [iteration, ?].
  # beta:   [iterations, I]
  # theta:  [iterations, J]
  pars <- rstan::extract(
    fit_model, 
    pars = c("beta","theta")
  )
  # We'll compute log_lik for each test point, for each posterior draw
  # Then average (mean) across draws to get pointwise predictions.
  S <- length(pars$beta[,1])
  
  # Container for each test data point's average log-likelihood
  test_log_lik_mean <- numeric(length(test_idx))
  
  # For each test observation n
  for(i_obs in seq_along(test_idx)) {
    # "n" is the row in the original (full) data
    n <- test_idx[i_obs]
    
    # Indices for this observation
    i_item  <- stan_data_full$ii[n]  # item ID
    j_person <- stan_data_full$jj[n] # person ID
    y_obs    <- stan_data_full$y[n]  # the observed count
    
    # For each draw s, compute log-lambda -> log-lik
    # theta[j_person, s] + beta[i_item, s] 
    # Then dpois(..., log=TRUE)
    ll_draws <- numeric(S)
    for(s in seq_len(S)) {
      log_lambda <- pars$theta[s, j_person] +
        pars$beta[s,  i_item]     
      
      # The log-likelihood at draw s
      ll_draws[s] <- dpois(y_obs, exp(log_lambda), log = TRUE)
    }
    
    # Mean over the S draws
    test_log_lik_mean[i_obs] <- mean(ll_draws)
  }
  
  # Summation of average pointwise log-likelihood for test points
  elpd_rpcm[fold] <- sum(test_log_lik_mean)
  
  
  
  cat("ELPD for aRPCM fold", fold, ":", elpd_arpcm[fold], "\n")
  cat("ELPD for RPCM fold", fold, ":", elpd_rpcm[fold], "\n")
}

It seems that at least you make an error in computing elpd: you compute only the mean of the log densities, while you need the log of the mean of the densities.

Check the loo package vignette "Holdout validation and K-fold cross-validation of Stan programs with the loo package", which shows how to do K-fold-CV using the generated quantities block, and use the elpd() function, which does the correct computation in a numerically stable way.
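For the loop above, a sketch of the corrected computation (assuming `ll` is an S x N_test matrix holding your `ll_draws` for each test observation as columns):

```r
library(matrixStats)

# elpd_i = log( (1/S) * sum_s exp(ll[s, i]) ), computed stably per column
elpd_pointwise <- colLogSumExps(ll) - log(nrow(ll))
elpd_fold <- sum(elpd_pointwise)

# the loo::elpd() function mentioned above does this same computation
```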
