Discrepancy between model comparison criteria and posterior predictive checks

Hi all, I was hoping to get some input on a curious result I’ve found: a discrepancy between model comparison criteria (marginal likelihood and ELPD) and a simple posterior predictive check of the fitted models.

In short, I am fitting and comparing a number of reinforcement learning models (example Stan code below). Sampling the model parameters with rstan poses no problems, and I consistently get good convergence and sampler diagnostics (Rhat, effective sample size, etc.). To compare the models, I (1) estimate each model’s marginal likelihood using bridge sampling via the ‘bridgesampling’ package, (2) estimate each model’s expected log predictive density (ELPD) via the ‘loo’ package, and (3) compute the percentage of trials on which a model correctly predicts the observed choice (my posterior predictive check): the predicted choices are generated during sampling in the generated quantities block, and I average the agreement between actual and predicted choices across posterior samples.
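
Roughly, (1) and (2) look like the sketch below (in R; ‘fit’ is a placeholder for the stanfit object returned by rstan::sampling, and the object names are just illustrative). The accuracy computation in (3) is sketched after the Stan program further down.

library(rstan)
library(bridgesampling)
library(loo)

# (1) marginal likelihood via bridge sampling on the stanfit object
ml <- bridge_sampler(fit, silent = TRUE)
ml$logml

# (2) ELPD via PSIS-LOO on the per-subject log-likelihoods saved in generated quantities
log_lik <- extract_log_lik(fit, parameter_name = "log_lik", merge_chains = FALSE)
r_eff   <- relative_eff(exp(log_lik))
loo_fit <- loo(log_lik, r_eff = r_eff)
loo_fit$estimates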

Now, the curious thing is that the two model comparison criteria (marginal likelihood and ELPD) rank the various models in generally the same order, but the simple posterior predictive check shows chance-level performance for a subset of models whose ML/ELPD values are “competitive” with the other models. That is, all three ways of comparing model performance agree for most models, but the PPC collapses for a subset of models even though that subset’s ML/ELPD values look perfectly normal.

I’ve added some plots below to illustrate this point. For instance, why would model 7’s PPC be at chance level when the ML/ELPD values indicate an improved fit over prior models? Has anyone come across this or something similar before? Any insights/speculations are greatly appreciated. Please let me know if I can provide any further information to help clear this up.

criteria.pdf (115.6 KB)

ppc.pdf (459.8 KB)

data {
  int<lower=1> N;
  int<lower=1> T;
  int<lower=1, upper=T> T_subjs[N];
  int<lower=-1, upper=8> option1[N, T];
  int<lower=-1, upper=8> option2[N, T];
  int<lower=-1, upper=2> choice[N, T];
  real outcome[N, T];
}

transformed data {
  row_vector[3] initV;
  initV = rep_row_vector(0.0, 3);
}

parameters {
  // Declare all parameters as vectors for vectorizing
  // Hyper(group)-parameters
  vector[3] mu_pr;
  vector<lower=0>[3] sigma_pr;

  // Subject-level raw parameters 
  vector[N] learnrate_pr;
  vector[N] discount_pr;
  vector[N] tau_pr;
}

transformed parameters {
  // subject-level parameters
  vector<lower=0, upper=1>[N] learnrate;
  vector<lower=0, upper=1>[N] discount;
  vector<lower=0, upper=20>[N] tau;
  
  for (i in 1:N) {
    learnrate[i] = Phi_approx(mu_pr[1] + sigma_pr[1] * learnrate_pr[i]);
    discount[i]  = Phi_approx(mu_pr[2] + sigma_pr[2] * discount_pr[i]);
    tau[i]       = Phi_approx(mu_pr[3] + sigma_pr[3] * tau_pr[i])*20;
  }
}

model {
  // Hyperpriors defining the mean and standard deviation of the {learnrate, discount, tau} parameters
  mu_pr ~ normal(0,1);
  sigma_pr ~ normal(0,1);

  // individual parameters
  learnrate_pr ~ normal(0,1);
  discount_pr  ~ normal(0,1);
  tau_pr       ~ normal(0,1);

  // subject loop and trial loop
  for (i in 1:N) {
    matrix[6,3] ev;
    int decision;
    real PE;
    vector[2] option_values = [ 0, 0 ]'; 			

    for (idx in 1:6) { ev[idx] = initV; }

    for (t in 1:T_subjs[i]) {

      decision = (choice[i, t] > 1) ? option2[i, t] : option1[i, t];

      option_values = [ ev[option1[i, t], 1] , ev[option2[i, t], 1] ]';
   
      // compute action probabilities
      target += categorical_lpmf(choice[i, t] | softmax( option_values*tau[i] ));

      // update the chosen option's value across the three episodes
      for (e in 1:3) {
        if (e < 3) {
          // non-final episodes update toward the discounted value of the next episode
          PE = discount[i] * ev[decision, e+1] - ev[decision, e];
          ev[decision, e] += learnrate[i] * PE;
        } else {
          // the final episode updates toward the observed outcome
          PE = outcome[i, t] - ev[decision, e];
          ev[decision, e] += learnrate[i] * PE;
        }
      }
    }
  }
}

generated quantities {
  // For group level parameters
  real<lower=0, upper=1> mu_learnrate;
  real<lower=0, upper=1> mu_discount;
  real<lower=0, upper=20> mu_tau;

  // For log-likelihood values and posterior predictive check
  real log_lik[N];
  real y_pred[N,T];

  for (i in 1:N) {
    for (t in 1:T) {
      y_pred[i,t] = -1;
    }
  }

  mu_learnrate   = Phi_approx(mu_pr[1]);
  mu_discount    = Phi_approx(mu_pr[2]);
  mu_tau         = Phi_approx(mu_pr[3])*20;    

  { // local block for computing log_lik and y_pred
  for (i in 1:N) {

    matrix[6,3] ev;
    int decision;
    real PE;
    vector[2] option_values = [ 0, 0 ]'; 	

    log_lik[i] = 0;

    for (idx in 1:6) { ev[idx] = initV; }

    for (t in 1:T_subjs[i]) {

      decision = (choice[i, t] > 1) ? option2[i, t] : option1[i, t];

      option_values = [ ev[option1[i, t], 1] , ev[option2[i, t], 1] ]';

      // compute log likelihood of current trial
      log_lik[i] += categorical_lpmf(choice[i, t] | softmax( option_values*tau[i] ));

      // generate posterior prediction for current trial
      y_pred[i,t] = categorical_rng( softmax( option_values*tau[i] ));

      // same value update as in the model block
      for (e in 1:3) {
        if (e < 3) {
          PE = discount[i] * ev[decision, e+1] - ev[decision, e];
          ev[decision, e] += learnrate[i] * PE;
        } else {
          PE = outcome[i, t] - ev[decision, e];
          ev[decision, e] += learnrate[i] * PE;
        }
      } // episode loop
    } // trial loop
  } // individual loop
  } // local section
}
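
For reference, the percent-correct computation in (3) is roughly the following sketch (again in R; ‘fit’, ‘choice’, ‘T_subjs’, and ‘N’ stand for the stanfit object and the data passed to Stan, and this is the kind of computation described above rather than the exact code):

y_pred_draws <- rstan::extract(fit, pars = "y_pred")$y_pred   # draws x N x T

acc <- numeric(N)
for (i in 1:N) {
  Ti <- T_subjs[i]
  # per-trial probability that a sampled prediction matches the observed choice,
  # averaged over that subject's trials
  trial_acc <- sapply(1:Ti, function(t) mean(y_pred_draws[, i, t] == choice[i, t]))
  acc[i] <- mean(trial_acc)
}
mean(acc)   # overall proportion of correctly predicted choices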

I’d have to look more closely at the model to give more useful feedback, but what sticks out in your “posterior predictive check” is that it’s not really a posterior predictive check. A posterior predictive check assesses what the data would look like under a distribution of parameter values [i.e. “checking” p(D | \theta) from an inference made about p(\theta | D)]. As I understand it this is usually done visually, but ELPD would be a way of putting a numeric value on it instead; as expected, if the model is reasonable and the method works as intended, there is good agreement with the likelihood estimates.
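
For example, a quick visual check of the choice distribution could be something like the following sketch (using the y_pred draws and data objects from the question and bayesplot’s ppc_bars for discrete outcomes; names are placeholders):

library(bayesplot)

# flatten observed choices and their replications over subjects and valid trials
obs  <- unlist(lapply(1:N, function(i) choice[i, 1:T_subjs[i]]))
yrep <- do.call(cbind, lapply(1:N, function(i) y_pred_draws[, i, 1:T_subjs[i]]))

# bar plot of observed choice counts with posterior predictive intervals
ppc_bars(obs, yrep)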

What you are doing is comparing a different outcome of the fitted model (percentage of correct trials), so it’s not the flip side of the inference and may show some disagreement with the likelihood/posterior. It’s odd that the performance of some models is so poor, but one possibility is that the posterior probability is high because the trials the model does get right are predicted with high confidence, as opposed to getting more trials right with lower confidence (you could check that by putting together a table with the binary right/wrong outcome next to the probability the model assigns to the observed outcome).
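
Something along these lines could produce that sort of table (a sketch using the y_pred draws from the question; the probability the model assigns to the observed choice is estimated from the draws, and “right/wrong” is taken as whether the most frequent predicted choice matches the observed one):

rows <- list()
for (i in 1:N) {
  for (t in 1:T_subjs[i]) {
    draws    <- y_pred_draws[, i, t]
    p_actual <- mean(draws == choice[i, t])                  # estimated probability of the observed choice
    modal    <- as.integer(names(which.max(table(draws))))   # most frequent predicted choice
    rows[[length(rows) + 1]] <- data.frame(subject = i, trial = t,
                                           p_actual = p_actual,
                                           correct  = (modal == choice[i, t]))
  }
}
pred_table <- do.call(rbind, rows)
head(pred_table)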

Another (related) possibility is that some models predict values that are associated with higher variance (e.g. p = 0.5 in a Bernoulli trial gives higher variance taken over several trials, as in a binomial distribution), so there may be a statistical justification for the discrepancy between the likelihood and a practical criterion like percent correct.
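
To spell out the variance point: a single Bernoulli prediction with success probability p has variance

Var(X) = p(1 - p),   and over n trials   Var(X_1 + ... + X_n) = n p (1 - p),

both of which are largest at p = 0.5, i.e. when the model is least certain about each choice.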