How can I solve the problem of high Pareto-k values?

My data come from a reversal learning task consisting of 4 blocks. There are two choices (1 and 2), and the values of the two choices are reversed when one block switches to the next. I have 52 participants, and in most cases participants completed 120 trials. I am trying to fit a hierarchical reinforcement learning model to the data using Stan. The problem is that when I use the loo function to compute the LOOIC value, most of the Pareto k diagnostic values are bad or very bad, as follows:

Computed from 12000 by 52 log-likelihood matrix

         Estimate    SE
elpd_loo  -2282.4 155.6
p_loo        65.4   3.4
looic      4564.8 311.1
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)         0   0.0%   <NA>
 (0.5, 0.7]   (ok)           8  15.4%   112
   (0.7, 1]   (bad)         36  69.2%   13
   (1, Inf)   (very bad)     8  15.4%   12
See help('pareto-k-diagnostic') for details.
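(For reference, I computed this with the loo R package, roughly like this — a sketch, the exact call may have differed; `fit` stands for my stanfit object:)

library(loo)
log_lik <- extract_log_lik(fit, parameter_name = "log_lik", merge_chains = FALSE)
r_eff   <- relative_eff(exp(log_lik))   # relative effective sample sizes
loo_fit <- loo(log_lik, r_eff = r_eff)  # PSIS-LOO with Pareto k diagnostics
print(loo_fit)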

Does this mean that my model is misspecified? Is there a problem in my code? I have attached my data and code.

Thank you for helping me.

RLdata.txt (171.7 KB)

data {
  int<lower=1> N;                          // Number of subjects

  int<lower=1> B;                          // Maximum number of blocks across subjects
  int<lower=1> Bsubj[N];                   // Number of blocks for each subject

  int<lower=0> T;                          // Maximum number of trials across subjects
  int<lower=0, upper=T> Tsubj[N, B];       // Number of trials per block for each subject

  int<lower=-1, upper=2> choice[N, B, T];  // The choices subjects made
  real outcome[N, B, T];                   // The outcome
}

transformed data {
  // Default value for (re-)initializing parameter vectors
  vector[2] initV;
  initV = rep_vector(0.0, 2);
}

// Declare all parameters as vectors for vectorizing
parameters {
  // Hyper(group)-parameters
  vector[2] mu_pr;
  vector<lower=0>[2] sigma;

  // Subject-level raw parameters (for Matt trick)
  vector[N] alpha_pr;  // learning rate
  vector[N] beta_pr;   // inverse temperature
}

transformed parameters {
  // Transform subject-level raw parameters
  vector<lower=0, upper=1>[N] alpha;
  vector<lower=0, upper=10>[N] beta;

  for (i in 1:N) {
    alpha[i] = Phi_approx(mu_pr[1] + sigma[1] * alpha_pr[i]);
    beta[i]  = Phi_approx(mu_pr[2] + sigma[2] * beta_pr[i]) * 10;
  }
}

model {
  // Hyperparameters
  mu_pr ~ normal(0, 1);
  sigma ~ normal(0, 1);

  // Individual parameters
  alpha_pr ~ normal(0, 1);
  beta_pr ~ normal(0, 1);

  for (i in 1:N) {
    for (bIdx in 1:Bsubj[i]) {
      // Define values
      vector[2] ev;  // Expected value
      real PE;       // Prediction error

      // Initialize values
      ev = initV;    // Initial ev values

      for (t in 1:Tsubj[i, bIdx]) {
        // Softmax choice
        choice[i, bIdx, t] ~ categorical_logit(ev * beta[i]);

        // Prediction error
        PE = outcome[i, bIdx, t] - ev[choice[i, bIdx, t]];

        // Update expected value of chosen stimulus
        ev[choice[i, bIdx, t]] += alpha[i] * PE;
      }
    }
  }
}

generated quantities {
  // For group-level parameters
  real<lower=0, upper=1> mu_alpha;
  real<lower=0, upper=10> mu_beta;

  // For log-likelihood calculation
  real log_lik[N];

  // For model regressors
  real ev_c[N, B, T];   // Expected value of the chosen option
  real ev_nc[N, B, T];  // Expected value of the non-chosen option
  real pe[N, B, T];     // Prediction error

  // For posterior predictive check
  real y_pred[N, B, T];

  // Initialize all the variables to avoid NULL values
  for (i in 1:N) {
    for (b in 1:B) {
      for (t in 1:T) {
        ev_c[i, b, t]  = 0;
        ev_nc[i, b, t] = 0;
        pe[i, b, t]    = 0;

        y_pred[i, b, t] = -1;
      }
    }
  }

  mu_alpha = Phi_approx(mu_pr[1]);
  mu_beta  = Phi_approx(mu_pr[2]) * 10;

  { // local section, this saves time and space
    for (i in 1:N) {

      log_lik[i] = 0;

      for (bIdx in 1:Bsubj[i]) {
        // Define values
        vector[2] ev;  // Expected value
        real PE;       // Prediction error

        // Initialize values
        ev = initV;    // Initial ev values

        for (t in 1:Tsubj[i, bIdx]) {
          // Softmax choice
          log_lik[i] += categorical_logit_lpmf(choice[i, bIdx, t] | ev * beta[i]);

          // Generate posterior prediction for current trial
          y_pred[i, bIdx, t] = categorical_rng(softmax(ev * beta[i]));

          // Prediction error
          PE = outcome[i, bIdx, t] - ev[choice[i, bIdx, t]];

          // Store values for model regressors
          ev_c[i, bIdx, t]  = ev[choice[i, bIdx, t]];
          ev_nc[i, bIdx, t] = ev[3 - choice[i, bIdx, t]];  // index of non-chosen option (choices are 1 and 2)
          pe[i, bIdx, t]    = PE;

          // Update expected value of chosen stimulus
          ev[choice[i, bIdx, t]] += alpha[i] * PE;
        }
      }
    }
  }
}

You are doing leave-one-subject-out cross-validation and at the same time have two parameters for each subject. Removing all the observations for one subject makes the posterior for those two parameters change a lot. See the loo package glossary (loo-glossary • loo) and the Roaches cross-validation demo, Section 4. You could solve this by using integrated LOO as in the Roaches cross-validation demo, Section 5. As you have two parameters per subject, the integration is a bit trickier, but someone did report getting it to work on Stan Discourse (sorry, I don't have time to search for that thread).
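In the one-parameter case the integration can be done with Gauss-Hermite quadrature per posterior draw, roughly as in this untested sketch (R; `log_lik_subject()` is a placeholder you would write yourself to recompute subject i's summed log-likelihood given a raw parameter value and the group-level draws mu_s, sigma_s):

library(statmod)      # gauss.quad() for Gauss-Hermite nodes and weights
library(matrixStats)  # logSumExp()

gh <- gauss.quad(30, kind = "hermite")  # rule for weight function exp(-x^2)

# log p(y_i | mu_s, sigma_s) = log of the integral of p(y_i | z) N(z | 0, 1) dz;
# the change of variables z = sqrt(2) * x matches the Hermite weight
integrated_log_lik <- function(i, mu_s, sigma_s) {
  z  <- sqrt(2) * gh$nodes
  lw <- log(gh$weights) - 0.5 * log(pi)
  ll <- vapply(z, function(zj) log_lik_subject(i, zj, mu_s, sigma_s), numeric(1))
  logSumExp(lw + ll)
}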


Thank you for the reply. I will try your suggestion.
While reading the Roaches cross-validation demo, Section 5, I was unsure whether the integration can be applied to a model with more than two parameters. My other models have 3 to 4 subject-level parameters, and they also have the high Pareto k problem. Can I do the integration when a model has more than two parameters?

Before I received your answer, I found that someone with a similar problem had solved the high Pareto k issue by modifying the code to do leave-one-trial-out cross-validation (leaving out a single trial of one participant). I tried it as below:



data {
  int<lower=1> N;                          // Number of subjects

  int<lower=1> B;                          // Maximum number of blocks across subjects
  int<lower=1> Bsubj[N];                   // Number of blocks for each subject

  int<lower=0> T;                          // Maximum number of trials across subjects
  int<lower=0, upper=T> Tsubj[N, B];       // Number of trials per block for each subject

  int<lower=-1, upper=2> choice[N, B, T];  // The choices subjects made
  real outcome[N, B, T];                   // The outcome
}

transformed data {
  // Default value for (re-)initializing parameter vectors
  vector[2] initV;
  initV = rep_vector(0.0, 2);
}

// Declare all parameters as vectors for vectorizing
parameters {
  // Hyper(group)-parameters
  vector[2] mu_pr;
  vector<lower=0>[2] sigma;

  // Subject-level raw parameters (for Matt trick)
  vector[N] alpha_pr;   // learning rate 
  vector[N] beta_pr;   // inverse temperature
}

transformed parameters {
  // Transform subject-level raw parameters
  vector<lower=0, upper=1>[N] alpha;
  vector<lower=0, upper=10>[N] beta;

  for (i in 1:N) {
    alpha[i]  = Phi_approx(mu_pr[1] + sigma[1] * alpha_pr[i]);
    beta[i]  = Phi_approx(mu_pr[2] + sigma[2] * beta_pr[i]) * 10;
  }
}

model {
  // Hyperparameters
  mu_pr  ~ normal(0, 1);
  sigma ~ normal(0, 1);

  // individual parameters
  alpha_pr ~ normal(0, 1);
  beta_pr ~ normal(0, 1);

  for (i in 1:N) {
    for (bIdx in 1:Bsubj[i]) {  // new
      // Define Values
      vector[2] ev;  // Expected value
      real PE;       // Prediction error

      // Initialize values
      ev = initV;     // Initial ev values

      for (t in 1:Tsubj[i, bIdx]) {
        // Softmax choice
        choice[i, bIdx, t] ~ categorical_logit(ev * beta[i]);

        // Prediction Error
        PE = outcome[i, bIdx, t] - ev[choice[i, bIdx, t]];

        // Update expected value of chosen stimulus
        ev[choice[i, bIdx, t]] += alpha[i] * PE;
        
      }
    }
  }
}

generated quantities {
  // For group level parameters
  real<lower=0, upper=1> mu_alpha;
  real<lower=0, upper=10> mu_beta;

  // For log likelihood calculation
  real log_lik[N, B, T];

  // For model regressors
  real ev_c[N, B, T];   // Expected value of the chosen option
  real ev_nc[N, B, T];  // Expected value of the non-chosen option
  real pe[N, B, T];     // Prediction error

  // For posterior predictive check
  real y_pred[N, B, T];

  // Initialize all the variables to avoid NULL values
  for (i in 1:N) {
    for (b in 1:B) {
      for (t in 1:T) {
        ev_c[i, b, t]  = 0;
        ev_nc[i, b, t] = 0;
        pe[i, b, t]    = 0;

        y_pred[i, b, t]   = -1;
        log_lik[i, b, t] = 0;
      }
    }
  }

  mu_alpha = Phi_approx(mu_pr[1]);
  mu_beta = Phi_approx(mu_pr[2]) * 10;

  { // local section, this saves time and space
    for (i in 1:N) {

      for (bIdx in 1:Bsubj[i]) {  // new
        // Define values
        vector[2] ev;  // Expected value
        real PE;       // Prediction error

        // Initialize values
        ev = initV; // initial ev values

        for (t in 1:Tsubj[i, bIdx]) {
          // Softmax choice
          log_lik[i, bIdx, t] = categorical_logit_lpmf(choice[i, bIdx, t] | ev * beta[i]);

          // generate posterior prediction for current trial
          y_pred[i, bIdx, t] = categorical_rng(softmax(ev * beta[i]));

          // Prediction Error
          PE = outcome[i, bIdx, t] - ev[choice[i, bIdx, t]];

          // Store values for model regressors
          ev_c[i, bIdx, t]   = ev[choice[i, bIdx, t]];
          ev_nc[i, bIdx, t]  = ev[3 - choice[i, bIdx, t]];  // index of non-chosen option (choices are 1 and 2)
          pe[i, bIdx, t]     = PE;

          // Update expected value of chosen stimulus
          ev[choice[i, bIdx, t]] += alpha[i] * PE;
        }
      }
    }
  }
}

Now the Pareto k results are better than before, but there are still some very bad k values, and there is a warning: "Can't fit generalized Pareto distribution because all tail values are the same."
I don't understand why this warning appears. Could you please help me understand why it appears and how to solve this problem?

Thank you so much

Computed from 12000 by 4920 log-likelihood matrix

         Estimate   SE
elpd_loo  -1872.4 41.6
p_loo        51.6  2.4
looic      3744.8 83.2
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     4756  96.7%   4149      
 (0.5, 0.7]   (ok)          0   0.0%   <NA>      
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)  164   3.3%   6000      
See help('pareto-k-diagnostic') for details.

In theory yes, but in practice it gets very difficult because 1) the number of evaluation points in the quadrature grows exponentially with the number of dimensions, and in this case some adaptive quadrature is likely needed, so it's not easy to keep the number of quadrature points small; and 2) the implementation in Stan using nested 1D quadrature would be cumbersome.
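To make point 1 concrete: a tensor-product rule with n nodes per dimension needs n^d evaluations, as in this sketch (R):

library(statmod)
gh <- gauss.quad(30, kind = "hermite")
# 30^2 = 900 evaluation points already for d = 2; 27000 for d = 3
grid <- expand.grid(z1 = sqrt(2) * gh$nodes, z2 = sqrt(2) * gh$nodes)
w    <- as.vector(outer(gh$weights, gh$weights)) / pi  # weights for N(0, I_2)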

That is often more stable as you are leaving less data out, and if you don’t need leave-one-person-out, then go for it.

Look at the trials corresponding to the high Pareto k values and the warnings; it is likely that you will see that those trials are somehow special, or that some participants have only one trial.
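In R you can locate them with something like this (sketch; `loo_fit` and `log_lik_mat` stand for your loo object and the 12000 by 4920 log-likelihood matrix):

library(loo)
bad <- pareto_k_ids(loo_fit, threshold = 1)  # column indices with k > 1

# The "all tail values are the same" warning points at columns that are
# (nearly) constant across draws -- for example padded trials beyond
# Tsubj, whose log_lik is initialized to 0 and never overwritten.
constant <- which(apply(log_lik_mat, 2, function(x) all(x == x[1])))

# Map the flattened column indices back to (subject, block, trial)
# according to the order in which the log_lik array was extracted.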
