LOO for a hierarchical model with trial-by-trial dependencies

Hi everyone,

I’m using Stan to fit a reinforcement learning (RL) model to behavioral data, where each participant is modeled hierarchically with subject-level parameters. I want to compare different models based on their goodness-of-fit.

However, in RL tasks, choices on a given trial depend on choices and outcomes from previous trials. As I understand it, this violates the assumption of conditional independence required for computing pointwise log-likelihoods at the trial level for PSIS-LOO.

  1. Is it inappropriate to use loo() on a log-likelihood matrix defined per trial in such RL models?
  2. What would be a better alternative for model comparison in this context? I saw a suggestion to use the sum of log-lik per participant (which keeps only the dependence due to the hierarchy), but that leads to high Pareto k values.

@avehtari has thought more about this particular type of situation than I have, so he may have a better answer, but it sounds like you could potentially use either leave-future-out CV (LFO-CV), which is specifically designed for time series, or K-fold CV if you define the folds in a way that respects the structure of your data (e.g., leave one participant out, although this could be costly if you have a lot of participants). We have vignettes in the loo package for both of these topics and a paper on LFO-CV.

These methods take a bit more work to implement than just calling the loo() function.

Yeah, this can sometimes lead to high Pareto k values: leaving out all of a participant's data can have a larger impact on the posterior distribution than leaving out just one of that participant's data points, especially if that participant's data is particularly informative or different from the rest. In that case the importance sampling can be unreliable.


Conditional independence is not required. See CV-FAQ 7, "When is cross-validation valid?" See also CV-FAQ 9, "Can cross-validation be used for time series?"

@jonah suggested LFO-CV, but as mentioned in CV-FAQ 9, "Can cross-validation be used for time series?", K-fold-CV with a joint log score can be better for model comparison (Cooper et al., 2023). Also, I would assume that in an RL task the time series is not assumed to continue in the same way forever, which means LFO-CV cannot provide a low-bias estimate of future predictive performance either.

It’s not inappropriate and can be sufficient, but other methods may be more efficient in model selection or have smaller bias for the predictive performance.

It depends on your goals:

  • You want to do model selection and care about the predictive accuracy in time for these same participants: K-fold-CV with joint log score (Cooper et al., 2023)
  • You care about the predictive performance for new participants: leave-one-participant-out, which you did try.

The sum of log-lik per participant is correct in theory, but if you get high Pareto k values, it means that it is very difficult to predict the whole set of trials for a new participant. It's easiest to do this by brute force with leave-one-participant-out cross-validation, or, if you have many participants, with K-fold-CV where in each fold you leave out more than one participant. The loo package has helper functions for forming the folds (Helper functions for K-fold cross-validation — kfold-helpers • loo), and there is a vignette for K-fold-CV with Stan (Holdout validation and K-fold cross-validation of Stan programs with the loo package • loo).
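In either case, the quantity you need is the joint (summed over trials) log-likelihood of each participant's data. A minimal sketch of that in generated quantities, assuming a Bernoulli-logit observation model as an example (Nsubjects, Ntrials, subject_of_trial, y, and eta are placeholder names, since I haven't seen your model):

generated quantities {
  // joint log-likelihood per participant: sum the trial-level terms
  // over each participant's trials (all names here are placeholders)
  vector[Nsubjects] log_lik = rep_vector(0, Nsubjects);
  for (t in 1:Ntrials) {
    log_lik[subject_of_trial[t]] += bernoulli_logit_lpmf(y[t] | eta[t]);
  }
}

This gives one log_lik column per participant rather than per trial, which is the participant-level quantity that both participant-level PSIS-LOO and the joint log score in K-fold-CV are based on.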


Good point! Thanks Aki.

Hi Aki,

Thank you!

I have 79 participants, and I’m considering a 10-fold cross-validation. My question then becomes how to properly account for the hierarchy. Currently, each participant has their own parameters using a non-centered parameterization.

When calculating log_lik on the “test set” using generated quantities, should the test participants receive the population-level parameters? Or is there a different approach you’d recommend for maintaining the hierarchical structure?

Sounds fine

Population-level parameters plus the participant-specific parameters drawn from the prior. Think of predicting for a new participant: as you don't know their parameters, you draw them from the prior. If you share your model code, I can be more specific.
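As a rough sketch of that idea (I haven't seen your code yet, so mu_alpha, sigma_alpha, mu_beta, sigma_beta, the logit-normal learning rate, and the normal inverse temperature are just placeholder assumptions):

generated quantities {
  // draw parameters for an unseen participant from the population
  // distribution (placeholder names and distributional assumptions)
  real alpha_new = inv_logit(normal_rng(mu_alpha, sigma_alpha));
  real beta_new = normal_rng(mu_beta, sigma_beta);
  // ... then compute log_lik for the test participant's trials using
  // alpha_new and beta_new instead of that participant's fitted parameters
}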


Hey Aki,

Thanks again.

I wondered whether exchangeability is necessary for me to calculate loo. I may be misinterpreting the definition from the FAQ, but it seems like my standard RL model is non-exchangeable, as the order of trials affects the Q-value, which is used as a predictor for choice. That is, if I shuffle the order of the input vectors to Stan (such as rewards and choices), I will get different Q-values. The prediction in the model block is choice~delta_Q_value, where the Q-value is calculated for each trial in the transformed parameters block.

I attach my Stan code for reference.

Thanks a lot!

data {
  
  int<lower=1> Ndata;                // Total number of trials (for all subjects)
  int<lower=1> Nsubjects; //number of subjects

  int<lower=2> Narms; //number of overall alternatives

  array [Ndata] int first_trial; // Which subject performed each trial
  array [Ndata] int<lower=1, upper=Nsubjects> subject_trial; // Which subject performed each trial

  //Behavioral data:

  array[Ndata] int<lower=0> ch_card; //index of which card was chosen coded 1 to 4

  array[Ndata] int<lower=0> reward; //outcome of bandit arm pull

  array[Ndata] int<lower=0> card_left; //offered card in left bandit (2 out of 4 are offered on each trial)

  array[Ndata] int<lower=0> card_right; //offered card in right bandit

  array [Ndata] int <lower=0,upper=1> first_trial_in_block; // binary indicator

  array[Ndata] int<lower=0> selected_offer; // which option was chosen ( 0 for left, 1 for right)

}



parameters {
    // Group-level (population) parameters
  real mu_alpha;        // Mean learning rate across subjects
  real mu_beta;          // Mean inverse temperature across subjects
  
  // Group-level standard deviations (for subject-level variability)
  real<lower=0> sigma_alpha;          // Variability in learning rate
  real<lower=0> sigma_beta;          // Variability in inverse temperature
  
// Non-centered parameters (random effects in standard normal space)
  vector[Nsubjects] alpha_raw;
  vector[Nsubjects] beta_raw;

}


transformed parameters {
  vector<lower=0,upper=1>[Nsubjects] alpha_sbj; // learning rate
  vector[Nsubjects] beta_sbj; // Threshold
  
 alpha_sbj = inv_logit(mu_alpha +sigma_alpha* alpha_raw);
 beta_sbj = mu_beta +sigma_beta* beta_raw;

  real alpha_t;
	real beta_t;					 
  real PE_card;
  vector [Ndata] Q_cards_diff;
  vector [Narms] Q_cards;


    for (t in 1:Ndata) {
    alpha_t = alpha_sbj[subject_trial[t]];
		beta_t = beta_sbj[subject_trial[t]];

  if (first_trial_in_block[t] == 1) {
      Q_cards=rep_vector(0.5, Narms);
    }

        Q_cards_diff[t]=beta_t*(Q_cards[card_right[t]]-Q_cards[card_left[t]]);
 //calculating PEs
 PE_card =reward[t] - Q_cards[ch_card[t]];
 
//Update values ch_key=1 means choosing right
 Q_cards[ch_card[t]] += alpha_t * PE_card; //update card_value according to reward
}
}

model {
  
  // Priors for group-level parameters
  mu_alpha ~ normal(0, 3);
  mu_beta ~ normal(0, 3);
  
  // Priors for group-level standard deviations
  sigma_alpha ~ normal(0, 2);
  sigma_beta ~ normal(0, 2);
  
  // Priors for subject-specific effect
  alpha_raw~normal(0,1);
  beta_raw~normal(0,1);
     target+= bernoulli_logit_lpmf(selected_offer| Q_cards_diff);

}

generated quantities {
  vector[Ndata] log_lik;
   for (t in 1:Ndata) {
     log_lik[t]= bernoulli_logit_lpmf(selected_offer[t] | Q_cards_diff[t]);
  }
}



I wanted to take the time to go through a full program and provide comments on naming and commenting style. Hope this isn’t too presumptuous on my part.

My topmost comment is that I think you'll find your code easier to read with fewer comments, fewer vertical spaces, and consistently aligned indentation; otherwise it's too hard to see where scopes and loops and conditionals end. Don't use tabs, as you never know how they'll render.

It's usually better to choose an informative variable name than to pick an uninformative one and then document it. Ndata is too generic here, as there is a lot of data in the data block, hence the need for the comment.

int<lower=1> Ndata;                // Total number of trials (for all subjects)

Instead, I would suggest removing the doc and renaming the variable for clarity.

int<lower=0> num_trials; 

Also, I switched the lower bound to zero. Usually you want to allow zero data, because that lets you do prior predictive inference by supplying empty data without changing the model.

I’d suggest putting <lower=1, upper=num_subjects> on first_trial and renaming it if the doc is right that it’s which subject performed each trial. It helps document the range of values so you don’t need to do that in the doc itself.
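For example, if it really is the subject index for each trial, something like this (using the num_trials and num_subjects names suggested above, with a placeholder name for the variable itself):

array[num_trials] int<lower=1, upper=num_subjects> trial_subject;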

  array[Ndata] int<lower=0> ch_card; //index of which card was chosen coded 1 to 4

If this has to take values 1 to 4 then it should be <lower=1, upper=4>. Then you don’t need the doc—it’s built into the declaration. If you name the variable chosen_card you can get rid of the rest of the doc.
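That is, something along the lines of (using the num_trials and chosen_card names suggested above):

array[num_trials] int<lower=1, upper=4> chosen_card;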

The variables subject_trial and first_trial have the same doc, so presumably one is wrong, or more is needed to distinguish them.

If you really feel that mu_alpha is too confusing, call it mean_learning_rate. I don't think it really is the mean learning rate across subjects, though; I think it's the population-level location parameter, which doesn't have to work out to the mean across subjects, though it sometimes will in simple cases.

The parameters you document as "Non-centered parameters" are in fact the raw parameters, not the non-centered ones. The non-centered ones are alpha_sbj and beta_sbj. You can do this all more easily by declaring

parameters { 
  vector<offset=mu_beta, multiplier=sigma_beta>[Nsubjects] beta;
  ...
model {
  beta ~ normal(mu_beta, sigma_beta);

I replaced beta_sbj with just beta, but that should be called something meaningful if beta isn’t well known as a conventional name in this model.
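A slightly fuller sketch of that pattern, applied to both subject-level parameters and keeping the learning rate on the logit scale (this is my reading of the model, so treat it as a sketch rather than a drop-in replacement; alpha_logit is a name I made up):

parameters {
  real mu_alpha;
  real mu_beta;
  real<lower=0> sigma_alpha;
  real<lower=0> sigma_beta;
  // subject-level parameters declared with an affine transform, so the
  // sampler works on the standardized scale
  vector<offset=mu_alpha, multiplier=sigma_alpha>[Nsubjects] alpha_logit;
  vector<offset=mu_beta, multiplier=sigma_beta>[Nsubjects] beta;
}
transformed parameters {
  vector<lower=0, upper=1>[Nsubjects] alpha = inv_logit(alpha_logit);
}
model {
  alpha_logit ~ normal(mu_alpha, sigma_alpha);
  beta ~ normal(mu_beta, sigma_beta);
  // ... rest of the model as before
}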

I found the setting and unsetting of Q_cards quite confusing. This would be much cleaner if you stored the starting index of each block, then iterated through the blocks. Then you could just initialize at the top of the loop.

Speaking of initialization, it's best to define that rep_vector once in the transformed data block and reuse it.
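Something like the following sketch, assuming a hypothetical block_start array supplied as data (with block_start[num_blocks + 1] = Ndata + 1 as a sentinel):

transformed data {
  vector[Narms] Q_init = rep_vector(0.5, Narms);  // defined once, reused
}
...
// in transformed parameters (or the model block):
for (b in 1:num_blocks) {
  vector[Narms] Q_cards = Q_init;  // initialize at the top of each block
  for (t in block_start[b]:(block_start[b + 1] - 1)) {
    // ... per-trial Q_cards_diff, PE, and value update as before
  }
}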

This is the kind of comment I’d suggest just eliminating as it’s just saying what the code is doing:

//calculating PEs
 PE_card =reward[t] - Q_cards[ch_card[t]];

Similarly, the comments on the priors in the model block aren't telling us anything we can't derive from looking at the variables. Here they're correctly named as "group-level parameters".


Exchangeability is about the data or the parameters. Most of the time we have additional information such that we need to use partial or conditional exchangeability (see Section 5.2 in BDA3).

For example, in the case of time series data, we might have data (y_1, \ldots, y_T), in which case the observations are not exchangeable, as the subscript labels (1, \ldots, T) carry additional information. But if we use a model

\begin{aligned}
y_1 &\sim \mathrm{normal}(\mu_0, \sigma_0)\\
y_t &= f(y_1, \ldots, y_{t-1}, t) + \epsilon_t\\
\epsilon_t &\sim \mathrm{normal}(0, \sigma_\epsilon),
\end{aligned}

then the residuals \epsilon_t are exchangeable, and LOO-CV is valid and focuses specifically on the residual part of the model (which will naturally be affected by the goodness of the rest of the model, too). Most models can be written in a form in which exchangeability comes up, although not necessarily in a factorized form as here. A factorized form makes it easy to compute the necessary log_lik values. See more about LOO for non-factorized models in Efficient leave-one-out cross-validation for Bayesian non-factorized normal and Student-t models | Computational Statistics.

I'm not claiming that your data and model are such that you would easily get a factorized form for the likelihood; I started by clarifying that we don't require a conditional independence assumption (although, assuming partial or conditional exchangeability, we can often proceed as if we had conditional independence, see BDA3 Section 5.2, and we can proceed even if we do take a dependency into account).

So the question is whether you assume the pairs (selected_offer[t], Q_cards_diff[t]) to be exchangeable, and if not, what information you should include to make them exchangeable.