PSIS-LOO in hierarchical model - only mean log-likelihoods per group work

Hi, I am running a complex hierarchical model consisting of trialwise behavioural data across 21 participants, with a large number of total trials (about 15,000). When I run psis_loo on the trialwise log-likelihoods, almost all the pareto-k values are too high. However, in my case it makes more sense to assess the predictive accuracy for an out-of-sample participant, not an out-of-sample trial.
I have now tried running the psis_loo on the summed log-likelihoods across trials, for each participant, and that again produces very high pareto-k values. When I instead use the mean log-likelihoods across trials, all pareto-k values are good. Is the fact that the summed log-likelihoods are large causing this? Is it valid to use the mean this way?

I should add that I also have different numbers of trials per participant (though not by too much), so perhaps this throws things off when using the sum instead of the mean?

Thanks so much for the help!

1 Like

Marika, I am not sure about the sum vs mean issue. But I think use of log-likelihoods that involve random effects removes some theoretical guarantees about metrics like waic and psis-loo. This is because it leads to a situation where the number of parameters (random effects) increases with sample size (adding more participants).

I think you are on the right track to calculate a single log-likelihood value for each participant. I think the best thing to do is integrate out the participant random effects instead of sum/average across trials. But the integration might be very tricky depending on your model. Here is a paper describing these issues in more detail: https://arxiv.org/pdf/1802.04452

I missed your question earlier

Can you tell more about the model? e.g. share Stan code?

As you have only 21 participants, you could do 21-fold-CV. With that few participants the uncertainty in the CV estimate is likely to be underestimated (see [2008.10296v3] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison), and although CV results can be useful, you may need to do additional permutation tests.

no

If you want to have equal weighting for the participants, you would need to do the weighting in density scale and not in log-density scale

Absolutely, and thanks a lot for the input! I am trying to model data from an experiment in which participants saw two consecutive stimuli and made a decision about each, and then rated their confidence. The second decision as well as confidence incorporate information from both stimuli, and the weighting of the information from stimulus 1 is the parameter I care about (m). This weighting is captured separately for the second decision (m1) and for confidence (m2). The stimuli each lead to an internal signal that we cannot measure, so I essentially have two latent variables on every trial (r1 and r2), and the second decision (probability of choosing right, ‘probR’) as well as confidence both depend on these in complex ways. So, unfortunately, the whole thing has gotten quite messy and slow. I compute the probability of choosing “right” in the second decision with the integration.

Sorry if this is too much detail, I find it tricky to know what is relevant to include.

functions{

  real probR_density_R(real x,          // Function argument
                    real xc,         // Complement of function argument
                                     //  on the domain (defined later)
                    real[] theta,    // parameters
                    real[] x_r,      // data (real)
                    int[] x_i ) {     // data (integer)
    
    real m = theta[1];
    int L = x_i[1];
    int max_val = 24*L;
    int min_val = 1;
    real temp = theta[2];
    int range = (max_val - min_val+1)/2; 
    int mid_pt = min_val + range;
    int out;
    real s1;
    real s2;

    while(range > 0) {
      if(temp == mid_pt){
        out = mid_pt;
        range = 0;
      } else {
        // figure out if range == 1
        range =  (range+1)/2; 
        mid_pt = temp > mid_pt ? mid_pt + range: mid_pt - range; 
        }
    }

    s1 = x_r[out];
    s2 = x_r[(24*L)+out];

    return erf((x/m + s2)/sqrt(2)) * exp(-0.5*((x-s1))^2);
  }

  real probR_density_L(real x,          // Function argument
                    real xc,         // Complement of function argument
                                     //  on the domain (defined later)
                    real[] theta,    // parameters
                    real[] x_r,      // data (real)
                    int[] x_i ) {     // data (integer)
    real m = theta[1];
    int L = x_i[1];
    int max_val = 24*L;
    int min_val = 1;
    real temp = theta[2];
    int range = (max_val - min_val+1)/2; // We add 1 to make sure that truncation doesn't exclude a number
    int mid_pt = min_val + range;
    int out;
    real s1;
    real s2;

    while(range > 0) {
      if(temp == mid_pt){
        out = mid_pt;
        range = 0;
      } else {
        // figure out if range == 1
        range =  (range+1)/2; 
        mid_pt = temp > mid_pt ? mid_pt + range: mid_pt - range; 
        }
    }

    s1 = x_r[out];
    s2 = x_r[(24*L)+out];

    return erf((s2 - x/m)/sqrt(2)) * exp(-0.5*((x-s1))^2);
  }
   
  real partial_sum(int [] trials, int start, int end, int [] choice, real [] conf, vector probR, vector model_conf){
      return bernoulli_lpmf(choice[start:end] | probR[start:end]) + normal_lpdf(conf[start:end] | model_conf[start:end], 0.025); 
  }
}

data {
  int<lower=0> N; // total number of trials
  int<lower=1> L; // number of participants
  int<lower=0,upper=L> ll[N];
  int<lower=0,upper=L> intMap[24*L];
  real stim1[N]; // coherence of stimulus 1
  real stim2[N]; // coherence of stimulus 2
  int<lower=0,upper=1> choice1[N]; //participants' responses to decision 1
  int<lower=0,upper=1> choice[N]; // participants' responses to decision 2
  real<lower=0.5,upper=1> conf[N]; // participants' confidence ratings
  int trials[N];
  real stim1Int[24*L];
  real stim2Int[24*L];
  int levels[N];
  int<lower=1> grainsize;
}

transformed data{
  int x_i[1] = {L};
  real x_r[2*24*L] = append_array(stim1Int, stim2Int);
}

parameters {
  real<lower=0> m1_mu; // weighting of stimulus 1 in decision 
  real<lower=0> m2_mu; // weighting of stimulus 2 in confidence
  real<lower=0> b_mu; // confidence bias
  real<lower=0> m1_sd;
  real<lower=0> m2_sd;
  real<lower=0> b_sd;

  // define subject-level parameters
  real<lower=0> m1[L]; 
  real<lower=0> m2[L];
  real<lower=0> b[L];

  vector[N] r1; // internal representation for stimulus 1
  vector[N] r2; // internal representation for stimulus 2
}

transformed parameters{
  vector[N] probR;
  vector[N] model_conf; 

    vector[24*L] integrals_R;
    vector[24*L] integrals_L;
    real<lower=0,upper=1> confPrior;
    real<lower=0,upper=1> confR;
    real<lower=0,upper=1> confL;
    real<lower=0> m1_subj;
    real ind;
    real temp;

    for (i in 1:(24*L)){ 
      m1_subj = m1[intMap[i]];
      ind = i;
      integrals_R[i] = 0.5 + (1/(2*Phi(stim1Int[i])))   * (1/(sqrt(2*pi()))) *
          integrate_1d(probR_density_R, 0,positive_infinity(),{m1_subj, ind}, x_r , x_i, 1e-8); //
        
        integrals_L[i] = 0.5 + (1/(2*(1-Phi(stim1Int[i])))) * (1/(sqrt(2*pi()))) *
          integrate_1d(probR_density_L,
                                 negative_infinity(),
                                 0,
                                 {m1_subj, ind}, x_r, x_i, 1e-8); //
    }

    for (i in 1:N){
        if (choice1[i]==1){
          probR[i] = integrals_R[levels[i]];
        }else{
          probR[i] = integrals_L[levels[i]];
        }
        
        confPrior = Phi_approx(fabs(r1[i])/(b[ll[i]]*m2[ll[i]]));

        temp = r2[i]/(b[ll[i]]*sqrt(2));
        if (temp <= -4){
          temp = -4;
        }

        confR = (confPrior*(1+erf(temp)))/(1 + (2*confPrior-1)*(erf(temp)));
        confL = 1 - confR;

        if (choice[i]==1){
          model_conf[i] = confR;
        }else{
          model_conf[i] = confL;
        }
    }
}

model {
    // priors 
    m1_mu ~ lognormal(1,1); 
    m2_mu ~ lognormal(1,1); 
    b_mu ~ lognormal(1,1); 

    r1 ~ normal(stim1, 1);  
    r2 ~ normal(stim2, 1);

    m1_sd ~ lognormal(1,1); 
    m2_sd ~ lognormal(1,1);
    b_sd ~ lognormal(1,1);

    m1 ~ normal(m1_mu,m1_sd); 
    m2 ~ normal(m2_mu,m2_sd); 
    b ~ normal(b_mu,b_sd); 
    
    target += reduce_sum(partial_sum, trials, grainsize, choice, conf, probR, model_conf);
}

Ok, great thanks. I tried a first attempt at a Kfold-CV but leaving one trial out, which lead to very high variance. I haven’t tried LOGO-CV yet, so I will look into this more and give that a try.

Great, thanks, that makes sense, I will do that.

Ok, thanks a lot for the input! I have read this paper and it has been very helpful so far, but unfortunately you’re right that the integration was looking very nasty with my model. But perhaps I will have to give that another go.

This is perfect. The code helps to see the main issue, and it’s always exciting when people are constructing models that are beyond linear models.

As there are r1 and r2 for each trial, any importance sampling leave-one-trial-out-cv (or waic) is in trouble as you remove the only observation directly influencing these (unless the prior on these would be really narrow, but the prior scale is fixed to 1). Leave-one-person-out-cv is also challenging as you leave out all observations related to many r1 and r2, and to individual specific parameter. K-fold-cv is still feasible. See [2008.10296v3] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison and Gaël Varoquaux. Cross-validation failure: Small sample sizes lead to large error bars. NeuroImage, 180:68 – 77, 2018. and Gaël Varoquaux, Pradeep Reddy Raamana, Denis A. Engemann, Andrés Hoyos-Idrobo, Yannick Schwartz, and Bertrand Thirion. Assessing and tuning brain decoders: Crossvalidation, caveats, and guidelines. NeuroImage, 145:166 – 179, 2017. for discussion of issues with small number of folds (ie 21 in your case)

Your model is very flexible (overparameterized) with vague priors, and you have only 21 subjects, so I’m not surprised. Do you really need both r1 and r2, and do you really need to have a separate value for each trial, and would there be some more informative prior? Bernoulli model is providing less information than continuous value observations, and you try to estimate two parameters (r1 and r2) with less than one unit information. If you don’t have any information that would tell how similar different trials are, you could at least change

    r1 ~ normal(stim1, 1);  
    r2 ~ normal(stim2, 1);

to have a scalar scale parameters instead of 1, so that there is some chance of shrinking r1 and r2 towards stim1 and stim2.

r1 and r2 reflect internal perceptual signals caused by the stimuli on every trial but with added internal processing noise, so they are really both included on every trial for conceptual reasons. There is definitely some potential to constrain at least r1 with an additional choice outcome that I am not currently using in the model. It would add this line: choice1 ~ bernoulli(Phi(r1)). I gave that an initial try and it doesn’t help things out much though. I also think I can use some other data from a control task for some more informative priors for r1 and r2, or if that doesn’t work I will try a scalar scale parameter.

But, actually, I also tried out just running it without any r1 and r2 at all, just using stim1 and stim2 as the internal signals, and actually it works really nicely. It runs way faster and fixes the treedepth issue I was having before (I guess unsurprisingly given how many parameters that removes). And, more surprisingly, after playing around with some parameter recovery from simulated data, it also seems to recover my other parameters well regardless. So, maybe it will be justifiable to remove them. The leave-one-trial-out CV still has several Pareto k values above 0.7, but looks way better than before. I will give 21-fold LOGO CV a try from this simpler model and see how that goes.

Thanks a lot for the input! I had a sense that r1 and r2 were the culprits, but I have a much clearer grasp of exactly why they cause these issues now.

1 Like

That is great, thanks! I will look into these before trying out a 21-fold CV, and hopefully if I can simplify my model by getting rid of r1 and r2, it will work better.

1 Like