High Pareto k̂ values only for some datapoints for some subjects

I’m working on a task where participants make risky choices on a continuous scale (0 to 50) across 20 trials. My model includes standard risk-related parameters and allows them to drift over trials so each parameter has a baseline value and a corresponding drift term.

The issue I’m facing is with high Pareto k̂ values, but only for some subjects, and even then, only for a subset of their trials, sometimes just a single trial.

I’m also running around 15 variants of the same template model for model comparison purposes. This includes models that are intentionally less predictive, to establish a benchmark against the best-performing (winning) model.

I’ve tried mitigating this by:

  • Adding priors to all base and drift parameters,

  • Increasing randomness

  • And inspecting the problem trials more closely.

Still, I’m unsure how best to handle these local issues without discarding data. Any advice or insights would be greatly appreciated! I assume in this case it would not be acceptable to exclude outliers from data analysis? As this might be related to low number of trials I have per subjects which gives more weight to extreme trials. I would not want to remove these datapoints if I don’t have to as it might be related to exploration strategy by the participant, and there is very few data already (20 trials only per sub).

Thanks!

My stan code:

MODEL_TEMPLATE = “”"
data {{
int<lower=1> N; // number of observations
int<lower=1> B; // number of bet options
vector[N] xmult; // multiplier values
array[N] int<lower=0, upper=B-1> bet; // chosen bet (0-indexed to match your data)
vector[N] trial_scale; // scaled trial numbers for drift
}}

parameters {{
{params}
}}

transformed parameters {{
vector[N] theta = {theta_block};
vector[N] rho = {rho_block};
vector[N] beta = {beta_block};
{clamp}
}}

model {{
{priors}

// Likelihood
for (n in 1:N) {{
    vector[B] U;
    for (b in 0:(B-1)) {{  // 0-based loop to match data
        real k = fmax(50 - b, 1e-6);
        real g = fmax(0.5 * (xmult[n] * b), 1e-6);
        if (abs(rho[n] - 1.0) < 1e-5) {{
            U[b+1] = pow(k, theta[n]) * pow(g, 1 - theta[n]);  // Store in 1-based array
        }} else {{
            real temp = theta[n] * pow(k, rho[n]) + (1-theta[n]) * pow(g, rho[n]);
            U[b+1] = pow(fmax(temp, 1e-10), 1.0 / rho[n]);
        }}
    }}
    target += beta[n] * U[bet[n]+1] - log_sum_exp(beta[n] * U);  // bet[n] is 0-based, U is 1-based
}}

}}

generated quantities {{
vector[N] log_lik;
for (n in 1:N) {{
vector[B] U;
for (b in 0:(B-1)) {{ // 0-based loop to match data
real k = fmax(50 - b, 1e-6);
real g = fmax(0.5 * (xmult[n] * b), 1e-6);
if (abs(rho[n] - 1.0) < 1e-5) {{
U[b+1] = pow(k, theta[n]) * pow(g, 1 - theta[n]);
}} else {{
real temp = theta[n] * pow(k, rho[n]) + (1-theta[n]) * pow(g, rho[n]);
U[b+1] = pow(fmax(temp, 1e-10), 1.0 / rho[n]);
}}
}}
log_lik[n] = beta[n] * U[bet[n]+1] - log_sum_exp(beta[n] * U); // bet[n] is 0-based, U is 1-based
}}
}}
“”"

model_variants = [
{
“name”: “6_theta_rho_noise_drift_theta_drift_rho_drift_noise_corrected_2”,
“params”: “real<lower=0.01, upper=0.99> theta0;\n real drift_theta;\n real<lower=0.01, upper=0.99> rho0;\n real drift_rho;\n real<lower=0.01, upper=5> beta0;\n real drift_beta;”,
“theta_block”: “theta0 + drift_theta * trial_scale”,
“rho_block”: “rho0 + drift_rho * trial_scale”,
“beta_block”: “beta0 + drift_beta * trial_scale”,
“clamp”: “for (n in 1:N) {{ theta[n] = fmin(fmax(theta[n], 0.01), 0.99); rho[n] = fmin(fmax(rho[n], 0.01), 0.99); beta[n] = fmax(beta[n], 0.01); }}”,
“priors”: “theta0 ~ beta(2, 2);\n rho0 ~ beta(2, 2);\n beta0 ~ gamma(0.5, 2);\n drift_theta ~ normal(0, 0.09);\n drift_rho ~ normal(0, 0.09);\n drift_beta ~ normal(0, 1.0);”
},
]

Can you show the whole loo output? If p_loo is relatively high, it is possible that the model is well specified (no outliers) but flexible. See CV-FAQ 17. Based on your model description, it sounds like it is a flexible model. In this case, it would be best to improve computation. CV-FAQ 17 lists also ways to improve computation. With brms these would be very easy, but with model written directly in Stan, you need to do some additional work, but these loo package vignettes should help

CES_Model16_6_theta_rho_noise_drift_theta_drift_rho_drift_noise_2_0.05:
  Avg rank: 4.55 ± 2.81 (n=49)
  p_loo (effective params): 3.17 ± 1.34
  PSIS-LOO k̂: avg=0.581, worst=1.456
  PSIS reliability: 1/49 subjects (2.0%)
  LOO warnings: 11/49 subjects (22.4%)
  Overall reliability: 38 reliable, 9 caution, 2 unreliable

p_loo is small compared to the number of observations, and at least one of the khats is larger than 1, which indicates model misspecification, and data are more overdispersed than the predictive distributions.

Since you have only 49 subjects, you could probably do even brute force LOO-CV

Thank you! Excuse my lack of knowledge in this area, but how can I test the reliability of brute force LOO-CV? When I did it, the winning model(s) were compeltely different than winning models using PSIS LOO, leading to a completely different conclusion

Brute force is the thing that PSIS-LOO is trying to approximate, and in this sense it’s always reliable. With that said, the inference that you get from any model selection procedure, including brute-force leave-one-out cross validation, is only as good as the quality/appropriateness of the models over which it operates and the data over which the models are fit and evaluated. That is, just because brute force LOO selects a particular model doesn’t mean that model is any good (all candidate models could be bad), and poor-quality or uncharacteristic data can mislead you into preferring a model that wouldn’t work well for high-quality or typical data.

With brute force I referred to still using MCMC, and use of Monte Carlo is not necessarily reliable. Figure 2 in Implicitly adaptive importance sampling paper shows that brute force LOO-CV with rerunning MCMC can produce results far away from the exact analytical solution.

You did get some bad Pareto-k’s, so it is possible that the results are opposite.

It would help if you would also show the performance differences and associated SEs.