I have extracted the pareto statistics and plotted my data with the high-pareto points highlights. I found this quite informative and I can see that they are outliers and as such will lead to problems.

Is there an agreed-upon way to proceed?

I have seen some talk of moment matching, although that appears to be a bit fiddly to implement when using cmdstanr?

Is it valid to remove these points as outliers and refit the model? This seems questionable. I am hoping to use LOO model comparison on three different models and I assume they will all have slightly different high pareto points.

I could try improving my model, but I don’t really have any good ideas about how to proceed in this case. Perhaps using a distribution with heavier tails might make these outlier points less influential. Or a mixture model that allows some small % chance of an outlier to be generated (this feels like a bit increase in complexity!)

Have you seen Cross-validation FAQ? Happy to get feedback whether you think it’s a relatively jargon-full or -free.

Agreed-uopn way is to proceed with care

Only if you can verify that the values for these observations are clearly misrecorded and you think there is no need to model the process that sometimes causes erranous values. For example, if you would have data with human ages above 120 years, or human heights over 3m, you probably don’t have a good model for the process causing these or it is not interesting to model those.

If the values are rare but feasible, then it would be better to include them, and either improve the model or improve the LOO-CV computation.

It would help if you would show the complete LOO output, as you have now removed the information on how many posterior draws did you use for computation (and some other useful information). The new diagnostic recommendations in Pareto Smoothed Importance Sampling paper and implemented in the latest loo package version (and you have the older version now) don’t have anymore that ok range, and the threshold for bad depends on the number of posterior draws. If you did you have more than 2200 posterior draws, then the threshold is 0.7. If the 4 khat values are just slighlty larger than 0.7, you can try also sampling more posterior draws as the khat’s are estimated from finite number of draws, and the estimates can be just by chance over the threshold. Also if the elpd_loo doesn’t change when you increase the number of posterior draws, it is more likely that the estimates are stable. If the 4 khat values above 0.7 are close to 1, you need to worry more when comparing models with small difference in elpd_loo. If you show the results from the model comparison, I can also comment on what I think is small or big difference (discussed also in CV-FAQ).

If you tell what your current model and data are, we might be able to provide suggestions.

There is a reason to worry if those .5% values are close to 1. If those .5% values are close to 0.7 it is possible that the high values are just due to Monte Carlo error and it’s more likely that improved computation doesn’t change anything. There can’t be a simple rule and threshold for khat and percentage of bad values, and some care and thought is needed.

CV-FAQ and references mentioned there are the best information I know. I’m happy to get pointers to possible papers I have missed there.

I think the issue of “too technical” depends very much on the target audience. I find myself often in a limbo land - my background is in mathematics, but my research is in psychology and most psychologists aren’t very technical. I am currently trying to tackle some reviewer comments about a methods paper that I’ve written. I really want to avoid over complicating the paper, as that will confuse the main message.

The main message: many experiments in psychology make use of repeated trials over time. Most people assume these are independent repetitions, but this is clearly not generally the case. If we use a relatively simple asymptotic regression model, we can provide a better overview of the data. Depending on the research question, the model’s estimated asymptote may well be a more theoretically interesting quantity than the empirical mean.

I have fit these models using Stan, but I assume I could use a frequentist method too.

The current version of my model is:

// multi-level asymptotic regression
functions{
real compute_mu_asym(
real t, real t0, int kf, int kr, int ll,
real a, real b, real r, matrix u) {
// this function computes mu.
real mu;
// used to store fixed + random effects
real A, B, Rho;
A = a + u[3*(kr-1) + 1, ll];
B = b + u[3*(kr-1) + 2, ll];
Rho = inv_logit(r + u[3*(kr-1) + 3, ll]);
mu = A - (A - B) .* (1-Rho)^(t./t0);
return(mu);
}
}
data {
int<lower=0> N; // total number of data points
array[N] real t; // trial number
array[N] real y; // dep. var.
// condition information
int<lower = 1> Kf; // number of conditions (fixed)
array[N] int<lower = 1, upper = Kf> Xf; // condition info (fixed)
int<lower = 1> Kr; // number of conditions (random)
array[N] int<lower = 1, upper = Kr> Xr; // condition info (random)
int<lower = 0> J; // number of participants
array[N] int<lower = 1, upper = J> L; // participant id
// now input priors
real pr_a_mu;
real<lower=0> pr_a_sd;
real pr_b_mu;
real<lower=0> pr_b_sd;
real<lower=0> pr_r_sd;
real<lower=0> pr_sig_lambda; // sd of residual variance
real<lower=0> pr_sig_u_lambda; // sd of random effect variance
// hyper param
real<lower=0> t0;
}
parameters {
// asymptote
array[Kf] real a;
// intial value
array[Kf] real b;
// rate of change
array[Kf] real r;
// standard deviation
real<lower = 0> sigma;
// random effects
vector<lower=0>[3*Kr] sigma_u;
// declare L_u to be the Choleski factor of a 3*Kx3*K correlation matrix
cholesky_factor_corr[3*Kr] L_u;
// random effect matrix
matrix[3*Kr,J] z_u;
}
transformed parameters{
// fixed effect of rate of change (not actually used in model)
array[Kf] real rho;
rho = inv_logit(r);
// this transform random effects so that they have the correlation
// matrix specified by the correlation matrix above
matrix[3*Kr, J] u;
u = diag_pre_multiply(sigma_u, L_u) * z_u;
}
model {
// define priors for fixed effects
a ~ normal(pr_a_mu, pr_a_sd);
b ~ normal(pr_b_mu, pr_b_sd);
r ~ normal(0, pr_r_sd);
// priors for random effects
sigma_u ~ exponential(pr_sig_u_lambda);
L_u ~ lkj_corr_cholesky(1.5); // LKJ prior for the correlation matrix
to_vector(z_u) ~ normal(0,2);
// residuals
sigma ~ exponential(pr_sig_lambda);
/////////////////////////
// fit model
/////////////////////////
array[N] real mu;
for (ii in 1:N) {
mu[ii] = compute_mu_asym(
t[ii], t0, Xf[ii], Xr[ii], L[ii],
a[Xf[ii]], b[Xf[ii]], r[Xf[ii]], u);
}
y ~ normal(mu, sigma);
}
generated quantities {
real pr_a = normal_rng(pr_a_mu, pr_a_sd);
real pr_b = normal_rng(pr_b_mu, pr_b_sd);
real pr_r = normal_rng(0, pr_r_sd);
vector[N] log_lik;
for (ii in 1:N) {
real mu;
mu = compute_mu_asym(
t[ii], t0, Xf[ii], Xr[ii], L[ii],
a[Xf[ii]], b[Xf[ii]], r[Xf[ii]], u);
log_lik[ii] = normal_lpdf(y[ii] | mu, sigma);
}
}

I have updated the loo package, and my output now looks like:

Computed from 4000 by 12511 log-likelihood matrix.
Estimate SE
elpd_loo -2112.7 105.9
p_loo 308.2 8.6
looic 4225.4 211.7
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.2, 1.5]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 12507 100.0% 69
(0.7, 1] (bad) 4 0.0% <NA>
(1, Inf) (very bad) 0 0.0% <NA>

The four values listed as bad are 0.824 0.857 0.706 0.789

In one of the examples I am looking at these four points are all the initial trial. I can understand the intuition as to why they are problematic for estimating loo… if we removed them from the data set then the estimate of the “initial value” parameter (b) would likely change quite a bit.

(in the plot, each column is data from a different person. I have picked out the 4 problematic points in blue. Note that there are 47 participants in total in the dataset,

Do you know of any good examples of published papers in which these methods are being used? It would be valuable to see how other researchers are communicating all these methods and details to their communities.

I’m happy to share data and more of my processing code if that would be helpful.

If possible, I would love to make this paper a good example of “best practice” [at least, as much as my abilities will allow me!

I’m hoping to get this project wrapped up soon and everything appears to be so tantelising close to working!

From reading up on this some more, i could try moment-matching (I’ve had not had luck getting this to work with cmdstanr, but that is likely due to my own incompetance!).

Another option is to do 10-fold validation?

Thanks

(I am always impressed with how helpful this community is)

.706 could by chance due to Monte Carlo variation, others less likely

That’s a good analysis of the situation.

And nice plot. Based on these, if it’s not a problem that the first trial looks quite different then several following ones and that feature is not modelled well then, I think it is fine to continue with these 4, unless you are worried that they influence elpd, too much.

Can you clarify what you mean by “these methods”? I know many papers using PSIS-LOO, many mentioning Pareto k diagnostics, many using moment matching, reloo, or K-fold-CV to verify that high k’s are not a problem, but I don’t know papers discussing why proceeding with a few high k values is ok. Your explanation and plot would be excellent in a paper for justifying why these few high k values are not a problem, and that these all are initial trials which seem to behave differently anyway.

I think you are on a good track, and if you like I can have a quick look and provide comments before you submit.

Sorry to keep you in suspense while being busy with other things. I hope my answer didn’t come too late

Third option would be to remove these 4 observations, fit with everything else, and then compute the log predictive density for these 4 left out observations, and if the values would match with PSIS-LOO, then you would not need to worry. I think this would be probably the simplest option. And you could still tell about these 4 in the same way as above. See Holdout validation and K-fold cross-validation of Stan programs with the loo package • loo

I will certainly send a copy of the paper to you to have a quick look over before we resubmit.

A little bit of background: Our paper is a methods paper aiming at a mainstream (cognitive) psychology audience. The point we are making isn’t particularly novel: if you carry out an experiment with multiple trials per person, then these trials aren’t really independent. In a lot of cases, people’s performance improves over time. If we model this improvement (using a simple asymptotic regression method), we should end up with more meaningful measurements compared to just taking the mean.

The peer reviews have been overall very positive. The main thing I need to add on the modelling side is model comparison as the reviewers would like some quantitative stats to back out our qualitative claims.

I was hoping that LOO would “just work” and I fear that a long diversion into technical matters will distract from the central message of the paper. Most psychologists still stick to ANOVA + p-values and will be lost if there is a lengthy section on psis, pareto k statistics, etc (and, as I’m sure it is clear, my own understanding of some of these issues is a little murky!)