loo_compare in the presence of high pareto-k values

A proposal for comparing models despite some high pareto-k values.

We may be dealing with large datasets and complex models for which there is no simple way to eliminate high pareto-k values. In my work a typical run-time is at least 24 hours, so reloo is not feasible, nor is it practical to greatly extend the sampling in the hope that all pareto-k < 0.7. loo_moment_match looks promising, but it still has problems which the experts are looking at. These issues can arise even with small datasets and relatively simple models, as in the beta-binomial example reproduced in the code below.
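For reference, moment matching can be attempted from an existing fit without refitting the model; a minimal sketch using the brms interface, assuming a fitted brmsfit called fit and its loo object loo1:

#sketch: moment matching applied to an existing brms fit (no refit of the model itself)
loo1_mm <- loo_moment_match(fit, loo = loo1)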

High values of pareto-k can also serve as a diagnostic for outliers. I find that the pareto-k outliers are similar to the points flagged in mgcv by cooks.distance > T for some threshold T.
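As a rough sketch of that check (fit_g is a hypothetical mgcv gam fit and thr a chosen threshold; gam objects inherit from glm/lm, so cooks.distance applies):

#sketch: flag influential points in mgcv, analogous to high pareto-k
outl_cd <- which(cooks.distance(fit_g) > thr)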

So currently we may not want, or be able, to eliminate all instances of pareto-k > 0.7.

But I wonder if loo_compare can still be used in the presence of such points. The ELPD difference and its SE are both obtained from the pointwise values in loo$pointwise. So, would the following strategy, made explicit in the code further below, make sense?

If M1 and M2 are models fitted to the same data D, with rows indexed by id, put loo1 <- loo(M1) and loo2 <- loo(M2). Put parx1 <- which(loo1$diagnostics$pareto_k > 0.7) and parx2 <- which(loo2$diagnostics$pareto_k > 0.7). If M1 and M2 are plausible models, both parx1 and parx2 will be small subsets of id; they may overlap or even coincide. Put parx <- union(parx1, parx2). Then all the problem points occur in the data rows id[parx].

Form LC <- loo_compare(loo1, loo2) as usual.
Now use the pointwise data to compute LC_xparx from id[-parx], i.e. the same comparison with the problem rows excluded.

If both LC and LC_xparx show M1 significantly better than M2, I would like to accept that M1 is better, despite the presence of some points with pareto-k > 0.7.
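In compact form, the excluded-subset comparison is just the loo_compare arithmetic applied to the retained rows; a sketch using the elpd_loo column of the pointwise matrices:

d <- loo2$pointwise[-parx, "elpd_loo"] - loo1$pointwise[-parx, "elpd_loo"]
elpd_diff_xparx <- sum(d)                  #global difference over id[-parx]
se_diff_xparx <- sqrt(length(d)) * sd(d)   #se = N^0.5 * sd of pointwise diffs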

#first model is as in
#https://discourse.mc-stan.org/t/loo-moment-match-failure-beta-binomial-example/35510

library(extraDistr)
library(brms)
library(cmdstanr)
options(mc.cores = 2, brms.normalize = TRUE, brms.backend = "cmdstanr")
library(loo)

set.seed(4)
N <- 1000
x <- runif(N)
fac1 <- factor(sample(LETTERS[1:20], N, replace = TRUE))   #20 groups
ifac1 <- as.integer(fac1)
g <- factor(sample(c("South", "West", "North", "East"), 20, replace = TRUE))
fac2 <- g[ifac1]   #region of each group, so fac1 is nested in fac2
ifac2 <- as.integer(fac2)

rn1 <- 0.5 * rnorm(20)   #group effects
rn2 <- rnorm(4)          #region effects
lp <- -0.5 + sin(12 * x) * exp(-3 * x) + rn1[ifac1] + rn2[ifac2]
plot(x, lp, col = ifac1, pch = ifac2)
h <- function(t) (1 + exp(-t))^-1   #inverse logit
mu <- h(lp)
phi <- 50    #beta-binomial precision
T <- 1000    #trials per observation
alpha <- mu * phi
beta <- (1 - mu) * phi
y <- rbbinom(N, T, alpha = alpha, beta = beta)
plot(x, y / T)
#now alter some y values to create outliers: take the three largest y/T
oyt <- order(y / T)
wh <- oyt[998:1000]
x[wh]      #inspect the rows to be altered
y[wh]
y[wh] <- 30
fac1[wh]
fac2[wh]
plot(x, y / T)
points(x[wh], y[wh] / T, pch = 16, col = 2)   #mark the altered points

bfE1 <- bf(y | trials(T) ~ -1 + (1 | fac2/fac1) + s(x), family = "beta_binomial")
datE1 <- data.frame(x, y, T, fac1, fac2)
dim(datE1)
mE1 <- brm(bfE1, data = datE1, file = "bbtest_E1")

bfE2 <- bf(y | trials(T) ~ -1 + (1 | fac2/fac1), family = "beta_binomial")
datE2 <- datE1
mE2 <- brm(bfE2, data = datE2, file = "bbtest_E2")

looE1 <- loo(mE1)
#three high pareto-k values
plot(looE1)
parx1 <- which(looE1$diagnostics$pareto_k > 0.7)

looE2 <- loo(mE2)
#no high values, some just below the 0.7 threshold
plot(looE2)
parx2 <- which(looE2$diagnostics$pareto_k > 0.7)
#empty

parx1
parx2
parx <- union(parx1, parx2)
#here parx == parx1
L <- length(parx)

LC <- loo_compare(looE1, looE2)
print(LC, digits = 4)
#check elpd_diff and its se by hand from the pointwise values
sum(looE2$pointwise[, 1] - looE1$pointwise[, 1])
sqrt(1000) * sd(looE2$pointwise[, 1] - looE1$pointwise[, 1])

id <- seq_len(nrow(datE1))
datE1[parx, ]   #the problem rows

#rebuild the comparison table from the pointwise values, excluding rows parx
#pointwise columns: 1 = elpd_loo, 3 = p_loo, 4 = looic
#row 1 of LC is mE1 (the better model), row 2 is mE2
LC_xparx <- LC
n <- 1000 - L
LC_xparx[1, 3] <- sum(looE1$pointwise[-parx, 1])
LC_xparx[1, 4] <- sqrt(n) * sd(looE1$pointwise[-parx, 1])
LC_xparx[1, 5] <- sum(looE1$pointwise[-parx, 3])
LC_xparx[1, 6] <- sqrt(n) * sd(looE1$pointwise[-parx, 3])
LC_xparx[1, 7] <- sum(looE1$pointwise[-parx, 4])
LC_xparx[1, 8] <- sqrt(n) * sd(looE1$pointwise[-parx, 4])

LC_xparx[2, 1] <- sum(looE2$pointwise[-parx, 1] - looE1$pointwise[-parx, 1])
LC_xparx[2, 2] <- sqrt(n) * sd(looE2$pointwise[-parx, 1] - looE1$pointwise[-parx, 1])
LC_xparx[2, 3] <- sum(looE2$pointwise[-parx, 1])
LC_xparx[2, 4] <- sqrt(n) * sd(looE2$pointwise[-parx, 1])
LC_xparx[2, 5] <- sum(looE2$pointwise[-parx, 3])
LC_xparx[2, 6] <- sqrt(n) * sd(looE2$pointwise[-parx, 3])
LC_xparx[2, 7] <- sum(looE2$pointwise[-parx, 4])
LC_xparx[2, 8] <- sqrt(n) * sd(looE2$pointwise[-parx, 4])

print(LC)
print(LC_xparx)

print(LC,simplify=FALSE)
print(LC_xparx,simplify=FALSE)

#both models fit better over id[-parx], and the superiority of mE1 is clearer there;
#but mE1 is also better over the full data.

Operating System: linux 5.15.0-91-generic Linux Mint 21.2 Victoria
brms Version: 2.21.0
loo Version: 2.7.0.9000

Yes. And if you think those outliers are not interesting to model, you can drop them from the data. For example, if when investigating the outliers it is obvious they are wrongly recorded, and you don't want to model the process leading to the wrong records, just drop them and report, with justification, what was dropped. If you keep the outliers in the data, they may influence the other conclusions even though it's known that they are not well modelled, so it's not clear why you would keep them and then make the model choice decision while ignoring their pointwise elpd values.
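Dropping and refitting is straightforward with brms; a sketch re-using parx and datE1 from the code above (update() re-uses the compiled model):

datE1_clean <- datE1[-parx, ]                     #drop the flagged rows
mE1_clean <- update(mE1, newdata = datE1_clean)   #refit on the cleaned data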

Thanks. The problem is that in my real dataset and model, I do not know whether the few outliers are caused by missing covariates not yet in the model, or by data errors. If I knew they were data errors, I would drop them. Also, I found that increasing the run from 4 chains of 2000 iterations (warmup 1000) to 4 chains of 3000 (warmup 1000) decreased the pareto-k values; some went from > 1 to < 1. So conceivably a much longer run would result in all pareto-k < 1, in which case I would not want to drop the outliers currently showing > 1. That is why I wonder whether the loo_compare strategy outlined above would be reasonable.

To put it another way: if we refuse to do loo_compare in the presence of any points with pareto-k > 1 (or > 0.7), then it seems we are saying that the reported global and pointwise elpd values and the elpd se are meaningless given the presence of those points. Yet the reported values must mean something, I would like to think!

thanks

Greg

Estimated pointwise elpd values with associated k < 1 are estimated to have a finite mean, and with k < 0.7 a small bias, so they are meaningful. Estimated pointwise elpd values with associated k > 1 are estimated to have a non-finite mean. If the global summary is the sum of pointwise elpds, then the sum of finite and non-finite terms is non-finite and not meaningful. It is possible that the estimate of k is > 1 even when the true k < 1, but we can check that only asymptotically, which can require a sample size bigger than the number of atoms in the universe. If the pointwise data are not exchangeable but only conditionally exchangeable, e.g. conditioned on the covariates, you can report summaries of a subset of the pointwise elpds conditional on the covariates, as long as you make it clear that the summary is for a subset and that you don't know how bad the model behaviour is for the excluded covariate values. Of course, even with perfect LOO-CV computation, what you can say is based on a finite number of observations, and there is always uncertainty about what happens in the future (unless you have infinite data).
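For instance, a subset summary of that kind can be computed directly from the pointwise values; a sketch, where keep is a hypothetical covariate-defined subset of the earlier example's data:

keep <- which(datE1$fac2 != "North")   #hypothetical conditioning on a covariate
elpd_keep <- sum(looE1$pointwise[keep, "elpd_loo"])
se_keep <- sqrt(length(keep)) * sd(looE1$pointwise[keep, "elpd_loo"])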

Thanks. I'm about to try mixis, which may be preferable, though slower. I will come back on the forum if I get it to work.


Hello, may I ask a question here? I am using the 'ubms' package to estimate abundance with N-mixture models.

When I include the sampling covariates and all the site covariates, a warning pops up saying some Pareto k diagnostic values are too high. I then tried the arguments moment_match = TRUE and k_threshold = 0.7, but the warning is still there.

fit_TD

Call:
stan_pcount(formula = ~Time + Date ~ elevation + dbh + pine + oak + SiteYear,
    data = GNumf, K = 30, prior_intercept_state = normal(-0.156, 5),
    init = init_fun, chains = 4, iter = 20000, cores = 4, seed = 123,
    thin = 5, control = list(adapt_delta = 0.99, max_treedepth = 16))

Abundance (log-scale):
                     Estimate    SD    2.5%   97.5% n_eff Rhat
(Intercept)           -0.0790 0.216 -0.5173  0.3344  7424    1
elevation             -0.2148 0.344 -0.9078  0.4359  8029    1
dbh                    0.9430 0.281  0.3981  1.4955  7830    1
pine                   0.7188 0.347  0.0404  1.4082  7787    1
oak                    0.6343 0.308  0.0444  1.2425  8153    1
SiteYearAMAM2024      -0.8016 0.386 -1.5710 -0.0595  7690    1
SiteYearHtarNgo2024    0.0934 0.532 -0.9472  1.1226  8005    1
SiteYearLoiMaide2023  -0.0886 0.406 -0.9234  0.6971  7832    1

Detection (logit-scale):
            Estimate    SD   2.5%   97.5% n_eff Rhat
(Intercept)   -0.341 0.176 -0.707 -0.0163  7893    1
Time          -0.281 0.233 -0.740  0.1601  7665    1
Date          -0.303 0.276 -0.859  0.2233  8034    1

LOOIC: 660.877
Runtime: 31.375 min

loo(fit_TD, moment_match = TRUE)   #still warns: some Pareto k diagnostic values are too high

Computed from 8000 by 104 log-likelihood matrix.

         Estimate   SE
elpd_loo   -330.4 28.8
p_loo        13.3  1.9
looic       660.9 57.6

MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.1]).

Pareto k diagnostic values:
                          Count Pct.  Min. ESS
(-Inf, 0.7]   (good)        102 98.1% 2943
   (0.7, 1]   (bad)           2  1.9%
   (1, Inf)   (very bad)      0  0.0%
See help('pareto-k-diagnostic') for details.
Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

loo(fit_TD, k_threshold = 0.7)

Computed from 8000 by 104 log-likelihood matrix.

         Estimate   SE
elpd_loo   -330.4 28.8
p_loo        13.3  1.9
looic       660.9 57.6

MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.8, 1.1]).

Pareto k diagnostic values:
                          Count Pct.  Min. ESS
(-Inf, 0.7]   (good)        102 98.1% 2943
   (0.7, 1]   (bad)           2  1.9%
   (1, Inf)   (very bad)      0  0.0%
See help('pareto-k-diagnostic') for details.
Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

Hopefully you can help and guide me when you are available. Many thanks in advance!

Hello

I haven't tried your example, but my experience has been:

a) moment-match did not work with my data;
b) mixis did work;
c) with a small synthetic dataset, mixis and loo gave similar results;
d) with my actual data, mixis and loo gave similar results overall, but the highest values were a bit different;
e) the disadvantages of mixis are:

  1. you need to re-fit the model (just once) with the altered code, unlike loo, which works from the original fit;
  2. I don't know how to produce an equivalent of E_loo with mixis, to estimate the leave-one-out prediction for a function of the parameters, e.g. the predicted mean.

So currently I need to use E_loo for that (a sketch follows below) while separately refining the elpd estimates with mixis.
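For completeness, a sketch of the E_loo workflow referred to above, using standard loo package calls on the earlier model mE1 (assuming the default 4 chains):

ll <- log_lik(mE1)                  #draws x observations
r_eff <- relative_eff(exp(-ll), chain_id = rep(1:4, each = nrow(ll) / 4))
psis1 <- psis(-ll, r_eff = r_eff)   #PSIS weights from the original fit
mu_draws <- posterior_epred(mE1)    #draws x observations
loo_mu <- E_loo(mu_draws, psis1, type = "mean", log_ratios = -ll)$value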

Running mixis is not difficult. See if you can get it to work!

Once the pointwise elpd_mixis estimates are generated, I do model comparison in the same way as loo_compare: the global difference is the sum of the pointwise differences, and its estimated se is N^0.5 times the sd of the pointwise differences.
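That is (a sketch, with hypothetical vectors elpd_mixis_1 and elpd_mixis_2 holding one pointwise value per observation):

d <- elpd_mixis_2 - elpd_mixis_1   #pointwise differences
c(elpd_diff = sum(d), se_diff = sqrt(length(d)) * sd(d))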

As to whether mixis is actually more accurate than loo, I'm just accepting what the vignette (Mixture IS leave-one-out cross-validation for high-dimensional Bayesian models) and the paper (arXiv:2209.09190, Robust leave-one-out cross-validation for high-dimensional Bayesian models) say. But I don't know enough to judge whether that's right…

Greg
