Cooks distance and pareto-k

gregd · September 27, 2024, 9:40am

Short summary of the problem

hi all

One stage in model development is to identify outliers and consider whether they might be explained by some covariates not yet in the model. We can then run a variety of alternative models in order to resolve the problem or, if that fails, give up and exclude some data points. Because mgcv is so much faster than brms, I’m finding it essential for that stage before eventually returning to brms and loo.

But how can we identify outliers in a gam model which will correspond to points with pareto-k>0.7 when the equivalent model is run in brms / loo? We can use Cook’s Distance in mgcv but what threshold should we use?

Here is an example which may help someone to answer!

The idea is to fit a generalized pareto distribution to the top portion of the Cook’s Distance values in the gam model, and identify points where the gpd fit fails. In this example, as in a few others I’ve tried, all points for which pareto-k>0.7 are identified through the Cook’s procedure, but not conversely. Even so, if this works in general it would still be useful because resolving problems through model development in mgcv could resolve them in brms.

library(mgcv)
library(mev)

set.seed(123)
gs<-gamSim(eg=2,n=400,dist="normal",scale=0.1,verbose=TRUE)
attach(gs)
par(mfrow=c(2,2))
with(truth,persp(x,z,f))
with(data,plot(f,y))
with(data,plot(x,f))
with(data,plot(z,f))
par(mfrow=c(1,1))

Y<-data$y
X<-data$x
Z<-data$z

sam<-sample(400,10,replace=FALSE)
Y[sam]<-Y[sam]+rnorm(10,0,1)
plot(data$f,Y)
points(data$f[sam],Y[sam],pch=16,col=2)

gm1<-gam(Y~t2(X,Z),method="ML")
gam.check(gm1)
summary(gm1)
plot(gm1$fit,Y)
points(gm1$fit[sam],Y[sam],pch=16,col=2)

cd1<-cooks.distance(gm1)
ocd1<-order(cd1)
cd1o<-cd1[ocd1]
plot(cd1o)

top1<-cd1o[301:400]
top1x<-top1-cd1o[301]+1e-4
plot(top1x)

fgpd<-fit.gpd(top1x,method="Grimshaw")
sca<-fgpd$est[1]
sha<-fgpd$est[2]

U<-(1:100-0.5)/100
U<-rev(U)
V<-sca*(U^(-sha)-1)/sha
qqplot(V,top1x)
lines(c(0,1),c(0,1))
#7 appear as problem

cd1x<-order(cd1)[394:400]
plot(gm1$fit,Y)
points(gm1$fit[sam],Y[sam],pch=16,col=2)
points(gm1$fit[cd1x],Y[cd1x],col=3,cex=2,lwd=2)

library(brms)
library(cmdstanr)
options(mc.cores=2, brms.normalize=TRUE,brms.backend="cmdstanr")
library(loo)

dat1<-data.frame(X,Z,Y)
bf1<-bf(Y~t2(X,Z))
brm1<-brm(bf1,data=dat1,control=list(adapt_delta=0.9))
fit1<-fitted(brm1)[,1]
plot(gm1$fit,fit1)

loo1<-loo(brm1)
parx1<-which(loo1$diag$par>0.7)

which(!parx1%in%cd1x)
which(!cd1x%in%parx1)

plot(fit1,Y)
points(fit1[sam],Y[sam],pch=16,col=2)
points(fit1[cd1x],Y[cd1x],col=3,cex=2,lwd=2)
points(fit1[parx1],Y[parx1],col=4,cex=3,lwd=2)

qqplot(V,top1x)
lines(c(0,1),c(0,1))
ocd1t<-ocd1[301:400]
points(V[ocd1t%in%parx1],top1x[ocd1t%in%parx1],pch=16,col=2)

Operating System: Linux Mint 21.2 Victoria base: Ubuntu 22.04 jammy
brms Version: 2.21.0

Topic		Replies	Views
Loo_compare in the presence of high pareto-k brms loo	4	297	June 25, 2024
Pareto k values are too high when running model comparison Modeling loo	10	3058	March 11, 2020
Problems with model comparison via loo and kfold Modeling loo , brms	10	937	February 8, 2021
Large Pareto K-values and significant Moran I with CAR structure Modeling fitting-issues , loo , spatial , brms	1	187	May 18, 2024
Pareto K diagnostics and kfold model comparison brms loo	15	4940	February 13, 2019

Cooks distance and pareto-k

Related topics