Cooks distance and pareto-k

Short summary of the problem

hi all

One stage in model development is to identify outliers and consider whether they might be explained by some covariates not yet in the model. We can then run a variety of alternative models in order to resolve the problem or, if that fails, give up and exclude some data points. Because mgcv is so much faster than brms, I’m finding it essential for that stage before eventually returning to brms and loo.

But how can we identify outliers in a gam model which will correspond to points with pareto-k>0.7 when the equivalent model is run in brms / loo? We can use Cook’s Distance in mgcv but what threshold should we use?

Here is an example which may help someone to answer!

The idea is to fit a generalized pareto distribution to the top portion of the Cook’s Distance values in the gam model, and identify points where the gpd fit fails. In this example, as in a few others I’ve tried, all points for which pareto-k>0.7 are identified through the Cook’s procedure, but not conversely. Even so, if this works in general it would still be useful because resolving problems through model development in mgcv could resolve them in brms.

library(mgcv)
library(mev)

set.seed(123)
gs<-gamSim(eg=2,n=400,dist="normal",scale=0.1,verbose=TRUE)
attach(gs)
par(mfrow=c(2,2))
with(truth,persp(x,z,f))
with(data,plot(f,y))
with(data,plot(x,f))
with(data,plot(z,f))
par(mfrow=c(1,1))

Y<-data$y
X<-data$x
Z<-data$z

sam<-sample(400,10,replace=FALSE)
Y[sam]<-Y[sam]+rnorm(10,0,1)
plot(data$f,Y)
points(data$f[sam],Y[sam],pch=16,col=2)

gm1<-gam(Y~t2(X,Z),method="ML")
gam.check(gm1)
summary(gm1)
plot(gm1$fit,Y)
points(gm1$fit[sam],Y[sam],pch=16,col=2)

cd1<-cooks.distance(gm1)
ocd1<-order(cd1)
cd1o<-cd1[ocd1]
plot(cd1o)

top1<-cd1o[301:400]
top1x<-top1-cd1o[301]+1e-4
plot(top1x)

fgpd<-fit.gpd(top1x,method="Grimshaw")
sca<-fgpd$est[1]
sha<-fgpd$est[2]

U<-(1:100-0.5)/100
U<-rev(U)
V<-sca*(U^(-sha)-1)/sha
qqplot(V,top1x)
lines(c(0,1),c(0,1))
#7 appear as problem

cd1x<-order(cd1)[394:400]
plot(gm1$fit,Y)
points(gm1$fit[sam],Y[sam],pch=16,col=2)
points(gm1$fit[cd1x],Y[cd1x],col=3,cex=2,lwd=2)

library(brms)
library(cmdstanr)
options(mc.cores=2, brms.normalize=TRUE,brms.backend="cmdstanr")
library(loo)

dat1<-data.frame(X,Z,Y)
bf1<-bf(Y~t2(X,Z))
brm1<-brm(bf1,data=dat1,control=list(adapt_delta=0.9))
fit1<-fitted(brm1)[,1]
plot(gm1$fit,fit1)

loo1<-loo(brm1)
parx1<-which(loo1$diag$par>0.7)

which(!parx1%in%cd1x)
which(!cd1x%in%parx1)

plot(fit1,Y)
points(fit1[sam],Y[sam],pch=16,col=2)
points(fit1[cd1x],Y[cd1x],col=3,cex=2,lwd=2)
points(fit1[parx1],Y[parx1],col=4,cex=3,lwd=2)

qqplot(V,top1x)
lines(c(0,1),c(0,1))
ocd1t<-ocd1[301:400]
points(V[ocd1t%in%parx1],top1x[ocd1t%in%parx1],pch=16,col=2)

  • Operating System: Linux Mint 21.2 Victoria base: Ubuntu 22.04 jammy
  • brms Version: 2.21.0