Large Pareto K-values and significant Moran I with CAR structure

I am new to brms and trying to fit a brms model with Beta() distribution and CAR to account for spatial autocorrelation. My response variable represents ecosystem resilience and is a continuous value between 0.0 and 1.0. As I am using spatial data as input, the model has tested significant for spatial autocorrelation (using Moran’s I from both spdep and DHARMa.helpers). Here is my model code when including the CAR term, as well as calculating loo and testing for spatial autocorrelation:

resil.formula.sa <- resil ~ soilSand + slope + twi + fireFreq2015 + elephant + drought2015 + wetness2015 + geology + car(W, gr=ID)
resil.model.sa <- brm(formula = resil.formula.sa,  
                   data = train.df,
                   data2 = list(W=W),
                   family= Beta(link="logit"),
                   warmup = 4000, 
                   iter = 6000, 
                   threads = threading(6),
                   chains = 5, 
                   cores=5,
                   seed = 123,
                   save_pars = save_pars(all = TRUE))

loo.resil <- loo(resil.model.sa, save_psis = TRUE, cores = 16)
bres <- residuals(resil.model.sa, method = "posterior_predict")[,"Estimate"]
moran.test(bres, listW)

When running the the model without the CAR term, all Pareto k values from the loo package are < 0.7. However, when I include the CAR term, I suddenly have A LOT of bad Pareto k values:

Computed from 10000 by 11969 log-likelihood matrix.

         Estimate    SE
elpd_loo  29983.7  70.7
p_loo      8879.5  56.7
looic    -59967.4 141.4
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.2, 1.6]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     2446  20.4%   118     
   (0.7, 1]   (bad)      8620  72.0%   <NA>    
   (1, Inf)   (very bad)  903   7.5%   <NA>    
See help('pareto-k-diagnostic') for details.

In addition, my Moran’s I value is not better and still significant:

Moran I statistic standard deviate = 50.631, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.8384105677     -0.0001950078      0.0002743321
Computed from 10000 by 5129 log-likelihood matrix.

Using the CAR model gives me a much better pp_check() graph, better loo value and better Bayes R-squared. Thus, I am very confused by the very poor Pareto k values and how to address them. Is there a different way to check for spatial autocorrelation for CAR models? I have seen brms:::posterior_predict_gaussian_lagsar mentioned in this post , but I can’t find an equivalent for CAR structures. Unfortunately, Beta() families are not supported when implementing the SAR structures.

Thank you in advance for any insight and help!

CAR model adds one parameter for each location and if each location has one observation, then leave-one-out cross-validation removes all the data for the corresponding parameter, and the posterior for that parameter changes a lot. loo() function uses Pareto smoothed importance sampling for LOO-CV computation, but if the posterior changes too much, the importance sampling part fails. Pareto k values are diagnostic telling when the importance sampling fails (see LOO package glossary — loo-glossary • loo). When we get this many high Pareto k values, we can also look at p_loo which in this case is about 8880. Looking again at the LOO package glossary — loo-glossary • loo, we find

  • If p_loo < p and the number of parameters p is relatively large compared to the number of observations (e.g., p>N/5), it is likely that the model is so flexible or the population prior so weak that it’s difficult to predict the left out observation (even for the true model). This happens, for example, in the simulated 8 schools (in VGG2017), random effect models with a few observations per random effect, and Gaussian processes and spatial models with short correlation lengths.

which holds also for your CAR model.

With that flexible model pp_check() and Bayes R-squared are too optimistic and not useful at all. With that many high Pareto k’s the elpd_loo estimate is too optimistic and cannot be used in model comparison. See a related example in Roaches case study.

Roaches case study also illustrates that it is possible to use integrated-LOO approach to make the LOO-CV computation work reliably. Unfortunately, currently there is no easy way to do this with brms, and you would need to some coding outside of brms. We’re working on adding an easy way to brms, but as we want to make the easy way to work with many brms models, we need to a bit more work before it’s available. If I happen to have time in the next few days, I can try to code a quick code approach.

1 Like