Hi to all. I’m using brms and loo to fit and compare several spatial models where the response is multinomial. I can fit them without convergence problems; the issues arise when I try to compare them. First, I get the warning about p_waic values greater than 0.4, and no matter how many iterations I use, the standard errors remain constant, so I can’t determine which model is best even if the warning is ignored. If I use loo, I get the warning about Pareto k values greater than 0.7, so finally I use kfold and get an error about the limitation on new locations in CAR models. I understand that these kinds of problems are related to misspecification or vague spatial priors, but it is a relatively simple model and the spatial priors in brms are the ones recommended by Mitzi Morris. I would really appreciate any help on the subject. Thank you.

Both p_waic and its “equivalent” p_loo are estimates of the effective number of parameters in your proposed model. Ideally this estimate should be close to the true number of parameters in the model, although that comparison also depends on sample size and the usual statistical caveats.

When you get the warning about p_waic values greater than 0.4, the diagnostic is not reliable (the 0.4 threshold is empirically chosen). Aki Vehtari (who is responsible for that specific warning message) recommends using PSIS-LOO instead (here you can find a quick overview of p_loo and Pareto k values).

If I’m not mistaken, that summarizes it well. Pareto k values, in addition to indicating whether the model you are proposing is well or poorly specified for the phenomenon under study, can give clues about influential and high-leverage points. That may be what is happening in your case (although I haven’t run your model with your data to confirm this).
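For concreteness, these diagnostics can be pulled straight from a brms fit; a minimal sketch, assuming `fit` is an already-fitted brmsfit object:

```r
library(brms)

# 'fit' is assumed to be an existing brmsfit object
loo_fit <- loo(fit)   # PSIS-LOO with Pareto k diagnostics
print(loo_fit)        # shows elpd_loo, p_loo, and a table of Pareto k ranges
plot(loo_fit)         # k value per observation; spikes flag influential points
```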

Some references if you want to delve into the theory a little more are below.

In my experience with spatial models, those warnings from loo are completely typical. I don’t think I’ve ever fit a spatial model that did not produce them (though I stopped checking a long time ago). Personally, I take this to mean that WAIC (and similar criteria) should be used only as a rough indicator for spatial model comparison.

The discussion that @Carlos_Zarzar linked to is useful, I’ll just point out especially this comment from @betanalpha

That suggests that these approximations are not well suited for spatial models: we use spatial models both because nearby observations are not IID (there is usually expected similarity), and because a draw from one side of the map is not expected to look like a draw from somewhere else.

Hi @Carlos_Zarzar and @cmcd. Thank you both. I took the time to run some additional tests and research with the information that Carlos shared. Michael’s comment is really illuminating, but it seems to contradict this old thread: “If WAIC is derived from cross-validation it does not need the assumption of independent errors”. Aki Vehtari also says, “WAIC is sane choice although it seems that it can break sometimes when using very flexible models (like Gaussian process or Markov random field spatial models)”. My models seem to fall in the last category of this post, and spatial models are actually mentioned too. But is there a way to correct them? Or can I ignore the warning and still use WAIC or PSIS-LOO as a criterion to select one spatial model over another?

Your question is very deep @manjago. I have my limitations in statistics, and I don’t know how familiar you are with it, but WAIC is an approximation to leave-one-out cross-validation. If you have problems with WAIC or LOO, I recommend using cross-validation directly. However, it carries a higher computational cost (depending on model complexity, data size, etc.).

WAIC was introduced by Watanabe (2010), who calls it the widely applicable information criterion. However, because the acronym also fits, many know it as the Watanabe–Akaike information criterion.

If you are interested, I suggest the references below:

Watanabe, S. (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594.

Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6), 997–1016.

There are others around here who can give far better answers to your questions about WAIC/LOO than I can, so I’ll leave those to them if they’re able to respond. But you’re also getting into some general questions about model comparison. You can find many authors in the spatial statistics literature using information criteria (especially DIC) for model comparison, sometimes as the only point of comparison. Whatever you do with information criteria (but don’t ignore the warnings), I would follow many others in suggesting that you generally draw on a number of methods for model comparison and criticism, including analysis of residuals (visualizing and measuring residual spatial autocorrelation, examining outliers, and so on), while leveraging your background knowledge and research purpose, theory, and questions.

The LOO diagnostics are also interesting for the way they suggest further model checking of this sort. It’s also completely legitimate to find that, among multiple models, none definitively stands out from the others.

Thinking about an investigative path for you to follow, I would point you to the Pareto distribution. It is used mainly to correct the tail of the distribution of importance-sampling weights. To that end, Vehtari, Gelman, and Gabry (2017) proposed Pareto smoothed importance sampling (PSIS) to stabilize importance sampling.

Cited reference:

Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.
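As a rough illustration of that smoothing step, the loo package exposes `psis()` directly; a sketch with simulated log importance ratios (the matrix `lr` here is made up, not from a real model):

```r
library(loo)

set.seed(1)
# 4000 posterior draws x 10 observations of log importance ratios (simulated)
lr <- matrix(rnorm(4000 * 10), nrow = 4000, ncol = 10)

ps <- psis(lr, r_eff = rep(1, 10))  # Pareto-smooth the tails of the weights
ps$diagnostics$pareto_k             # one k estimate per observation
```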

LOO is less reliable if k > 0.7. Typically we would recommend either running exact LOO or (exact) k-fold CV for those observations, or increasing the MC sample size.
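Both options are available in brms; a sketch, assuming `fit` is an already-fitted brmsfit:

```r
# Refit only for the observations whose Pareto k exceeds the threshold
loo_exact <- loo(fit, reloo = TRUE)

# Or run 10-fold cross-validation (refits the model 10 times)
kf <- kfold(fit, K = 10)

# Brute-force exact LOO would be K equal to the number of observations:
# kfold(fit, K = nobs(fit))
```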

For spatial models, it is likely that we can further model the pointwise PSIS-LOO results and get spatial smoothing. This extra smoothing step is likely to reduce variance. To that end, we can first run PSIS-LOO to get the pointwise loo-lpd, then treat these numbers as noisy observations and fit a spatial model: loo-lpd ~ pointwise covariates.

Likewise, when there are two models, we can smooth the pointwise comparison by modeling loo-lpd1 - loo-lpd2 ~ pointwise covariates.
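A sketch of that idea, assuming `loo1` and `loo2` are loo objects for two models fit to the same data and `d` holds the pointwise covariates (all names hypothetical):

```r
# Pointwise elpd contributions from each model
elpd1 <- loo1$pointwise[, "elpd_loo"]
elpd2 <- loo2$pointwise[, "elpd_loo"]
d$elpd_diff <- elpd1 - elpd2

# Treat the pointwise differences as noisy observations and smooth them
# over the covariates (a spatial term could be added here as well)
fit_diff <- lm(elpd_diff ~ unemployment + schooling, data = d)
summary(fit_diff)
```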

Hi @Carlos_Zarzar. Yes, I’m trying to understand what is causing the high values. As I mentioned before, my models seem to be very flexible, but I’m struggling to find a solution.

By “exact LOO” do you mean to refit n times with n subsets of n-1 observations?

If I build the models that you suggest and get no warnings from WAIC or PSIS-LOO, can I use the results to select my model? Or is there something I’m missing? And if I have more than two models, do I have to build every pairwise combination?

As for k-fold, it is the recommended solution, but I get:

@manjago If it is feasible to run exact LOO, the results will always be more reliable.
Vehtari et al. have a tutorial on non-factorized models, which include spatial models.

Thank you. Yes, this is what I thought. My dataset consists of 470 observations and I’m comparing several models, so considering my computing power limitations I think this would be unfeasible. Isn’t the objective of the article to validate the approximation? Besides, doesn’t this end in the same issue? If I understand @paul.buerkner correctly, CAR models can’t handle new locations; that’s why k-fold is not working.

Ok, I get you. Now I understand your intentions. If you’ll allow me, I’ll suggest a way for us to understand what is causing the high k values. I’m not an expert, but we can think about the problem together and maybe find a way to solve it.

Can you tell how many of the observations had values greater than 0.7? As you know, print(loo(fit_brm)) normally shows this information.

Let’s try to identify which of your observations had k values >= 0.7 and look for any pattern in them. With your experience with the data, this may give some indication of what is happening; comparing against your model’s predictions (a graph, plot, or map can help) is also useful. Use the pareto_k_ids() function from the loo package to detect the observations that generated k values greater than 0.7. This link can help you if needed.
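A sketch of that check, assuming `loo_fit` is the object returned by loo(fit) and `d` is your data frame:

```r
library(loo)

bad_ids <- pareto_k_ids(loo_fit, threshold = 0.7)  # indices of problematic observations
pareto_k_values(loo_fit)[bad_ids]                  # their k values

# Inspect those rows of the data for a spatial or covariate pattern:
# d[bad_ids, ]
```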

Do you declare any prior information, or does the brm() function use its default priors? Do you believe your priors could be causing these high k values?

Hi Carlos. Yes, 130 (27.7%) of the observations have values greater than 0.7, and I can’t identify a pattern in them. About the priors: yes, I’m using the default brms priors. In the case of the spatial component, I think they are ok. For the beta priors, I have no reason to doubt them either, but I agree the priors can’t be discarded as a cause of the problem.

Additional information. As mentioned, n = 470. The p_loo of my base model is 375.3 and the number of parameters p is 1902. The latter was obtained by:

rstan::get_num_upars(your_brms_model$fit)

This was validated by Paul Buerkner here:

Based on this, I’m here:

“If p_loo < the number of parameters p and the number of parameters p is relatively large compared to the number of observations p>n/5, it is likely that model is so flexible or population prior is so weak that it’s difficult to predict for left out observation even if the model is true one.”

As discussed above, the priors could be the cause, but I don’t know what to look for in order to modify them.

I also made other recommended checks: the posterior predictive check (in this case, for each condition) and the LOO-PIT check. They seem ok, but at least for the posterior predictive check, this does not mean there isn’t a problem with the model. Other tests such as k-fold CV are recommended, but it seems that this is not possible for spatial models:

This post is actually a spatial model example, but the same solution is recommended:

Maybe I’m completely wrong, but in a model like this, where my response variable is an array of condition counts and I only have one observation per geographical unit, it is natural to get this kind of problem:

Thanks again Carlos. If you or other members of the community have an idea, the help will be really appreciated.

Hi @manjago, I’ve been on vacation, but it seems you’ve been able to find the most relevant information. Some quick additions

LOO can be used for spatial models and is useful for diagnosing the observation model. Depending on the prediction task other cross-validation forms might be needed, see some visual examples at https://spatialsample.tidymodels.org/
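For example, spatially clustered folds can be built with spatialsample; a sketch, assuming `gus_sf` is an sf data frame of the geographical units:

```r
library(spatialsample)

folds <- spatial_clustering_cv(gus_sf, v = 10)  # 10 spatially clustered folds
autoplot(folds)                                 # map of the fold assignment
```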

The spatial prior is very weak. I’m guessing you also have some n that are 0 or N which make it more likely that the model overfits.

In my data, the response variable is the number of cases of three different conditions (normal weight, overweight, and obesity) in 470 geographical units (gus). For each of these units I have eight covariates, such as % unemployment, average years of schooling, etc. Our interest is just to estimate the effect of each of these covariates on the conditions. As you said, in some gus we have 0 cases of obesity, for example, and the total number of cases (sum of the cases of each condition) varies from 4 to 6102.

We think that variable selection must be done, in order to use and explain just the variables that are really needed. I have already read some of your articles and apparently using CV or information criteria is not a good choice for this purpose, but I’m not clear about the alternatives. If I stick to the path recommended by the Stan warnings, I get stuck at k-fold, because CAR models are not supported (they cannot handle new locations), and the mi() command that handles missing data is only available for continuous dependent variables. An interesting alternative that I found is this one:

to achieve something like this:

randomly assigning 0s and 1s in the weight variable instead of making the folds. However, I’m still working on it.
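A sketch of that weighting trick, assuming a data frame `d` with response `y`, covariates `x1` and `x2`, a grouping variable `gu`, and adjacency matrix `W` (all names hypothetical):

```r
library(brms)

set.seed(1)
d$fold <- sample(rep(1:10, length.out = nrow(d)))

for (k in 1:10) {
  d$w <- as.numeric(d$fold != k)  # weight 0 for held-out rows, 1 otherwise
  fit_k <- brm(y | weights(w) ~ x1 + x2 + car(W, gr = gu),
               data = d, data2 = list(W = W), family = poisson())
  # evaluate the log predictive density on the rows where w == 0
}
```

Because no rows are dropped, every location stays in the CAR adjacency graph, which sidesteps the new-locations limitation.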

About the posterior predictive checks: if I use the regular command for a continuous response, they look “too good to be true” nice, which I think supports your overfitting point.

Any advice? Besides the model selection issue, do all of these problems mean that this model is bad or invalid?