Understanding `loo` results for non-standard `Contaminated Controls` model

Dear Stan Community,

I am trying to implement the convergence diagnostics from the loo package to the model below. I’m getting results which I find confusing, and was hoping you could help me further understand what they mean, and what next-steps I should take.

For reference, the underlying data is a sample of recruits to a terrorist organisation, stacked on-top of a sample of unlabelled controls (more info here). Notable non-standard features of this model are:

  • a contamination layer, implemented as a mixture of Bernoulli distributions, which accounts for the artificial data-generating process ;
  • an offset on the logit scale, indicating the known baseline rate of recruitment, used to obtain unbiased estimates of the intercept from this non-representative sample;
  • a set of BYM2 spatial random effects, indicating exposure to recruitment according to spatial proximity to recruits.

Q1. I have implemented the log-likelihood of the observations as:

generated quantities {
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = log_mix(rho[i],
                   bernoulli_lpmf(Y[i] | xi[1] ),
                   bernoulli_lpmf(Y[i] | xi[2] ) );

My first question is - is this the correct specification of the log-likelihood ? I worry a mistake here might be leading to weirdness in the loo output. Moreover, could the fact that the DGP is artificial (stacking cases over unlabelled controls) be contributing to strangeness in the loo output ?

Q2. From the loo results below (Convergence.Khat.pdf (25.7 KB)) I understand the model seems not to be converging according to the k-hat statistic. This is in contradiction with R-hat which is very well-behaved across all parameters (Convergence.Rhat.pdf (25.1 KB)). I would describe the traceplots (Convergence.Traceplots.Beta.pdf (90.3 KB)) for the regression coefficients of interest as reasonably well mixed.

What does this in mean in practice ? I have two convergence diagnostics which give me contradictory advice. More generally, how do I logically reconcile these these differences ?

Computed from 1008 by 980 log-likelihood matrix

         Estimate   SE
elpd_loo   -321.1 17.1
p_loo        95.8  5.9
looic       642.2 34.2
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     717   73.2%   151       
 (0.5, 0.7]   (ok)       133   13.6%   116       
   (0.7, 1]   (bad)       74    7.6%   38        
   (1, Inf)   (very bad)  56    5.7%   839       
See help('pareto-k-diagnostic') for details.

Q3. Looking at the Min. n_eff, I am surprised by the “very bad” category - I would have expected the n_eff to very small, and yet it seems the loo package is detecting a very large n_eff. How can I interpret this ?

Q4. If the k-hat statistic is ultimately to be trusted, then what advice could you give in terms of reparametrisation of the model that could improve things on this metric ?

The Stan code for the model is below. Thanks in advance for all your help.

data {
                              int<lower = 1> n; // total number of observations
                              int<lower = 1> p; // number of covariates in design matrix

                               real log_offset; // offset
                                    real theta; // Pr(Y = 1 | r = 1, s = 1)

                                matrix[n, p] X; // design matrix
                           int<lower = 0> Y[n]; // vector of labels

                         int<lower = 1> N_dist; // number of small-areas
                          int<lower = 1> N_gov; // number of large-areas
                     int<lower = 1> dist_id[n]; // small-area id
                      int<lower = 1> gov_id[n]; // large-area id

                     int<lower=0> N_dist_edges; // Data for improper, efficient spatial prior:
int<lower=1, upper=N_dist> node1[N_dist_edges]; // node1[i] adjacent to node2[i]
int<lower=1, upper=N_dist> node2[N_dist_edges]; // and node1[i] < node2[i]
                           real scaling_factor; // scaling factor derived from the adjacency matrix

parameters {
                 vector[N_dist] phi; // small-area random effect - main
                 vector[N_dist] psi; // small-area random effect - spatial
                  vector[N_gov] eta; // gov random effect
   real<lower = 0,upper = 1> lambda; // mixing weight
                real<lower = 0> sigma; // sd of small-area random effect
            real<lower = 0> tau_eta; // precision of large-area random effect
                     vector[p] beta; // fixed effects coefficients

transformed parameters{
            real<lower = 0> sigma_eta; // sd of large-area random effect
            real<lower = 0> tau; // precision of small-area random effect

             vector[N_dist] gamma; // convoluted small-area effect
                     vector[n] mu; // logit-scale propensity to be a recruit - according to the non-probability sampling framework
                     vector[n] mu_star; // logit-scale propensity to be a recruit - in absence of the non-probability sampling framework

real<lower = 0, upper = 1> rho[n]; // propensity to be a recruit - according to the non-probability sampling framework
real<lower = 0, upper = 1> rho_star[n]; // propensity to be a recruit - in absence of the non-probability sampling framework

 real<lower = 0, upper = 1> xi[2]; // probability of being labeled a recruit

      sigma_eta = sqrt(1/tau_eta); // large-area scale
      tau = pow(sigma,-2); // small-area scale

// variance of each component should be approximately equal to 1
     gamma =  sqrt(1-lambda) * phi + sqrt(lambda / scaling_factor) * psi ; 
 // linear function of the logit-scale propensity to be a recruit
mu = log_offset + X * beta + gamma[dist_id]*sigma + eta[gov_id]*sigma_eta;
mu_star = X * beta + gamma[dist_id]*sigma + eta[gov_id]*sigma_eta;
for(i in 1:n){
       rho[i] = inv_logit(mu[i]); // propensity to be a recruit
       rho_star[i] = inv_logit(mu_star[i]); // propensity to be a recruit
                   xi[1] = theta; // defining the propensities of being a label for each of the candidate models in the mixture
                       xi[2] = 0;

model {

// // // Coefficients
        beta[1] ~ cauchy(0, 10);
for(i in 2:p){
       beta[i] ~ cauchy(0, 2.5); // prior on fixed-effects, excluding the intercept

// // // Random Effect on Small-Area
              phi ~ normal(0,1);
// // // ICAR priors
  target += -0.5 * dot_self(psi[node1] - psi[node2]);
// soft sum-to-zero constraint on psi,
// equivalent to mean(psi) ~ normal(0,0.01)
  sum(psi) ~ normal(0, 0.01 * N_dist);
  // // // Random Effect on Large-Area
  eta ~ normal(0,1);
  tau_eta ~ gamma(0.1,0.1);
// // //  Convolution Parameters
    lambda ~ beta(0.5,0.5); //dirichlet(alpha);
    sigma ~ normal(0,1);

// // // Contamination 
for (i in 1:n){
 target += log_mix(rho[i],
                   bernoulli_lpmf(Y[i] | xi[1] ),
                   bernoulli_lpmf(Y[i] | xi[2] ) ); // labels distributed as mixture of bernoulli distributions

generated quantities {
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = log_mix(rho[i],
                   bernoulli_lpmf(Y[i] | xi[1] ),
                   bernoulli_lpmf(Y[i] | xi[2] ) );

A further question is perhaps more broadly what is the new state-of-the-art for procedure convergence checking in large hierarchical models ?

I’m gathering from this post by Dan Simpson and this article that we are dropping traditional Rhat and traceplots as we have become more aware of their limitations.

But of the many measures and plots which have been proposed (in the article and , separately, in the loo package) which should we use ? The article doesn’t mention cross-validation for convergence checking, and doesn’t compare instances where the new Rhat measures and the loo measures are in contradiction - is the idea that if any of these measures is off we should consider the model non-convergent ?

Hi Roberto,

I haven’t found time to look at your specification of the likelihood, but I think I can clear up some of the issues around r-hat and Pareto k.

First, the Pareto k diagnostic is not a convergence diagnostic. Neither is cross-validation. More on that below.

Second, the materials regarding r-hat that you link in your second post point out a problem with the traditional r-hat metric, and then they present a new and improved r-hat that solves the problem. All of the Stan ecosystem (at least the regularly maintained parts) use this new r-hat diagnostic. With that said, r-hat still isn’t always sensitive to lack of convergence. Practically speaking, however, if you have a model without divergences, without bfmi warnings, with acceptable effective sample sizes, and with low r-hat, then you can feel reasonably confident about convergence, particularly if you’ve run a not-too-small number of chains with inits that are overdispersed relative to the posterior. To feel even more confident, you might run SBC (simulation-based calibration) on your model.

Third, cross validation is good for evaluating model adequacy or misspecification, but has nothing to do with convergence. While poor fit and misspecification sometimes obstruct convergence in practice, they are conceptually orthogonal to whether the MCMC estimators have converged.

Fourth, the Pareto k diagnostic is a diagnostic to flag when the PSIS-LOO approximation to true leave-one-out cross-validation is failing. In some contexts, it can also be indicative of misspecification. Again, it is never a convergence diagnostic. For more about PSIS-LOO and Pareto k, see here LOO package glossary — loo-glossary • loo


Jacob - thank you very much, this is immensely helpful to me - it clears up this fact that k has nothing to do with MCMC convergence.
My confusion arose as I was reading [Vehtari, A., et al (2019)](https://preprint arXiv:1507.02646) - in particular the passage `k is a good convergence diagnostic in finite samples and in high dimensions’ . I now see that this convergence here is indicating the point at which importance sampling estimator is valid - I should have read this more carefully.

I still have this outstanding question of why the results from loo are so poor - I read on the reference manual for the function that:

If p_loo < p and the number of parameters p is relatively large compared to the number of observations (e.g., p>N/5), it is likely that the model is so flexible or the population prior so weak that it’s difficult to predict the left out observation (even for the true model). This happens, for example, in the simulated 8 schools (in VGG2017), random effect models with a few observations per random effect, and Gaussian processes and spatial models with short correlation lengths.

It’s hard for me to disentangle the misspecification / flexibility in this particular case - is my model miss-specified or too flexible ?

And moreover - is loo even an appropriate metric to compare model specification and accuracy in this contaminated controls example ?
One can envision a model which is “wrong” by design - made to fit the actual DGP as opposed to this synthetic DGP, which we would prefer but would perform poorly on fit metrics.
Indeed I think much of the structure added to the model - like offset and contamination layer - is likely decreasing in-sample fit by at least increasing the number of parameters, and also actively ignoring in-sample observations (the contamination part of the model effectively pretends 0-s are 1s, for example).

My final question is about the Min. n_eff: is it at all surprising that we have large n_eff in the “very bad” category ? I would have expected that number to be smaller than for the good and ok categories.

I don’t have time to really understand the model in depth right now :(

But I think I can help with a few of these questions:

So the question here is how many parameters are in your model and how many data points you are fitting. By comparing these two numbers to p_loo and to each other, you should be able to figure out which regime you are in. If you’re using rstan, then rstan::get_num_upars will help you find how many parameters are in the model.

I don’t understand exactly what the model is doing, but in general I would suggest trying to write down a model that captures the true data-generating process as nearly as possible. I think that the contamination layer is part of the assumed true data-generating process that generates the data that the model sees. If it’s not, then I don’t understand the motivation for including it.
Note that loo prioritizes good predictive performance, which is not the same thing as capturing the true data-generating process. With finite data and a complicated DGP, the true DGP might be impossible to fit (i.e. attempting to fit the true DGP might be prone to non-identification and very sensitive to the priors), and might yield really bad predictive performance.
Good predictive performance is not the same thing as accurate uncertainty quantification around causal effect sizes of interest. If your goal is inference about effect sizes and you believe a priori that some particular feature of the model, or some nuisance covariate, is important, then you might choose to leave that model feature in your final model even if loo says that doing so worsens predictive performance.

Perhaps @avehtari will chime in about whether this is surprising or not, but I don’t know of any reason to be surprised. For what it’s worth, the loo package documentation says:

the PSIS effective sample size estimate will be over-optimistic when the estimate of 𝑘k is greater than 0.7 .


In the very bad category the effective sample size estimate is very unreliable. I think we could report effective sample size 0 in that category. I made an issue for loo package about this.