Dear Stan Community,
I am trying to apply the convergence diagnostics from the loo package to the model below. I am getting results which I find confusing, and was hoping you could help me understand what they mean and what next steps I should take.
For reference, the underlying data is a sample of recruits to a terrorist organisation, stacked on top of a sample of unlabelled controls (more info here). Notable non-standard features of this model are:
- a contamination layer, implemented as a mixture of Bernoulli distributions, which accounts for the artificial data-generating process;
- an offset on the logit scale, indicating the known baseline rate of recruitment, used to obtain unbiased estimates of the intercept from this non-representative sample;
- a set of BYM2 spatial random effects, indicating exposure to recruitment according to spatial proximity to recruits.
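To illustrate the offset idea outside of Stan, here is a small Python sketch (the baseline rate below is made up for illustration): with the offset added to the linear predictor, an intercept of zero reproduces the known baseline rate, so the sampled intercept is interpretable relative to the true base rate.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical known baseline recruitment rate in the population
baseline_rate = 0.001
log_offset = math.log(baseline_rate / (1 - baseline_rate))  # logit of the baseline

# With the offset in the linear predictor, an intercept of 0
# recovers the baseline rate exactly
intercept = 0.0
rho = inv_logit(log_offset + intercept)
print(rho)  # equals baseline_rate when intercept == 0
```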
Q1. I have implemented the log-likelihood of the observations as:
generated quantities {
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = log_mix(rho[i],
                         bernoulli_lpmf(Y[i] | xi[1]),
                         bernoulli_lpmf(Y[i] | xi[2]));
  }
}
My first question is: is this the correct specification of the log-likelihood? I worry a mistake here might be causing the weirdness in the loo output. Moreover, could the fact that the DGP is artificial (stacking cases over unlabelled controls) be contributing to the strangeness in the loo output?
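To check my understanding of what this block computes, here is a pure-Python sketch of the pointwise log-likelihood, with a stable log-sum-exp standing in for Stan's `log_mix` and hypothetical values for `rho` and `theta`:

```python
import math

def bernoulli_lpmf(y, p):
    # log Bernoulli pmf; guard the p == 0 edge case that arises from xi[2] = 0
    if p == 0.0:
        return 0.0 if y == 0 else -math.inf
    return y * math.log(p) + (1 - y) * math.log(1 - p)

def log_mix(w, lp1, lp2):
    # Stan's log_mix: log(w * exp(lp1) + (1 - w) * exp(lp2)), computed stably
    a, b = math.log(w) + lp1, math.log1p(-w) + lp2
    m = max(a, b)
    if m == -math.inf:
        return -math.inf
    return m + math.log(math.exp(a - m) + math.exp(b - m))

# Hypothetical values mirroring the model, where xi = (theta, 0)
rho, theta = 0.3, 0.8
ll_1 = log_mix(rho, bernoulli_lpmf(1, theta), bernoulli_lpmf(1, 0.0))
ll_0 = log_mix(rho, bernoulli_lpmf(0, theta), bernoulli_lpmf(0, 0.0))
# For y = 1 the second component is impossible, so ll_1 == log(rho * theta);
# for y = 0 it is log(rho * (1 - theta) + (1 - rho))
print(ll_1, ll_0)
```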
Q2. From the loo results below (Convergence.Khat.pdf (25.7 KB)), I understand the model seems not to be converging according to the k-hat statistic. This contradicts R-hat, which is very well behaved across all parameters (Convergence.Rhat.pdf (25.1 KB)), and I would describe the traceplots for the regression coefficients of interest (Convergence.Traceplots.Beta.pdf (90.3 KB)) as reasonably well mixed.
What does this mean in practice? I have two convergence diagnostics giving me contradictory advice. More generally, how do I reconcile these differences?
Computed from 1008 by 980 log-likelihood matrix

         Estimate   SE
elpd_loo   -321.1 17.1
p_loo        95.8  5.9
looic       642.2 34.2
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                        Count  Pct.   Min. n_eff
(-Inf, 0.5]  (good)       717  73.2%         151
 (0.5, 0.7]  (ok)         133  13.6%         116
   (0.7, 1]  (bad)         74   7.6%          38
   (1, Inf)  (very bad)    56   5.7%         839

See help('pareto-k-diagnostic') for details.
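My current (possibly wrong) understanding is that the two diagnostics measure different things: R-hat diagnoses chain mixing, while k-hat diagnoses the tail of the pointwise importance ratios used by PSIS-LOO, which can be heavy even when the chains have converged. Here is a toy Python sketch of that idea, using a crude Hill estimator as a stand-in for the generalized-Pareto fit that loo actually performs, on simulated importance ratios:

```python
import math
import random

random.seed(1)

def hill_tail_index(ratios, m):
    # Crude Hill estimator of the tail shape of importance ratios --
    # a rough stand-in for loo's generalized-Pareto fit to the largest ratios
    xs = sorted(ratios, reverse=True)[: m + 1]
    threshold = xs[-1]
    return sum(math.log(x / threshold) for x in xs[:-1]) / m

# Light-tailed ratios (a well-behaved observation): small tail index
light = [math.exp(random.gauss(0, 0.5)) for _ in range(1000)]
# Heavy-tailed ratios (an influential observation): tail index above 1,
# i.e. the importance-weighted estimate has infinite variance ("very bad")
heavy = [random.paretovariate(0.8) for _ in range(1000)]

print(hill_tail_index(light, 100), hill_tail_index(heavy, 100))
```

If this mental model is right, the heavy-tailed case is "not converging" in the importance-sampling sense even though the posterior draws themselves mix perfectly well.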
Q3. Looking at the Min. n_eff column, I am surprised by the "very bad" category: I would have expected its n_eff to be very small, and yet the loo package reports a very large n_eff. How should I interpret this?
Q4. If the k-hat statistic is ultimately to be trusted, what advice could you give on reparametrising the model to improve this metric?
The Stan code for the model is below. Thanks in advance for all your help.
data {
  int<lower = 1> n;           // total number of observations
  int<lower = 1> p;           // number of covariates in design matrix
  real log_offset;            // offset
  real theta;                 // Pr(Y = 1 | r = 1, s = 1)
  matrix[n, p] X;             // design matrix
  int<lower = 0> Y[n];        // vector of labels
  int<lower = 1> N_dist;      // number of small areas
  int<lower = 1> N_gov;       // number of large areas
  int<lower = 1> dist_id[n];  // small-area id
  int<lower = 1> gov_id[n];   // large-area id
  // Data for the improper, efficient spatial prior:
  int<lower = 0> N_dist_edges;
  int<lower = 1, upper = N_dist> node1[N_dist_edges]; // node1[i] adjacent to node2[i]
  int<lower = 1, upper = N_dist> node2[N_dist_edges]; // and node1[i] < node2[i]
  real scaling_factor;        // scaling factor derived from the adjacency matrix
}
parameters {
  vector[N_dist] phi;                // small-area random effect - main
  vector[N_dist] psi;                // small-area random effect - spatial
  vector[N_gov] eta;                 // large-area random effect
  real<lower = 0, upper = 1> lambda; // mixing weight
  real<lower = 0> sigma;             // sd of small-area random effect
  real<lower = 0> tau_eta;           // precision of large-area random effect
  vector[p] beta;                    // fixed-effects coefficients
}
transformed parameters {
  real<lower = 0> sigma_eta;              // sd of large-area random effect
  real<lower = 0> tau;                    // precision of small-area random effect
  vector[N_dist] gamma;                   // convolved small-area effect
  vector[n] mu;                           // logit-scale propensity to be a recruit - under the non-probability sampling framework
  vector[n] mu_star;                      // logit-scale propensity to be a recruit - in absence of the non-probability sampling framework
  real<lower = 0, upper = 1> rho[n];      // propensity to be a recruit - under the non-probability sampling framework
  real<lower = 0, upper = 1> rho_star[n]; // propensity to be a recruit - in absence of the non-probability sampling framework
  real<lower = 0, upper = 1> xi[2];       // probability of being labelled a recruit
  sigma_eta = sqrt(1 / tau_eta);          // large-area sd
  tau = pow(sigma, -2);                   // small-area precision
  // variance of each component should be approximately equal to 1
  gamma = sqrt(1 - lambda) * phi + sqrt(lambda / scaling_factor) * psi;
  // linear predictor for the logit-scale propensity to be a recruit
  mu = log_offset + X * beta + gamma[dist_id] * sigma + eta[gov_id] * sigma_eta;
  mu_star = X * beta + gamma[dist_id] * sigma + eta[gov_id] * sigma_eta;
  for (i in 1:n) {
    rho[i] = inv_logit(mu[i]);           // propensity to be a recruit
    rho_star[i] = inv_logit(mu_star[i]); // propensity to be a recruit
  }
  // labelling propensities for each candidate model in the mixture
  xi[1] = theta;
  xi[2] = 0;
}
model {
  // // // Coefficients
  beta[1] ~ cauchy(0, 10);    // prior on the intercept
  for (i in 2:p) {
    beta[i] ~ cauchy(0, 2.5); // prior on fixed effects, excluding the intercept
  }
  // // // Random effect on small area
  phi ~ normal(0, 1);
  // // // ICAR prior
  target += -0.5 * dot_self(psi[node1] - psi[node2]);
  // soft sum-to-zero constraint on psi,
  // equivalent to mean(psi) ~ normal(0, 0.01)
  sum(psi) ~ normal(0, 0.01 * N_dist);
  // // // Random effect on large area
  eta ~ normal(0, 1);
  tau_eta ~ gamma(0.1, 0.1);
  // // // Convolution parameters
  lambda ~ beta(0.5, 0.5); // dirichlet(alpha);
  sigma ~ normal(0, 1);
  // // // Contamination: labels distributed as a mixture of Bernoulli distributions
  for (i in 1:n) {
    target += log_mix(rho[i],
                      bernoulli_lpmf(Y[i] | xi[1]),
                      bernoulli_lpmf(Y[i] | xi[2]));
  }
}
generated quantities {
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = log_mix(rho[i],
                         bernoulli_lpmf(Y[i] | xi[1]),
                         bernoulli_lpmf(Y[i] | xi[2]));
  }
}
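As a sanity check on the BYM2 convolution above (not directly on the loo issue), here is a quick plain-Python simulation, using independent normal draws in place of the actual ICAR structure, confirming that `gamma` has roughly unit variance for any `lambda` when each component is scaled to unit variance:

```python
import math
import random

random.seed(2)
n, lam, scaling_factor = 100000, 0.4, 2.5

# phi ~ N(0, 1); psi drawn so that psi / sqrt(scaling_factor) has unit
# variance, mimicking the BYM2 convention that each component contributes
# unit variance (hypothetical lam and scaling_factor values)
phi = [random.gauss(0, 1) for _ in range(n)]
psi = [random.gauss(0, math.sqrt(scaling_factor)) for _ in range(n)]

gamma = [math.sqrt(1 - lam) * p + math.sqrt(lam / scaling_factor) * q
         for p, q in zip(phi, psi)]

# (1 - lam) * 1 + (lam / scaling_factor) * scaling_factor == 1
var_gamma = sum(g * g for g in gamma) / n
print(var_gamma)  # close to 1 for any lam in [0, 1]
```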