# Confusing results when using LOO to compare spline models

Hi,

I’m new to both splines and LOO CV so please bear with me if I’m making an obvious mistake :-)

I’m using data from an experiment with a binary outcome, four treatment groups and a continuous distance variable (N = 9805). I’m basically trying to model the nonlinear effect of this distance on the binary outcome. I’m trying to use splines to accomplish this, but I also tried linear, quadratic and cubic models just to see how they would differ and how they would compare in terms of LOO CV.

I’m looking at two types of splines:

1. O’Sullivan splines
2. B-splines

As expected, the splines compare favorably to the linear, quadratic, and cubic models, but what is confusing me is that one of the spline models (B-splines) looks badly overfit to the data, yet LOO reports it as superior. Here’s the model fit over a grid of distances spanning the observed range (the different facets are just the different arms of the study):

For the B-splines I followed the Stan splines case study and used a random-walk prior to smooth the coefficients. I tried two different priors:

1. \tau \sim N(0, 0.05) (in red in the above plot)
2. \tau \sim N(0, 0.005) (in green in the above plot)
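For readers unfamiliar with it, the random-walk smoothing prior from the case study looks roughly like this in Stan (a sketch only, not my actual `dist_splines3.stan`; `num_basis`, `a_raw`, and `tau` are illustrative names):

```stan
data {
  int<lower=1> num_basis;   // number of B-spline basis functions
}
parameters {
  vector[num_basis] a_raw;  // standardized innovations
  real<lower=0> tau;        // smoothing scale
}
transformed parameters {
  vector[num_basis] a;      // spline coefficients
  a[1] = a_raw[1];
  for (i in 2:num_basis)
    a[i] = a[i - 1] + a_raw[i] * tau;  // a[i] - a[i-1] ~ normal(0, tau)
}
model {
  a_raw ~ normal(0, 1);
  tau ~ normal(0, 0.05);    // smaller scale => smoother fit
}
```

A smaller scale on `tau` shrinks the differences between adjacent coefficients toward zero, which is what makes the green fit smoother than the red one.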

I’m attaching my Stan code here: dist_splines3.stan (9.5 KB).

I’m using {rstan} v2.18.2 and {loo} v2.1.0. Spline knots were generated using {splines2} v0.2.8.

I expected the O’Sullivan models to be reported as the best, followed by the stronger prior B-splines model (BSPLINE2), but instead I got the following:

semiparam_loo <- semiparam_fit %>%
map(extract_log_lik, merge_chains = FALSE) %>%
map(~ loo(.x, r_eff = relative_eff(exp(.x)), cores = 2))

semiparam_loo %>%
loo_compare()

elpd_diff se_diff
SEMIPARAM_NONLINEAR_DIST_BSPLINE      0.0       0.0
SEMIPARAM_NONLINEAR_DIST_BSPLINE2  -122.1      13.7
SEMIPARAM_NONLINEAR_DIST_OSULLIVAN -140.5      15.2


Here’s the loo print() output for each model:

> semiparam_loo$SEMIPARAM_NONLINEAR_DIST_OSULLIVAN %>% print()

Computed from 1200 by 9805 log-likelihood matrix

         Estimate   SE
elpd_loo  -6228.0 31.9
p_loo        12.5  0.2
looic     12456.0 63.9
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

> semiparam_loo$SEMIPARAM_NONLINEAR_DIST_BSPLINE2 %>% print()

Computed from 1200 by 9805 log-likelihood matrix

Estimate   SE
elpd_loo  -6209.6 32.0
p_loo        19.8  0.2
looic     12419.2 63.9
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> semiparam_loo$SEMIPARAM_NONLINEAR_DIST_BSPLINE %>% print()

Computed from 1200 by 9805 log-likelihood matrix

         Estimate   SE
elpd_loo  -6087.5 34.7
p_loo        80.2  0.7
looic     12174.9 69.4
------
Monte Carlo SE of elpd_loo is 0.2.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

It looks like the data are clustered. I assume the black dots in the figure are averages over small intervals, in which case they should look binomially distributed; instead they appear overdispersed relative to that, which could be explained by clustering, and clustering can confuse both the spline fit and loo.

Thanks, Aki, for your response. The data is indeed from a cluster-randomized experiment. I modified my model to include a hierarchical cluster effect (\alpha_j below), and the fit I’m getting now looks like this (I’m including linear and quadratic models and dropping the O’Sullivan model since it is essentially the same as the B-splines model).

\begin{align}
Y_i &\sim Bernoulli(\Phi(-v^*_{j[i]})) \\
v^*_j &= \alpha + \alpha_j + \beta_1\cdot Z_j + \underbrace{\beta_2\cdot (Z_j\times D_j)}_\text{linear effect} + \underbrace{f(D_j, Z_j;\delta)}_\text{nonlinear effect} \\
\alpha_j &\sim N(0, \tau_\alpha)
\end{align}

• i, individual
• j[i], cluster of individual i
• Z_j, treatment vector
• D_j, distance

and loo reports (comparison at the bottom):

> loo_obj[c("LINEAR_DIST", "QUADRATIC_NONLINEAR_DIST", "SEMIPARAM_NONLINEAR_DIST_BSPLINE")]
$LINEAR_DIST

Computed from 1200 by 9805 log-likelihood matrix

Estimate   SE
elpd_loo  -6019.2 36.5
p_loo       118.1  1.2
looic     12038.4 73.0
------
Monte Carlo SE of elpd_loo is 0.2.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

$QUADRATIC_NONLINEAR_DIST

Computed from 1200 by 9805 log-likelihood matrix

         Estimate   SE
elpd_loo  -6018.3 36.6
p_loo       117.2  1.2
looic     12036.5 73.2
------
Monte Carlo SE of elpd_loo is 0.2.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

$SEMIPARAM_NONLINEAR_DIST_BSPLINE

Computed from 1200 by 9805 log-likelihood matrix

Estimate   SE
elpd_loo  -6019.3 36.6
p_loo       118.6  1.2
looic     12038.7 73.2
------
Monte Carlo SE of elpd_loo is 0.2.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> loo_obj[c("LINEAR_DIST", "QUADRATIC_NONLINEAR_DIST", "SEMIPARAM_NONLINEAR_DIST_BSPLINE")] %>%
+     loo_compare()
                                 elpd_diff se_diff
QUADRATIC_NONLINEAR_DIST           0.0       0.0
LINEAR_DIST                       -0.9       1.1
SEMIPARAM_NONLINEAR_DIST_BSPLINE  -1.1       1.0
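In case it’s useful context, the hierarchical cluster effect \alpha_j above is implemented along these lines in Stan (a sketch with illustrative names and a non-centered parameterization, not my actual model code):

```stan
data {
  int<lower=1> num_clusters;
}
parameters {
  real alpha;                       // global intercept
  vector[num_clusters] alpha_raw;   // standardized cluster effects
  real<lower=0> tau_alpha;          // between-cluster scale
}
transformed parameters {
  // implies alpha_cluster[j] ~ normal(0, tau_alpha)
  vector[num_clusters] alpha_cluster = tau_alpha * alpha_raw;
}
model {
  alpha_raw ~ normal(0, 1);
  // v_star[j] = alpha + alpha_cluster[j] + treatment and distance terms ...
}
```

The non-centered form tends to sample better when the data per cluster are sparse, since the cluster effects and their scale are less correlated in the posterior.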

• Is it safe to say that the problem with over-dispersion has been addressed and a linear fit is just as good as the two nonlinear ones?
• Would it be better to make this a multilevel model with respect to the other model coefficients (not just a cluster effect, i.e., \beta_1, \beta_2, and \delta above)? I’m particularly interested in investigating the apparent non-linearity in the last treatment arm (“bracelet”), but want to avoid making any spurious conclusions based on the model I happen to be using.

Thanks!

I cannot say “safe” without seeing more details, but given the information you provided I can say that, based on loo, in this case a linear fit is just as good as the two nonlinear ones.

Often multilevel models are better, but how much better varies.

Got it. I’ll continue working on this multilevel issue.

One last issue: as I’m learning how to use loo I tried to run the same model, but instead of a Bernoulli, I’m using a Binomial model over the experiment clusters. Basically, I have this in my model section:

if (use_binomial) {
  cluster_takeup_count ~ binomial(cluster_size, cluster_takeup_prob);
} else {
  takeup ~ bernoulli(cluster_takeup_prob[obs_cluster_id]);
}


and in the generated quantities section I have:

if (use_binomial) {
  for (cluster_index in 1:num_clusters) {
    log_lik[cluster_index] = binomial_lpmf(cluster_takeup_count[cluster_index] | cluster_size[cluster_index], cluster_takeup_prob[cluster_index]);
  }
} else {
  for (obs_index in 1:num_obs) {
    log_lik[obs_index] = bernoulli_lpmf(takeup[obs_index] | cluster_takeup_prob[obs_cluster_id[obs_index]]);
  }
}


As far as I understand, the difference is that with the Binomial model I’m “leaving out” at the cluster level (144 clusters), as opposed to the observation level (9805 observations) with the Bernoulli model.

When I use loo() I get problems, and I’m not sure why:

> loo_obj[c("LINEAR_DIST", "QUADRATIC_NONLINEAR_DIST", "SEMIPARAM_NONLINEAR_DIST_BSPLINE")]
$LINEAR_DIST

Computed from 1200 by 144 log-likelihood matrix

         Estimate   SE
elpd_loo   -468.9  4.8
p_loo        98.3  4.3
looic       937.9  9.6
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      7     4.9%   152
 (0.5, 0.7]   (ok)       36    25.0%   62
   (0.7, 1]   (bad)      81    56.2%   11
   (1, Inf)   (very bad) 20    13.9%   5
See help('pareto-k-diagnostic') for details.

$QUADRATIC_NONLINEAR_DIST

Computed from 1200 by 144 log-likelihood matrix

Estimate   SE
elpd_loo   -469.5  5.1
p_loo        98.8  4.6
looic       939.0 10.3
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      4     2.8%   207
 (0.5, 0.7]   (ok)       31    21.5%   63
   (0.7, 1]   (bad)      90    62.5%   11
   (1, Inf)   (very bad) 19    13.2%   7
See help('pareto-k-diagnostic') for details.

\$SEMIPARAM_NONLINEAR_DIST_BSPLINE

Computed from 1200 by 144 log-likelihood matrix

Estimate  SE
elpd_loo   -469.4 4.6
p_loo        98.5 4.1
looic       938.9 9.2
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      4     2.8%   141
 (0.5, 0.7]   (ok)       37    25.7%   80
   (0.7, 1]   (bad)      89    61.8%   13
   (1, Inf)   (very bad) 14     9.7%   3
See help('pareto-k-diagnostic') for details.



Yes,

When you leave out more data, the cross-validation posterior is more different from the full posterior. If the difference is too big, PSIS-LOO (and WAIC) fail in the adjustment. Here p_loo is estimated to be high relative to the number of observations (about 98 out of 144), which also supports the assumption that many Binomial observations are highly influential on some parameters. I recommend using K-fold-CV in this case.
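To make the suggestion concrete, here is a rough R sketch of grouped K-fold CV using {loo}’s `kfold_split_grouped()` (assumes loo ≥ 2.0; `obs_cluster_id` is the cluster index per observation, and `refit_without()` is a placeholder you would implement that refits the Stan model without the held-out observations and returns an S × n_heldout log-likelihood matrix):

```r
library(loo)

# Hold out whole clusters together so the CV matches the cluster-level task
K <- 10
fold_id <- kfold_split_grouped(K = K, x = obs_cluster_id)

elpd_pointwise <- numeric(length(obs_cluster_id))
for (k in 1:K) {
  held_out <- which(fold_id == k)
  log_lik_k <- refit_without(held_out)  # placeholder: refit and predict fold k
  # log mean predictive density over posterior draws (numerically stable)
  elpd_pointwise[held_out] <- apply(log_lik_k, 2, function(x) {
    m <- max(x)
    m + log(mean(exp(x - m)))
  })
}
elpd_kfold <- sum(elpd_pointwise)  # comparable to elpd_loo
```

Unlike PSIS-LOO, this actually refits the model K times, so it stays reliable even when each held-out cluster is highly influential.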

Thanks, Aki. This has been very helpful and I really appreciate it.