Hi,
I’m new to both splines and LOO CV so please bear with me if I’m making an obvious mistake :-)
I’m using data from an experiment with a binary outcome, four treatment groups and a continuous distance variable (N = 9805). I’m basically trying to model the nonlinear effect of this distance on the binary outcome. I’m trying to use splines to accomplish this, but I also tried linear, quadratic and cubic models just to see how they would differ and how they would compare in terms of LOO CV.
I’m looking at two types of splines:
- O’Sullivan splines
- B-splines
As expected splines compare favorably to linear, quadratic and cubic models, but what is confusing me is that one of the spline models (B-splines) looks very overfit to the data yet LOO is showing it to be superior. Here’s what the model fit for a grid of distances along the range of observed outcomes (the different facets are just the different arms of the study):
For the B-splines I followed the Stan splines case study and used a random walk prior to smooth the coefficients. I tried two different priors with:
- \tau \sim N(0, 0.05) (in red in the above plot)
- \tau \sim N(0, 0.005) (in green in the above plot)
I’m attaching my Stan code here: dist_splines3.stan (9.5 KB).
I’m using {rstan} v2.18.2 and {loo} v2.1.0. Spline knots were generated using {splines2} v0.2.8.
I expected the O’Sullivan models to be reported as the best, followed by the stronger prior B-splines model (BSPLINE2), but instead I got the following:
semiparam_loo <- semiparam_fit %>%
map(extract_log_lik, merge_chains = FALSE) %>%
map(~ loo(.x, r_eff = relative_eff(exp(.x)), cores = 2))
semiparam_loo %>%
loo_compare()
elpd_diff se_diff
SEMIPARAM_NONLINEAR_DIST_BSPLINE 0.0 0.0
SEMIPARAM_NONLINEAR_DIST_BSPLINE2 -122.1 13.7
SEMIPARAM_NONLINEAR_DIST_OSULLIVAN -140.5 15.2
Here’s the loo print()
output for each model:
> semiparam_loo$SEMIPARAM_NONLINEAR_DIST_OSULLIVAN %>% print()
Computed from 1200 by 9805 log-likelihood matrix
Estimate SE
elpd_loo -6228.0 31.9
p_loo 12.5 0.2
looic 12456.0 63.9
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> semiparam_loo$SEMIPARAM_NONLINEAR_DIST_BSPLINE2 %>% print()
Computed from 1200 by 9805 log-likelihood matrix
Estimate SE
elpd_loo -6209.6 32.0
p_loo 19.8 0.2
looic 12419.2 63.9
------
Monte Carlo SE of elpd_loo is 0.1.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> semiparam_loo$SEMIPARAM_NONLINEAR_DIST_BSPLINE %>% print()
Computed from 1200 by 9805 log-likelihood matrix
Estimate SE
elpd_loo -6087.5 34.7
p_loo 80.2 0.7
looic 12174.9 69.4
------
Monte Carlo SE of elpd_loo is 0.2.
All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.