I have additional loo data now, but could use some help making sense of what I see. To begin with, what exactly does se_diff reflect? The first table in the OP shows that for the first fit of model B (identified as B1), se_diff is reported as ~5 in a comparison against the first fit of model A (identified as A1). Doesn’t this imply that if I keep fitting both models with the same iter but different seeds, and loo_compare’ing the resultant pairs indefinitely, then the individual pairwise elpd_diffs should center on a mean of 2.9 with an average deviation of approximately 5? Isn’t that pretty much the definition of “standard error”? Yet this is not what we observe empirically, at least not for the first five fits:
> (pairwise.diffs <- cbind(elpd_diff = sapply(
    lapply(1:5, function(x) loo_compare(get(paste0("A.", x, ".loo")), get(paste0("B.", x, ".loo")))),
    function(y) ifelse(startsWith(rownames(y)[1], "A"), yes = y[2, "elpd_diff"], no = -y[2, "elpd_diff"]))))
elpd_diff
[1,] 2.9124594
[2,] 0.6537764
[3,] 0.9011423
[4,] 2.3322600
[5,] 3.9483423
> sd(pairwise.diffs)
[1] 1.382654 # Nowhere near 5.
Or is it the case that se_diff doesn’t actually refer to the expected SD of elpd_diffs under repeated fitting of the same two models to the same data, but rather to the expected SD of elpd_diffs in the unlikely scenario where we repeatedly collect a new, identically sized dataset from the same population and fit the same pair of models to each one?
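For context, my working assumption (which may well be mistaken) is that loo_compare derives se_diff from the observation-wise elpd contributions, i.e. the SD of the pointwise elpd differences scaled by sqrt(N), and not from anything seed-related. Something like the following sketch, using the first pair of loo objects from the code above:

# My (possibly mistaken) reconstruction of se_diff from the pointwise elpd values
# of the first pair of fits; sign convention as above: B relative to A.
diff_pointwise <- get("B.1.loo")$pointwise[, "elpd_loo"] -
                  get("A.1.loo")$pointwise[, "elpd_loo"]
c(elpd_diff = sum(diff_pointwise),
  se_diff   = sqrt(length(diff_pointwise)) * sd(diff_pointwise))
# If that understanding is right, this should reproduce the ~2.9 and ~5 from the OP’s table.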
But if the latter is true, then what is the source of the spread observed among the five pairwise comparisons above? On one hand, that spread is much smaller than the “official” se_diff of ~5. On the other, it seems to be much larger than any difference attributable to MCMC_SE alone, judging from the following: I bit the bullet and did as suggested by @jsocolar, fitting one gigantic model for each operationalization, with 100,000 post-warmup samples apiece (consuming a frightful amount of computational resources). All three loo-object-wise MCMC_SEs decreased to 0.1 (having been reported at 0.3 for the first 5 × 3 models). Yet the pairwise se_diffs are virtually unchanged (again using A as the baseline of comparison):
          elpd_loo  se_elpd_loo  elpd_diff  se_diff
B.huge   -2086.812       40.991      2.525    5.029
A.huge   -2089.337       41.025      0.000    0.000
C.huge   -2090.590       41.015     -1.253    2.176
Was anything gained by this huge computational investment? se_diff has reportedly decreased by just 0.002, despite the MCMC_SEs having shrunk by two-thirds.
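For what it’s worth, here is my back-of-envelope reasoning (assuming, perhaps naively, that the MCMC errors of the two elpd_loo estimates are independent) for why the spread in the first code block already looked too large to be pure MCMC noise, and why the huge fits could barely move se_diff:

# If the MCMC errors of the two elpd_loo estimates were independent,
# seed-to-seed variation in elpd_diff should be on the order of
sqrt(0.3^2 + 0.3^2)  # ~0.42 for the original fits (MCMC_SE = 0.3 each)
sqrt(0.1^2 + 0.1^2)  # ~0.14 for the huge fits (MCMC_SE = 0.1 each)
# whereas the five pairwise elpd_diffs above have an sd of ~1.38,
# and se_diff itself sits at ~5.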
Or is the “gain” perhaps extant yet hidden from plain sight? If I could muster the computational resources to fit these huge models five times instead of just once, would the considerable variability observed across the model-pair-wise elpd_diffs in the first code block now have disappeared?
So many questions aching for an answer…