Presenting small elpd_diffs based on multiple fits of the same models

I have additional loo data now, but could use some help making sense of what I see. To begin with, what exactly does se_diff reflect? The first table in the OP shows that, for the first fit of model B (identified as B1) compared against the first fit of model A (identified as A1), se_diff is reported as ~5. Doesn’t this imply that if I kept fitting both models with the same iter but different seeds, and loo_compare’ing the resulting pairs indefinitely, the pairwise elpd_diffs should scatter around a mean of 2.9 with a standard deviation of roughly 5? Isn’t that pretty much the definition of a standard error? Yet that is not what I observe empirically, at least not across the first five pairs of fits:

> # elpd_diff for each of the five fit pairs, signed so it is always elpd(B) - elpd(A)
> (pairwise.diffs <- cbind(elpd_diff = 
    sapply(
      lapply(1:5, function(x) loo_compare(get(paste0("A.", x, ".loo")), get(paste0("B.", x, ".loo")))), 
      function(y) ifelse(startsWith(rownames(y)[1], "A"),
                         yes = y[2, "elpd_diff"], no = -y[2, "elpd_diff"]))))

     elpd_diff
[1,] 2.9124594
[2,] 0.6537764
[3,] 0.9011423
[4,] 2.3322600
[5,] 3.9483423

> sd(pairwise.diffs)
[1] 1.382654 # Nowhere near 5.
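
Digging into the loo documentation, my understanding is that se_diff is computed from the pointwise elpd contributions: take the per-observation differences between the two models and multiply their SD by sqrt(n). If that’s right, se_diff quantifies uncertainty over observations rather than over refits, which may already answer part of my question. A minimal sketch for the first fit pair (relying on the pointwise matrix stored in each loo object) would be:

# Per-observation elpd contributions stored in the two loo objects (first fit pair)
elpd_A <- A.1.loo$pointwise[, "elpd_loo"]
elpd_B <- B.1.loo$pointwise[, "elpd_loo"]

# Pointwise differences, oriented so that positive values favour B (as above)
d <- elpd_B - elpd_A

# elpd_diff is the sum of the pointwise differences;
# se_diff should then be sqrt(n) times their SD
c(elpd_diff = sum(d), se_diff = sqrt(length(d)) * sd(d))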

Or does se_diff not refer to the expected SD of elpd_diff under repeated fitting of the same two models to the same data, but rather to its expected SD under the hypothetical scenario in which we repeatedly collect a new, identically sized dataset from the same population and fit the same pair of models to each one?

But if the latter is true, then what is the source of the spread observed among the five pairwise comparisons above? On one hand, that spread is much smaller than the “official” se_diff of ~5. On the other hand, it seems much larger than anything attributable to Monte Carlo error alone, judging from the following.

I bit the bullet and did as suggested by @jsocolar, fitting one gigantic model for each operationalization, with 100,000 post-warmup samples apiece (consuming a frightful amount of computational resources). The Monte Carlo SE of elpd_loo dropped to 0.1 for all three loo objects, having been reported as 0.3 for each of the first 5 × 3 models. Yet the pairwise se_diffs are virtually unchanged (again using A as the baseline of comparison):

        elpd_loo se_elpd_loo elpd_diff se_diff
B.huge -2086.812      40.991     2.525   5.029
A.huge -2089.337      41.025     0.000   0.000
C.huge -2090.590      41.015    -1.253   2.176

Was anything gained by this huge computational investment? se_diff decreased by just 0.002, despite the Monte Carlo SEs having shrunk by two-thirds.
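
For what it’s worth, here is the back-of-envelope arithmetic behind my earlier claim that Monte Carlo error alone cannot account for the spread in the first code block: if the MC errors of the two elpd_loo estimates in a pair are independent (my assumption), the refit-to-refit SD of their difference should be roughly the root-sum-of-squares of the two MCSEs.

# Refit-to-refit SD of elpd_diff expected from Monte Carlo error alone,
# assuming independent MC errors in the two elpd_loo estimates
sqrt(0.3^2 + 0.3^2)  # ~0.42 for the original fits (MCSE 0.3 each)
sqrt(0.1^2 + 0.1^2)  # ~0.14 for the huge fits (MCSE 0.1 each)
# versus the observed SD of ~1.38 across the five pairwise comparisons above

So even under that (possibly naive) independence assumption, MC noise would explain only a fraction of the 1.38.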

Or is the “gain” real but simply not visible here? If I could muster the computational resources to fit these huge models five times instead of just once, would the considerable variability in pairwise elpd_diffs seen in the first code block now have largely disappeared?
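
If I ever do find the resources, I assume that check would look just like the first code block, only with five refits of each huge model; the loo object names below are hypothetical.

# Hypothetical: five refits of each huge model, with loo objects named
# A.huge.1.loo ... A.huge.5.loo and B.huge.1.loo ... B.huge.5.loo
huge.diffs <- sapply(1:5, function(x) {
  cmp <- loo_compare(get(paste0("A.huge.", x, ".loo")), get(paste0("B.huge.", x, ".loo")))
  ifelse(startsWith(rownames(cmp)[1], "A"), yes = cmp[2, "elpd_diff"], no = -cmp[2, "elpd_diff"])
})
sd(huge.diffs)  # if MC noise were the whole story, this should now be close to ~0.14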

So many questions aching for an answer…