Presenting small elpd_diffs based on multiple fits of the same models

I’m comparing operationalizations A, B, and C of the same predictor in an unordered categorical multilevel model. There’s a lot of uncertainty due to data sparseness (30 covariates in a dataset of 2217).

In a single comparison, operationalization B looks the most promising, with elpd_diff = 2.9 and se_diff = 5.0 compared to operationalization A. Operationalization C doesn’t look so good, with elpd_diff = -1.1 and se_diff = 2.0. But with the SEs larger than the differences, there’s reason to fear that this ranking might reflect nothing but random fluctuations of elpd_diff.

To guard against this danger, I have now fit all three models a total of five times using a different seed for each fit. The results are seen below. The first fit of operationalization A has been used as baseline in each comparison.

     elpd_loo se_elpd_loo elpd_diff se_diff
B.1 -2085.783      40.930     2.912   5.011
B.5 -2086.004      40.936     2.692   5.012
B.4 -2086.316      40.980     2.380   5.042
B.2 -2087.295      40.990     1.400   5.037
A.2 -2087.949      40.952     0.747   0.512
B.3 -2088.020      41.031     0.676   5.055
A.4 -2088.648      40.983     0.047   0.514
A.1 -2088.696      40.946     0.000   0.000
A.3 -2088.921      41.001    -0.225   0.508
C.3 -2089.539      40.971    -0.843   2.181
C.1 -2089.840      40.962    -1.144   2.173
A.5 -2089.952      40.992    -1.256   0.524
C.2 -2090.632      40.997    -1.936   2.176
C.4 -2091.067      41.021    -2.372   2.175
C.5 -2091.520      41.040    -2.824   2.200

(I would round these to one digit if presented in my paper.)

To my mind, these results constitute tentative assurance that the initial difference (between A.1 and B.1) is not a mere chance fluctuation, and that operationalization B is probably the best. I’m thinking about characterizing operationalization B as “our best guess at the best operationalization based on a noisy signal”.

The table above is pretty large and detailed. Picking the plausibly best operationalization for this particular predictor is only one step in a long and onerous variable selection process, so if possible, I’d rather use something shorter and more succinct. Would it be alright if I just averaged the elpd_diffs and se_diffs over the five fits? Like so:

comparisons$Oper <- substr(rownames(comparisons), 1, 1)
means <- cbind(
  mean_elpd_diff = tapply(comparisons$elpd_diff, comparisons$Oper, mean),
  mean_se_diff = tapply(comparisons$se_diff, comparisons$Oper, mean))
print(means[order(means[,"mean_elpd_diff"], decreasing = TRUE),], digits = 3)

  mean_elpd_diff mean_se_diff
B          2.012        5.031
A         -0.138        0.412
C         -1.824        2.181

This would be rounded to one digit if presented in my paper. To my mind, the smaller table makes the same point as the bigger one while not boring my readers with as much detail. What say our wise forum gurus?


Yes, such small differences are negligible.

That helps to reduce the uncertainty due to Monte Carlo variability, but it doesn’t help to reduce the uncertainty due to having a finite data set used to approximate the unknown data distribution. The loo package reports the MCSE separately, as something like (from the loo package vignette):

Monte Carlo SE of elpd_loo is 0.2

Using an infinite number of iterations in one fit, or combining the results from an infinite number of fits, would make this uncertainty go to 0, but it would not reduce se_diff, as the amount of data is not changing.

It would be better to combine the posterior draws into single fits and then use loo to compute elpd_diffs and se_diffs, as changing the order of expectations and non-linear functions matters. However, this would not change the conclusion that the differences are negligible.



What about an approach along the following lines:

loo::loo_model_weights(list(A.1.loo, B.1.loo, C.1.loo), method = "pseudobma")

#Method: pseudo-BMA+ with Bayesian bootstrap
A.1 0.209 
B.1 0.663 
C.1 0.128 

^ This is based solely on the first of the five fits of each model, and it seems to be placing much more confidence in operationalization B than in A or C. If I could present a table like the above and be justified in proceeding with operationalization B, I could live with that. Also, if there’s a chance it might help, I am definitely willing to…

…as long as someone first tells me how it’s done. This is the first time I’ve heard of such a procedure.

For the record, each operationalization involves more than one covariate: there’s one grouping factor (with differing numbers of levels), and there are one to two population-level covariates which describe “generalizable” characteristics of the groups. It’s a bit like having “child” versus “family” as alternative grouping factors, along with “Child’s Math Grade” vs “Mean Math Grade in Family” as alternative generalizable characteristics.

Operationalization B has the fewest explicit model parameters out of the three alternatives, so if there’s any indication that it fits better (or at least as well) compared to the others, then one would want to use it in order to simplify the model and reduce computational complexity for subsequent model fitting.

There is no single winner. The weights indicate that a combination of all models might be better than any single model (although possibly not by much, and you might also get similar performance from any one of these models).

See the posterior package’s bind_draws() documentation (“Bind draws objects together”).

I’m not able to provide a more detailed example now, as I’m just about to go offline for a few weeks. If no one else joins this thread and this is still unsolved after a few weeks, you can ping me.


If this sounds complicated, the easier solution would be to just refit the different models, running the chains for five times more sampling iterations, and then proceed as normal. This will bring down the Monte Carlo error, but it will not change the fact that the data do not contain overwhelming evidence that one model is better than another. Your inference that there is no reason to prefer any model over B seems well founded, though if the models ultimately yield substantially different inferences (or if the inference you desire is specifically about which model fits better), you would do well to remain cautious.
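Concretely, something like this (fit is a hypothetical brmsfit; assuming it used the brms defaults of 1000 warmup + 1000 sampling iterations per chain, this quintuples the post-warmup draws):

```r
# Re-run the same model with 5x the default post-warmup iterations;
# update() re-samples without rewriting the model formula.
fit_long <- update(fit, warmup = 1000, iter = 1000 + 5000)
loo_long <- loo(fit_long)
```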


Computation is costly, so rather than discarding the already existing fits and starting over, I’d prefer to make full use of them and see if that’s enough to minimize uncertainty to the degree possible. I’ve already stacked the five posteriors together for each operationalization, so that I now have a single 20,000-row posterior matrix for each operationalization, rather than five 4,000-row posterior matrices per operationalization. But I don’t have stanfit objects to go with the 20,000-row posterior matrices, so I don’t know how to compute a single elpd_loo for each operationalization. How might I proceed?

If you don’t need moment matching, then form the log-likelihood matrix and pass that to loo. If you’re using moment matching, then you will need to go spelunking inside the stanfit to build a new one. This is doable but a bit involved, and isn’t something that I can give useful advice on off the top of my head, beyond noting that rstan::read_stan_csv or brms:::read_csv_as_stanfit might be useful guides to how a stanfit is constructed (these are functions that translate CSV files output by CmdStan into stanfits in memory in R).
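For the first route, a minimal sketch (fit object names are hypothetical; log_lik() is the brms method, the rest comes from the loo package):

```r
library(brms)
library(loo)

# Stack the pointwise log-likelihood matrices of the five fits row-wise:
ll_list <- lapply(1:5, function(i) log_lik(get(paste0("fitA.", i))))
ll_all  <- (rbind, ll_list)  # (5 * n_draws) x n_obs matrix

# relative_eff() needs a chain id for every draw; since each fit ran a
# single chain, treating each fit as its own "chain" is one option.
r_eff <- relative_eff(exp(ll_all),
                      chain_id = rep(seq_along(ll_list),
                                     each = nrow(ll_list[[1]])))

combined_loo <- loo(ll_all, r_eff = r_eff)
```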


I have additional loo data now, but could use some help making sense of what I see. To begin with, what exactly does se_diff reflect? The first table in the OP shows that for the first fit of model B (identified as B.1), se_diff is reported as ~5 in a comparison against the first fit of model A (identified as A.1). Doesn’t this imply that if I keep fitting both models with the same iter but different seeds, and loo_compare’ing the resultant individual pairs indefinitely, the individual pairwise elpd_diffs should center on a mean of 2.9 with an average deviation of approximately 5? Isn’t that pretty much the definition of “standard error”? Yet this is not what we observe empirically, at least not for the first five fits:

> (pairwise.diffs <- cbind(elpd_diff = sapply(
    lapply(1:5, function(x) loo_compare(get(paste0("A.",x,".loo")), get(paste0("B.",x,".loo")))),
    function(y) ifelse(startsWith(rownames(y)[1], "A"), yes = y[2,"elpd_diff"], no = -y[2,"elpd_diff"]))))

[1,] 2.9124594
[2,] 0.6537764
[3,] 0.9011423
[4,] 2.3322600
[5,] 3.9483423

> sd(pairwise.diffs)
[1] 1.382654 # Nowhere near 5.

Or is it the case that se_diff doesn’t actually refer to the expected SD of elpd_diffs in repeated fitting of the same two models to the same data, but rather to the expected SD of elpd_diffs in the unlikely scenario where we repeatedly collect a new, identically sized dataset from the same population and fit the same pair of models to each one?

But if the latter is true, then what is the source of the spread observed among the five pairwise comparisons above? On one hand, that spread is much smaller than the “official” se_diff of ~5. On the other, it seems to be much larger than any difference attributable to MCMC_SE alone, judging from the following: I bit the bullet and did as suggested by @jsocolar, fitting one gigantic model for each operationalization, with 100,000 post-warmup samples apiece (consuming a frightful amount of computational resources). All three loo-object-wise MCMC_SEs decreased to 0.1 (having been reported at 0.3 for the first 5 \times 3 models). Yet the pairwise se_diffs are virtually unchanged (again using A as the baseline of comparison):

        elpd_loo se_elpd_loo elpd_diff se_diff
B.huge -2086.812      40.991     2.525   5.029
A.huge -2089.337      41.025     0.000   0.000
C.huge -2090.590      41.015    -1.253   2.176

Was anything gained by this huge computational investment? se_diff has reportedly decreased by just 0.002, despite the MCMC_SEs having shrunk by two-thirds.

Or is the “gain” perhaps extant yet hidden from plain sight? If I could muster the computational resources to fit these huge models five times instead of just once, would the considerable variability observed across model-pair-wise elpd_diffs in the first code block now have disappeared?

So many questions aching for an answer…

Update: It took a long time and an astronomical number of CPU cycles, but I’ve now fit 5 more versions of each model, this time with 100k posterior simulations per model instead of the default 4k. It looks like my previous conjecture (based upon Aki’s input) was on the right track: While minimizing MCMC_SE will not reduce se_diff per se, it will reduce the variability of elpd_diffs across different fits:

> (pairwise.diffs <- cbind(elpd_diff = sapply(
    lapply(6:10, function(x) loo_compare(get(paste0("A.",x,".loo")), get(paste0("B.",x,".loo")))),
    function(y) ifelse(startsWith(rownames(y)[1], "A"), yes = y[2,"elpd_diff"], no = -y[2,"elpd_diff"]))))

[1,]  2.488917
[2,]  2.525023
[3,]  2.360932
[4,]  2.402948
[5,]  2.232506

> sd(pairwise.diffs) 
[1] 0.1152225 # MUCH less than with the smaller fits.

It looks more stable now. I’m reminded of what Aki said:

I guess that’s what we’re seeing here. The se_diffs stay the same on average, but with a 25-fold posterior sample size, the elpd_diffs cease to fluctuate all over the place.

I’m still curious what the “official” interpretation of se_diff is though…

My monologue continues after another day of rumination on all of the foregoing, as well as upon the concept of “standard error”:

In frequentism, the “standard error” of a statistic apparently refers to its expected standard deviation in the hypothetical scenario where we repeat the same data collection (same-sized sample from the same population) and calculations (e.g. fit/compare the same model(s)) an infinite number of times. For lack of better knowledge, this is how I’m also interpreting se_elpd_diff at the time of this writing.

For me, a source of significant confusion has been the fact that there are two different “layers of variability” at play. On one hand there’s se_elpd_diff itself, which I’m presently interpreting as described above (= expected variability of elpd_diff under repeated sampling and fitting). But then there’s also the variability of elpd_diff itself across different fits of the same pair of models to the same sample using different random seeds. In Post 2, Aki suggested that this is caused by the “uncertainty due to Monte Carlo variability” of each model-specific elpd_loo (henceforth labeled MCSE_ELPDLOO). And while I hopefully finally understand what se_elpd_diff means, I still don’t understand the relationship between MCSE_ELPDLOO and the variability of elpd_diff across fits of the same pair of models to the same sample.

The first five fits of my models (see Post 1) consisted of 4000 posterior draws each, and every MCSE_ELPDLOO was 0.3. Yet the observed variability across fits of the same pair of models to the same sample was much higher than anything I’d expect to arise from such small standard errors. Doesn’t the variance of a sum (or difference) equal the sum of the variances? Yet \sqrt{0.3^2 + 0.3^2} \approx 0.42, much less than the empirical variability of elpd_diff reported in Post 8. However, after reducing the MCSE_ELPDLOOs to 0.1 by drawing an expensive number of posterior samples, the empirical variability shrank much closer to zero (0.115), as well as closer to the expectation produced by the formula above (~0.14).

Summary of realizations so far: se_elpd_diff is the expected SD of elpd_diff under endlessly repeated fitting of the same pair of models to new samples of the same size from the same population, whereas the variability of elpd_diff observed for the same pair of models fit to the same sample with different random seeds is caused by MCSE_ELPDLOO through some complex and obscure mathematical mechanism. But as long as both models have been fit with enough posterior draws to effectively remove MCSE_ELPDLOO from the equation, elpd_diff will be reliable in the sense that, even for two similarly performing models, it will not be positive with one random seed and negative with another.

Hopefully the gurus will correct whatever I got wrong.

More monologue recapitulating new (hopefully non-erroneous) insights:

Only when the summed variables are uncorrelated. Things become more complicated when they’re not.

This crucial distinction hasn’t been emphasized enough in the materials I’ve seen: the fact that there are, in fact, two layers of uncertainty in elpd_diff, and se_diff only describes one of them. For the uninitiated, matters are made worse by the fact that the models under comparison might have perfectly good convergence, with no warnings or red flags whatsoever, and the second layer of uncertainty in elpd_diff (its variability by random seed) might go completely unnoticed.

In teaching myself how independent these two uncertainty layers are, I found the following mock-data simulation to be helpful: suppose we’re predicting the height of working-age adults in two similar countries, and we are especially interested in whether the two populations differ. Suppose that they actually do — by a tiny amount — and our sample size is not sufficient to identify the difference reliably. Here’s the mock data:

N <- 400 # set insufficient sample size
set.seed(340) # for reproducibility

# Create "true" populations:
finland <- data.frame(sexF = rep(c(0,1), times = c(2765962,2868214)))
denmark <- data.frame(sexF = rep(c(0,1), times = c(2953238,2988150)))
# Generate y from "true" data-generating process:
finland$height <- with(finland, rnorm(nrow(finland), mean = 180 - sexF*15, sd = 6))
denmark$height <- with(denmark, rnorm(nrow(denmark), mean = 181 - sexF*15, sd = 6))

# Collect an inadequate sample from the "true" populations.
Sample <- rbind(cbind(country = "0Finland", finland)[sample(nrow(finland), N/2),],
                cbind(country = "Denmark", denmark)[sample(nrow(denmark), N/2),])

We use brms to fit two simple linear models, one excluding the “weak but real” predictor, the other one including it. We deliberately use a number of simulations that is sufficient to achieve convergence but insufficient to eliminate MCSE_ELPDLOO:

m1_500.1 <- brm(height ~ sexF, warmup = 1000, iter = 1000 + 500,
                prior = prior("", class = Intercept) + prior("", class = b),
                seed = 1, data = Sample, chains = 1, refresh = 0, silent = 2)
(m2_500.1 <- update(m1_500.1, ~ . + country, newdata = Sample, seed = 1))

 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: height ~ sexF + country 
   Data: Sample (Number of observations: 400) 
  Draws: 1 chains, each with iter = 1500; warmup = 1000; thin = 1;
         total post-warmup draws = 500

Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        179.19      0.50   178.22   180.16 1.00      442      331
sexF             -14.37      0.56   -15.39   -13.30 1.00      531      409
countryDenmark     0.86      0.56    -0.26     1.94 1.00      527      397

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     5.94      0.20     5.56     6.34 1.00      574      438

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

When we calculate elpd_loos, note that MCSE is reported as small but NOT zero:

m1_500.1.elpd <- loo(m1_500.1); (m2_500.1.elpd <- loo(m2_500.1))

Computed from 500 by 400 log-likelihood matrix

         Estimate   SE
elpd_loo  -1281.6 13.6
p_loo         3.6  0.3
looic      2563.2 27.2
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

So, what can loo_compare() tell us about the relative predictive utility of these models?

loo_compare(m1_500.1.elpd, m2_500.1.elpd)[,1:4]

          elpd_diff  se_diff  elpd_loo se_elpd_loo
m2_500.1  0.0000000 0.000000 -1281.582    13.58275
m1_500.1 -0.4358444 1.483557 -1282.018    13.52415

The difference reported by PSIS-LOO is negligible, but the ranking is correct. How reliably is the correct ranking reproduced in new pairwise comparisons obtained using different seeds? (I added the NA rows to separate the individual pairs for readability)

for(newseed in 2:10){
  modnames <- paste0(c("m1_500.","m2_500."),newseed)
  assign(modnames[1], update(m1_500.1, seed = newseed)) 
  tmp.a <- loo(get(modnames[1])); attributes(tmp.a)$model_name <- modnames[1]
  assign(modnames[2], update(m2_500.1, seed = newseed)) 
  tmp.b <- loo(get(modnames[2])); attributes(tmp.b)$model_name <- modnames[2]
  assign(paste0(modnames[1],".elpd"), tmp.a); assign(paste0(modnames[2],".elpd"), tmp.b)}

(comparisons.nsim500 <-, lapply(1:10, function(x)
  rbind(as.matrix(loo_compare(get(paste0("m1_500.",x,".elpd")),
                              get(paste0("m2_500.",x,".elpd"))))[,1:4], " " = NA))))

            elpd_diff  se_diff  elpd_loo se_elpd_loo
m2_500.1   0.00000000 0.000000 -1281.582    13.58275
m1_500.1  -0.43584439 1.483557 -1282.018    13.52415
                   NA       NA        NA          NA
m1_500.2   0.00000000 0.000000 -1281.860    13.67334
m2_500.2  -0.01745390 1.398704 -1281.877    13.51334
                   NA       NA        NA          NA
m1_500.3   0.00000000 0.000000 -1281.899    13.62972
m2_500.3  -0.10765674 1.364659 -1282.007    13.62807
                   NA       NA        NA          NA
m1_500.4   0.00000000 0.000000 -1281.963    13.58248
m2_500.4  -0.05271535 1.398228 -1282.016    13.60424
                   NA       NA        NA          NA
m2_500.5   0.00000000 0.000000 -1281.810    13.62903
m1_500.5  -0.11476526 1.415954 -1281.925    13.52973
                   NA       NA        NA          NA
m2_500.6   0.00000000 0.000000 -1282.065    13.52476
m1_500.6  -0.07838442 1.461803 -1282.143    13.71377
                   NA       NA        NA          NA
m2_500.7   0.00000000 0.000000 -1281.787    13.58247
m1_500.7  -0.05540648 1.400752 -1281.842    13.63035
                   NA       NA        NA          NA
m2_500.8   0.00000000 0.000000 -1281.947    13.54562
m1_500.8  -0.30879311 1.442163 -1282.255    13.63053
                   NA       NA        NA          NA
m1_500.9   0.00000000 0.000000 -1281.862    13.57227
m2_500.9  -0.01564121 1.380227 -1281.878    13.60491
                   NA       NA        NA          NA
m1_500.10  0.00000000 0.000000 -1282.043    13.58460
m2_500.10 -0.01107914 1.398408 -1282.054    13.57277
                   NA       NA        NA          NA

Wow, it looks like the correctness of the initial comparison was pure luck! The new comparisons rank the models wrong in 5 of the 9 new cases, and the reported differences are even smaller now! This is fully concordant with the notion that such small differences are indistinguishable from noise. Notice, however, that the differences do tend to be larger whenever the correct model is winning. So while the comparisons are hopelessly noisy, a tiny bit of signal can nonetheless be discerned — at least in an artificial setting where we know what “signal” to look for.

But given our knowledge that m2 is, indeed, the correct model, is it possible to reproduce the correct ranking with any reliability without collecting more data? Let’s see what happens when we refit the two models to the same inadequate sample 10 more times, now with 400 times more posterior draws.

for(Seed in 1:10){
  modnames <- paste0(c("m1_200k.","m2_200k."), Seed)
  assign(modnames[1], update(m1_500.1, seed = Seed,
                             warmup = 1000, iter = 1000 + 200000))
  tmp.a <- loo(get(modnames[1])); attributes(tmp.a)$model_name <- modnames[1]
  assign(modnames[2], update(m2_500.1, seed = Seed,
                             warmup = 1000, iter = 1000 + 200000))
  tmp.b <- loo(get(modnames[2])); attributes(tmp.b)$model_name <- modnames[2]
  assign(paste0(modnames[1],".elpd"), tmp.a); assign(paste0(modnames[2],".elpd"), tmp.b)
  rm(list = modnames)}


Computed from 200000 by 400 log-likelihood matrix

         Estimate   SE
elpd_loo  -1281.9 13.6
p_loo         2.9  0.3
looic      2563.9 27.2
Monte Carlo SE of elpd_loo is 0.0.


At the very least, MCSE_ELPDLOO is now estimated at zero. How do pairwise comparisons look between these new, gigantic fits?

(comparisons.nsim200k <-, lapply(1:10, function(x)
  rbind(as.matrix(loo_compare(get(paste0("m1_200k.",x,".elpd")),
                              get(paste0("m2_200k.",x,".elpd"))))[,1:4], " " = NA))))

             elpd_diff  se_diff  elpd_loo se_elpd_loo
m2_200k.1   0.00000000 0.000000 -1281.895    13.55837
m1_200k.1  -0.04682957 1.447243 -1281.942    13.60188
                    NA       NA        NA          NA
m2_200k.2   0.00000000 0.000000 -1281.913    13.56032
m1_200k.2  -0.02713931 1.441532 -1281.941    13.60715
                    NA       NA        NA          NA
m2_200k.3   0.00000000 0.000000 -1281.906    13.55734
m1_200k.3  -0.03195478 1.442552 -1281.938    13.60323
                    NA       NA        NA          NA
m2_200k.4   0.00000000 0.000000 -1281.905    13.56055
m1_200k.4  -0.03989458 1.440435 -1281.944    13.60271
                    NA       NA        NA          NA
m2_200k.5   0.00000000 0.000000 -1281.905    13.55446
m1_200k.5  -0.04431128 1.440652 -1281.949    13.60623
                    NA       NA        NA          NA
m2_200k.6   0.00000000 0.000000 -1281.920    13.55854
m1_200k.6  -0.01312898 1.435604 -1281.933    13.60100
                    NA       NA        NA          NA
m2_200k.7   0.00000000 0.000000 -1281.894    13.55909
m1_200k.7  -0.04446383 1.444024 -1281.939    13.60896
                    NA       NA        NA          NA
m2_200k.8   0.00000000 0.000000 -1281.913    13.55801
m1_200k.8  -0.03009314 1.443213 -1281.943    13.60354
                    NA       NA        NA          NA
m2_200k.9   0.00000000 0.000000 -1281.909    13.56150
m1_200k.9  -0.03680435 1.442428 -1281.945    13.60413
                    NA       NA        NA          NA
m2_200k.10  0.00000000 0.000000 -1281.911    13.56087
m1_200k.10 -0.03440431 1.446022 -1281.945    13.60277
                    NA       NA        NA          NA

Elpd_diffs are still reported to be negligible. In the aggregate, they’re even smaller than in the first set of comparisons. The average of the reported se_diffs doesn’t improve either: in fact it gets slightly worse, the new average being 1.442 as opposed to 1.414 earlier (EDIT: earlier drafts had ridiculous miscalculations here). So, as Aki says, increasing posterior draws won’t help reduce se_diff.

Yet something remarkable was indeed gained: despite the infinitesimal magnitude of the differences, the models are now correctly ranked every single time. And this happens despite the fact that the se_diffs actually increase. What a revelation! Even to the mathematically challenged, this illustrates beyond question that it is indeed better to present a single comparison between two huge fits, rather than multiple comparisons between repeated small fits.

It sure would be nice to have a formula for estimating the degree to which elpd_diff is bound to vary by seed as a result of the respective MC_SEs of the loo objects. With complex multilevel models, obtaining posterior simulations in the hundreds of thousands is extremely time-intensive, and the resulting matrices of parameter values or pointwise LLs are so huge that running out of RAM is a frequent problem.
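As a crude first stab at such a formula (my own assumption, nothing official): if the two models’ Monte Carlo errors were independent and roughly normal, the seed-to-seed SD of elpd_diff should be approximately the quadrature sum of the two MCSEs, which can be pulled from the loo objects with mcse_loo():

```r
library(loo)

# A.loo and B.loo are hypothetical loo objects for the two models.
# mcse_loo() returns the Monte Carlo SE of elpd_loo shown by print().
mcse_a <- mcse_loo(A.loo)
mcse_b <- mcse_loo(B.loo)

# Naive guess at the across-seed SD of elpd_diff. As seen earlier in the
# thread, this can be quite optimistic when the MCSE estimates themselves
# are unreliable.
sqrt(mcse_a^2 + mcse_b^2)
```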


More progress:

It seems that this is indeed what it means. Having now plodded laboriously through the Sivula, Magnusson, Matamoros & Vehtari preprint, se_diff seems equivalent to what the authors denote as \widehat{\text{SE}}_{\text{LOO}}(\text{M}_a,\text{M}_b|y), i.e. an estimator of the elpd_diff variability that would be expected in repeated cycles of re-sampling, re-fitting, and re-comparison.
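If I’m reading the loo source correctly (and I may well not be), the estimator is computed from the pointwise elpd contributions, roughly like this sketch (loo_a and loo_b are hypothetical loo objects for two models fit to the same n observations):

```r
# Pointwise elpd difference for each observation:
pw_diff <- loo_a$pointwise[, "elpd_loo"] - loo_b$pointwise[, "elpd_loo"]

elpd_diff <- sum(pw_diff)
se_diff   <- sqrt(length(pw_diff)) * sd(pw_diff)  # sqrt(n) * SD of pointwise diffs
```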

Due to all the fitting undertaken hitherto, I now have a total of 10 brms fits of each operationalization, constituting a total of 520,000 posterior draws per operationalization. For each operationalization, I stacked together the 10 log-likelihood matrices using rbind(log_lik(fit1), log_lik(fit2), ...), then formed r_eff objects by calling loo::relative_eff(exp(BigLLMatrix), chain_id = rep(1, times = 520000)) on the result. The latter was done to eliminate loo’s complaints about not knowing relative effective sample sizes. And rep(1) seemed appropriate because due to within-chain parallelization, all fits consist of one chain only. Could someone confirm whether this procedure sounds correct so far, please?

This procedure failed in the end, due to prohibitive memory requirements: even with 76 GB of RAM (the maximum my HPC provider allows), the R session crashed due to “exceeding the memory limit”. But it worked — barely — when I used combined fits of “only” 300,000 draws. The resultant loo objects look like this:

Computed from 300000 by 2217 log-likelihood matrix

         Estimate   SE
elpd_loo  -2089.3 41.0
p_loo       246.1  8.0
looic      4178.5 82.0
Monte Carlo SE of elpd_loo is 0.0. # Finally!

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     2210  99.7%   27741     
 (0.5, 0.7]   (ok)          7   0.3%   5526      
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

The resulting loo comparisons look like this (again using Operationalization A as baseline):

        elpd_loo se_elpd_loo elpd_diff se_diff
B.300k -2086.857      40.993     2.415   5.021
A.300k -2089.272      41.017     0.000   0.000
C.300k -2090.761      41.023    -1.489   2.169

… while the reported stacking weights look like this:

Method: stacking
A.300k 0.269 
B.300k 0.593 
C.300k 0.139 

While these results differ very little from those seen above, it would appear that (a) we can now finally trust the elpd_diffs to be completely stable for this particular sample, and (b) we finally have a concise way to present them. As a bonus, we have hopefully managed to suss out what exactly it is that se_diff represents.

If no errors are pointed out, the thread is almost ready to be declared solved. But I would still love to hear an expert opinion on whether there exists a general rule of thumb for guesstimating how much a pairwise loo comparison is likely to vary across seeds, judging only from the information available in a single pair of loo objects (such as MCSE_ELPDLOO, p_loo, sample size, posterior dimensions…). Limiting oneself to the default 4,000 iterations saves enormous amounts of time and is often sufficient for convergence, but we’ve now seen that it can be completely insufficient to stabilize elpd_diff across seeds. So a heuristic for estimating when exactly you need more draws for a stable elpd_diff would seem quite desirable indeed.

That is the MCSE estimate. For commonly used MCMC sample sizes, and given khats < 0.7, I’ve always observed negligible MCSE compared to the se_diffs. This holds also for your experiments. Thus, my general rule of thumb is that given 4000 posterior draws and khats < 0.7, there is no need for more draws (as the rule is so simple, we haven’t discussed it much in the documentation).


So, to use the notation of Sivula et al, if loo() estimates the MCSE of both \widehat{\text{elpd}}_\text{LOO}(\text{M}_a~|~y) and \widehat{\text{elpd}}_\text{LOO}(\text{M}_b~|~y) to be 0.4, does it mean I should expect \widehat{\text{elpd}}_\text{LOO}(\text{M}_a, \text{M}_b~|~y) to vary by an average of (0.4^2 + 0.4^2)^{1/2} \approx 0.57 across seeds?

My own experience has been different. Here’s a recent example from real data:

> malli01_CompVerbPolysyl_loo

Computed from 4000 by 2217 log-likelihood matrix

         Estimate   SE
elpd_loo  -2091.5 41.0
p_loo       272.1  8.4
looic      4183.0 82.1
Monte Carlo SE of elpd_loo is 0.4.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     2203  99.4%   104       
 (0.5, 0.7]   (ok)         14   0.6%   329       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

> malli01_CompVerbSylLog2_loo

Computed from 4000 by 2217 log-likelihood matrix

         Estimate   SE
elpd_loo  -2089.5 40.9
p_loo       269.9  8.3
looic      4179.0 81.8
Monte Carlo SE of elpd_loo is 0.4.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     2196  99.1%   50        
 (0.5, 0.7]   (ok)         21   0.9%   456       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

Here’s the comparison:

                        elpd_diff se_diff
malli01_CompVerbSylLog2  0.0       0.0   
malli01_CompVerbPolysyl -2.0       1.0   

Warning message:
Not all models have the same y variable. ('yhash' attributes do not match) 

(The hash warning appears due to the fact that the models were fit using different versions of R and brms).

I found this ranking unexpected, so refit both models 10 times. Here’s the last comparison:

> loo_compare(malli01_CompVerbPolysyl11_loo, malli01_CompVerbSylLog2k_loo)
                          elpd_diff se_diff
malli01_CompVerbPolysyl11  0.0       0.0   
malli01_CompVerbSylLog2k  -2.7       1.1   

Warning message:
Not all models have the same y variable. ('yhash' attributes do not match) 

So basically, the difference changes by 4.7 between these fits. If the above calculation of Monte Carlo SE for the difference is correct, that’s more than 8 standard errors!

Something seems off.

Probably in a nice case.

Thanks for sharing this. That MCSE estimate is already unusually big. Small khats together with small Min. n_eff values indicate slow mixing of the Markov chains. With n_eff’s that small for the pointwise elpds, I would expect the MCSE estimate to be too optimistic (this can happen with finite data even if the estimator used is good asymptotically). The issue is further complicated because the interpretation of MCSE assumes normally distributed error, while the true error is skewed and leptokurtic. It seems that in your case there is strong non-normality. This can happen if there are a small number of very low probability predictions. If you can share those 4000 by 2217 log-likelihood matrices from the two models, I can look further.

Although there is a bigger variation in the elpd_diff than expected from MCSE, the conclusions from both of these runs are the same, that is, there is no detectable difference between the models.


I don’t think there’s any mixing of chains. Fitting was done with threading and within-chain parallelization with chains = 1 to minimize computation time.

They’re 130 megs together, but hopefully this temp file hosting link will work:

I’m using the word mixing here in the usual Markov chain sense, which holds also for a single chain. Slow mixing is observed, for example, as a high autocorrelation time estimate, which is used to compute the effective sample size (n_eff in the loo diagnostic output, though we now prefer the acronym ESS).

It seems I was able to download it, but I don’t know when I’ll have time to investigate it.


Got it, thanks!

NP, I’ll wait.

I have new questions that are sufficiently related to the thread that I thought it best not to start a new one. (admins feel free to move this to a new thread if you want)

I’m presenting lots of model comparisons, including lots of small elpd_loo differences. And I can’t get rid of the Monte Carlo error of the comparisons – the models are too complex and there are too many of them for me to eliminate MC SE by drawing 300,000 simulations from every one.

Since I know that the elpd differences have significant Monte Carlo variability, should I perhaps present the estimated MC SE in addition to se_diff when presenting my comparisons? And if so, would it be better to show the MCSE that is reported in each individual loo object, or to present the ad hoc estimated MCSE of the difference calculated using the formula (\text{MCSE}_A^2 + \text{MCSE}_B^2)^{1/2}?

Also: when comparing LOOICs instead of elpd_loos, is the MC SE of each model-wise LOOIC simply twice the MC SE of elpd_loo?

I hope someone can help.