Mixis and E_loo

hello

E_loo uses the weights generated by loo to estimate the leave-one-out mean (and quantiles). However, “A small Pareto k estimate is necessary, but not sufficient, for E_loo() to give reliable estimates.” Mixis is more robust for estimating elpd. Can the log-weights generated by mixis be used to improve the estimate of the leave-one-out mean?

Below is a simple Gaussian example with synthetic data including one clear outlier. Writing eloov for the vector of mean values obtained from E_loo, I compare the true (brute force) expected value at the outlier, eloov at the outlier, and a proposed way of using mixis to give another estimate, which is better at the outlier in this example. Is this a valid method for this purpose? Is there a better way? Does it perform better than eloov in general?
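For context, both E_loo and the proposal below are self-normalised importance sampling estimators of a leave-one-out expectation. In rough notation (mine, not from the loo documentation), for a function h such as the predictive mean:

```latex
\mathbb{E}\left[h(\theta)\mid y_{-i}\right]
  \approx \frac{\sum_{s=1}^{S} w_i^{(s)}\, h(\theta^{(s)})}
               {\sum_{s=1}^{S} w_i^{(s)}},
\qquad
w_i^{(s)} \propto \frac{p(\theta^{(s)}\mid y_{-i})}{q(\theta^{(s)})}.
```

For E_loo the proposal q is the full posterior, so the raw weights are proportional to 1/p(y_i | theta^(s)) (then smoothed by PSIS); the question is whether the mixture draws and log-weights from mixis can play the roles of q and the w_i instead.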

thanks

Greg

R

library(brms)
library(rstan)
rstan_options(auto_write = TRUE)
library(brms)
library(cmdstanr)
options(mc.cores=2, brms.normalize=TRUE,brms.backend="cmdstanr")
library(ggplot2)
library(loo)
library("matrixStats")

set.seed(1)
N=400
x<-runif(N)
f<-function(t) 1.8+sin(12*t)*exp(-3*t)
y<-f(x)+0.3*rnorm(N)
y[1]=-1


dat<-data.frame(x,y)
bf1<-bf(y~s(x))
mod1<-brm(bf1,data=dat,family="gaussian",adapt_delta=0.9)

LL<-log_lik(mod1)
eLL<-exp(LL)

cid<-c(rep(1,1000),rep(2,1000),rep(3,1000),rep(4,1000))
reff<-relative_eff(eLL,chain_id=cid)

loo1<-loo(LL,r_eff=reff,save_psis=TRUE)
loo1
loopt1<-loo1$pointwise
head(loopt1)

pospred<-posterior_epred(mod1)

psis_object<-loo1$psis_object
log_ratios<- -1*LL
eloo<-E_loo(pospred,psis_object,type="mean",log_ratios=log_ratios)
eloov<-eloo$value

#brute force estimate

datx1<-dat[-1,]
modx1<-brm(bf1,data=datx1,family="gaussian",adapt_delta=0.9)
newd<-data.frame(x=x[1])

ppd<-posterior_predict(modx1,newdata=newd)
mean(ppd)
#1.787043
eloov[1]
#1.795551

#mixis

cat(mod1$model,file="mod1.stan")
#amend the model block with the mixis code lines
#amend the generated quantities block to generate mu and log_lik
#save the amended file as mod1mix.stan

The two amended blocks now read:

model {
  // likelihood including constants
  if (!prior_only) {
    // initialize linear predictor term
    vector[N] log_lik = rep_vector(0.0, N);
    vector[N] mu = rep_vector(0.0, N);
    mu += Intercept + Xs * bs + Zs_1_1 * s_1_1;
    for (n in 1:N) {
      log_lik[n] = normal_lpdf(Y[n] | mu[n], sigma);
    }
    target += sum(log_lik);
    target += log_sum_exp(-log_lik);
  }
  // priors including constants
  target += lprior;
  target += std_normal_lpdf(zs_1_1);
}
generated quantities {
  // actual population-level intercept
  real b_Intercept = Intercept;
  vector[N] log_lik = rep_vector(0.0, N);
  vector[N] mu = rep_vector(0.0, N);
  mu += Intercept + Xs * bs + Zs_1_1 * s_1_1;
  for (n in 1:N) {
    log_lik[n] = normal_lpdf(Y[n] | mu[n], sigma);
  }
}
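To unpack the extra increment in the model block: adding log_sum_exp(-log_lik) to the log posterior means the sampler targets (up to normalisation) the posterior multiplied by the sum of inverse pointwise likelihoods. Assuming conditionally independent observations, p(theta | y_{-i}) is proportional to p(theta | y) / p(y_i | theta), so this target is a mixture of the N leave-one-out posteriors:

```latex
q_{\mathrm{mix}}(\theta)
  \;\propto\; p(\theta\mid y)\sum_{i=1}^{N}\frac{1}{p(y_i\mid\theta)}
  \;\propto\; \sum_{i=1}^{N} p(y_{-i})\, p(\theta\mid y_{-i}).
```

The mixture components are weighted by p(y_{-i}), i.e. observations that are hard to predict get relatively more weight.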

#make Cmdstanr model and run it

mod1.mix<-cmdstan_model("mod1mix.stan")
class(mod1.mix)
mod1.mix$print()

stand<-standata(mod1)

mix1<-mod1.mix$sample(data=stand,chains=4,parallel_chains=2,refresh=100,
iter_warmup=1000,iter_sampling=1000,save_warmup=TRUE,max_treedepth=13,adapt_delta=0.9)
mix1$save_object(file="mix1.RDS")
#mix1<-readRDS("mix1.RDS")

#elpd from mixis

draws<-mix1$draws()
drawmat<-as_draws_matrix(draws)
dim(drawmat)

grep("log_lik",colnames(drawmat))[c(1,N)]
dwLL<-drawmat[,24:423]
dim(dwLL)
colnames(dwLL)[c(1,N)]

l_common_mix <- rowLogSumExps(-dwLL)
log_weights <- -dwLL - l_common_mix
elpd_mixis <- logSumExp(-l_common_mix) - rowLogSumExps(t(log_weights))
save(elpd_mixis,file="elpd_mixis.rds")
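In symbols (my notation), with theta^(s) the S mixture draws and c^(s) = exp(-l_common_mix[s]), the code above computes, for each observation i,

```latex
\widehat{\mathrm{elpd}}_{\mathrm{mixis},\,i}
  = \log\sum_{s=1}^{S} c^{(s)}
  \;-\; \log\sum_{s=1}^{S}\frac{c^{(s)}}{p(y_i\mid\theta^{(s)})},
\qquad
c^{(s)} = \left(\sum_{j=1}^{N}\frac{1}{p(y_j\mid\theta^{(s)})}\right)^{-1}.
```

This is a ratio estimate of log p(y_i | y_{-i}), since c^(s)/p(y_i | theta^(s)) is, up to a constant, the importance ratio of the leave-one-out posterior to the mixture.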

#compare elpd_loo
elpd_loo<-loopt1[,1]
plot(elpd_mixis,elpd_loo)

#proposed method for E_mix

lw<-log_weights
dim(lw)
#make normalised weights
lse<-apply(lw,2,logSumExp)
lseo<-outer(rep(1,4000),lse)
lwn<-lw-lseo
w<-exp(lwn)
dim(w)
range(w)
apply(w,2,sum)[1:10]
dwmu<-drawmat[,424:823]
dim(dwmu)
colnames(dwmu)[c(1,N)]

emix<-apply(dwmu*w,2,sum)
plot(emix,eloov)
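In the same spirit, the proposed estimator is plain self-normalised importance sampling of the leave-one-out mean with the mixture as proposal (again my notation, with c^(s) = exp(-l_common_mix[s])):

```latex
\widehat{E}\left[\mu_i \mid y_{-i}\right]
  = \sum_{s=1}^{S} \tilde{w}_i^{(s)}\,\mu_i^{(s)},
\qquad
\tilde{w}_i^{(s)}
  = \frac{c^{(s)}/p(y_i\mid\theta^{(s)})}
         {\sum_{s'=1}^{S} c^{(s')}/p(y_i\mid\theta^{(s')})}.
```

Since c^(s)/p(y_i | theta^(s)) is proportional to p(theta^(s) | y_{-i}) / q_mix(theta^(s)), this has the standard self-normalised IS form, which is the basis of the validity question being asked.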

mean(ppd)
#1.787043
eloov[1]
#1.795551
emix[1]
#1.783195

Thanks for bringing this up; it turns out we had forgotten to update the documentation. The text continues with

Additional diagnostic checks for gauging the reliability of the estimates are in development and will be added in a future release.

and these additional diagnostics have been added!

The current version computes an h-specific Pareto k (see the Pareto Smoothed Importance Sampling paper), and when running E_loo(), if you don’t get warnings, then that h-specific Pareto k is also small enough.

Yes.

Using Mixture-IS is valid and can be expected to perform better, but it can also fail. Unfortunately, I don’t have time to check your code in detail right now.

Moment matching might work, too.

I made a PR

thanks very much for replying.

The key bits of the code are at the end. The idea is to take the log_weights generated during mixis, normalise and exponentiate them, and apply the normalised weights to the mu values from the generated quantities block included when running the mixis. It improved the estimate in this example - hopefully someone has tested it out more generally, or could do? This is analogous to how E_loo generates the mean by applying the loo weights, but the mixis weights come from a different context: they are the probability that a value sampled from the mixture actually arose from leaving out the point in question.

thanks

Greg

as far as I can tell, the current version of loo on CRAN is 2.8.0

this does not appear to show diagnostics for E_loo

But I’m more confident now of the above method to estimate the leave-one-out mu values from mixis, because in my actual data with 6773 rows and 51 covariates including nested factor terms and a beta-binomial model, the E_loo and mixis predicted values as above are extremely similar even though the pointwise elpd values have some differences.

thanks

Greg


I think Aki was referring to the Pareto k diagnostic that is included in the list returned by E_loo. There are more details in the Value section at Compute weighted expectations — E_loo • loo

I thought he was referring to this sentence “Additional diagnostic
checks for gauging the reliability of the estimates are in development
and will be added in a future release.”

I understood from his post that such additional checks have indeed been
implemented, so I’d like to try them. I just don’t know where to look
for them!

sorry if I’m missing something.

Greg

Oh, yeah I see why that’s confusing, sorry. The additional checks ended up being updating how the Pareto k diagnostic is computed to use more information when available. If the log_ratios argument is also specified then E_loo can compute a Pareto k diagnostic specific to the particular function that E_loo is using (the function specified via the type argument), which is better than the k diagnostic computed otherwise. We had forgotten to remove that sentence from the documentation after adding that.


ok thanks Jonah so it’s still the Pareto-k diagnostic, but computed
using the log_ratios.

In that case, my actual example (not the simple synthetic above) is one
in which loo, using the log_ratios, has some bad and a few very bad
points, which is why I ended up with MixIS at Aki’s suggestion. The main
motive was for model comparison. If elpd_mixis is more accurate, that
should mean the comparisons (analogous to loo_compare and with se
computed analogously as (N*V)^0.5 where V is the sample variance of the
elpd_mixis pointwise differences between the two models) are more valid.
But I also wanted a MixIS version of E_loo. The suggestion above is for
a MixIS version when type=“mean”. It makes sense (to me) and appears to
work, giving very similar results to E_loo for my actual data, even
though elpd differs a bit.
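For concreteness, the comparison I have in mind mirrors loo_compare: with d_i the pointwise differences between the elpd_mixis values of the two models,

```latex
\widehat{\mathrm{elpd}}_{\mathrm{diff}} = \sum_{i=1}^{N} d_i,
\qquad
\mathrm{se}\left(\widehat{\mathrm{elpd}}_{\mathrm{diff}}\right) = \sqrt{N\,V},
```

where V is the sample variance of the d_i.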

thanks

Greg

@gregd , I’ve been busy and on vacation, but now have had time to recheck your code, and it seems to be fine.

Thanks for the feedback on the documentation and vignettes. We know they can be improved, and it also increases motivation knowing someone is using the methods. I’ll add more examples to my todo list. For mixis, we also need to do some additional research on the diagnostics, as Pareto-k may be too pessimistic: in mixis the weights are bounded by construction.

In your code, you could use subset_draws(..., variable=...) instead of numeric indexing.

thanks Aki

that’s a relief (re code)!

thanks

Greg

thanks again Aki, Jonah

If anyone has time or interest to look, here is the preprint of the article in which I applied this technique, and used MixIS for model comparison and in the identification of possible outliers. That last point is a bit experimental, but the purpose was not a rigorous declaration that a particular point IS AN OUTLIER, but rather a signal to investigate it for unmodelled issues.

All code is in the Additional Files. Model comparisons are in Table 2. MixIS vs LOO is in Fig 8. Outliers are in Fig 7.

thanks again

Greg


Thanks for sharing. I took a quick look and the preprint looks good.

thanks very much Aki. In fact I can improve on Fig 2 and skip the reliance on permuted samples, as the mean random effect of REG is itself a distribution rather than a constant.
Fig2x.pdf (5.5 KB)
The correction has almost no effect on Fig 3 and leaves the top and bottom 10 unchanged. The rest of the argument is not affected.

Hello!

Thanks for highlighting this example. The procedure you’ve described is mathematically sound and serves as a robust alternative, particularly in high-dimensional settings or when dealing with heavy outliers. I also noticed you’re correctly using logSumExp / rowLogSumExps to handle the log-weights—this is crucial for maintaining numerical stability when re-exponentiating.

In general, the variance of a self-normalized IS estimator can be decomposed in a way similar to Formula S.7 in the MixIS supplement. The final term in that decomposition represents the variance of the weights of the IS distribution itself, which remains independent of the functional being estimated.
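For orientation (this is the standard delta-method result for self-normalised importance sampling, not Formula S.7 itself): for an estimator of mu = E_p[h] with proposal q and unnormalised weights w = p/q based on S draws,

```latex
\operatorname{Var}\left(\hat{\mu}_{\mathrm{SNIS}}\right)
  \;\approx\; \frac{1}{S}\,
  \frac{\mathbb{E}_q\left[w(\theta)^2\,\left(h(\theta)-\mu\right)^2\right]}
       {\mathbb{E}_q\left[w(\theta)\right]^2}.
```

Expanding the square separates contributions from the variance of the weights, the variance of the functional, and their cross term, which is the decomposition being referred to.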

In cases where there is a risk of infinite variance in the weights of the full posterior distribution—common in outlier-heavy or high-dimensional scenarios—MixIS offers a stable alternative which has been shown to work very well in a wide variety of real scenarios. While PSIS is excellent for stabilizing weights to ensure finite variance, there are cases (as we discussed in the paper) where stabilization alone may not be sufficient for reliable estimation. We provide some theoretical justification for this in the paper by highlighting the difference between finite sample variance and finite asymptotic variance.

That said, the method does involve certain trade-offs:

  • Computational Overhead: It requires a second sampling run. For extremely expensive models, this additional cost is a meaningful consideration for the user.

  • Mixture Complexity: Generating samples from a mixture can be a concern. In our experiments with regression scenarios, the significant overlap in leave-one-out posteriors made this not an issue, though it could require more care in other types of models.

  • Variance Balancing: When the posterior weight variance is already finite, the “best” estimator is the one that optimally balances the variance of the IS weights, the variance of the functional, and their correlation. It’s important to look at the full decomposition (like Formula S.7 for the ELPD) rather than just the weight variance in isolation. Hence this method cannot be seen as a silver bullet which is always better than eloov.

NB: Formula S.7 of the supplement actually approximates the relative asymptotic variance, rather than the asymptotic variance. This was due to a delta-method approximation in computing the variance of the terms of the ELPD. An analogous form can be obtained for the asymptotic variance.

hello

Thanks for the encouraging first para - the only one I understood! But anyway, do you have a comment on the point I raised in replying to Jonah on Nov 25 8:20: If elpd_mixis is more accurate than elpd_loo, that should mean the comparisons (analogous to loo_compare and with se computed analogously as (N*V)^0.5 where V is the sample variance of the elpd_mixis pointwise differences between the two models) are more valid.

I hope this is true! But I’d appreciate any comment on how model comparisons should be done with elpd_mixis.

thanks

Greg

Hello,

I am not super familiar with the RStan package, at least not the functions you are mentioning. So if you want my point of view regarding your questions, I would need you to translate them from functions and acronyms into mathematical objects; in that case I can comment with more confidence, as currently I am not sure I understood your question.

Luca

hello Luca,

model comparison using loo is described in Vehtari (2016)
https://arxiv.org/pdf/1507.04544
in Section 5.2
My question is whether this would be an appropriate method when using the elpd estimates from MixIS. (I hope so!)

Example:

library(brms)
library(rstan)
rstan_options(auto_write = TRUE)
library(brms)
library(cmdstanr)
options(mc.cores=2, brms.normalize=TRUE,brms.backend="cmdstanr")
library(loo)
library("matrixStats")

set.seed(1)
N=400
x1<-runif(N)
x2<-runif(N)
f<-function(t) 1.8+sin(12*t)*exp(-3*t)
g<-function(t) -0.2+t^2
y<-f(x1)+g(x2)+0.3*rnorm(N)
y[1:5]=y[1:5]+2*runif(5)

plot(x1,y)
points(x1[1:5],y[1:5],pch=16,col=2)

dat<-data.frame(x1,x2,y)
bf1<-bf(y~s(x1))
bf2<-bf(y~s(x1)+s(x2))
mod1<-brm(bf1,data=dat,family="gaussian",adapt_delta=0.95)
mod2<-brm(bf2,data=dat,family="gaussian",adapt_delta=0.95)
bayes_R2(mod1)
bayes_R2(mod2)

#as expected, mod2 seems better. To assess the significance of the improvement:

LL1<-log_lik(mod1)
LL2<-log_lik(mod2)

eLL1<-exp(LL1)
eLL2<-exp(LL2)

cid<-c(rep(1,1000),rep(2,1000),rep(3,1000),rep(4,1000))
reff1<-relative_eff(eLL1,chain_id=cid)
reff2<-relative_eff(eLL2,chain_id=cid)

loo1<-loo(LL1,r_eff=reff1,save_psis=TRUE)
loo2<-loo(LL2,r_eff=reff2,save_psis=TRUE)

loo_compare(loo1,loo2)

#       elpd_diff se_diff
#model2    0.0       0.0 
#model1 -112.2      11.9 

#to see that se_diff is being calculated as in the Vehtari paper:

loopt1<-loo1$pointwise
head(loopt1)
loopt2<-loo2$pointwise

diff<-loopt1[,1]-loopt2[,1]
V<-var(diff)
sumdif<-sum(diff)
sumdif
se<-(N*V)^0.5
se
print(loo_compare(loo1,loo2),digits=4)
#The Vehtari paper is suggesting that the significance of model comparison can be assessed by sumdif with standard error se 

#now using MixIS

cat(mod1$model,file="mod1.stan")
#amend the model block with the mixis code lines
#amend the generated quantities block to generate mu and log_lik
#save the amended file as mod1mix.stan

cat(mod2$model,file="mod2.stan")
#amend likewise and save as mod2mix.stan

mod1.mix<-cmdstan_model("mod1mix.stan")
class(mod1.mix)
mod1.mix$print()

stand1<-standata(mod1)

mix1<-mod1.mix$sample(data=stand1,chains=4,parallel_chains=2,refresh=100,
iter_warmup=1000,iter_sampling=1000,save_warmup=TRUE,max_treedepth=13,adapt_delta=0.95)
mix1$save_object(file="mix1.RDS")
#mix1<-readRDS("mix1.RDS")

#elpd from mixis

draws1<-mix1$draws()
drawmat1<-as_draws_matrix(draws1)
dim(drawmat1)

grep("log_lik",colnames(drawmat1))[c(1,N)]
dwLL1<-drawmat1[,24:423]
dim(dwLL1)
colnames(dwLL1)[c(1,N)]

l_common_mix1 <- rowLogSumExps(-dwLL1)
log_weights1 <- -dwLL1 - l_common_mix1
elpd_mixis1 <- logSumExp(-l_common_mix1) - rowLogSumExps(t(log_weights1))
save(elpd_mixis1,file="elpd_mixis1.rds")

plot(elpd_mixis1,loopt1[,1])
range(elpd_mixis1-loopt1[,1])

#repeat with mod2

mod2.mix<-cmdstan_model("mod2mix.stan")
stand2<-standata(mod2)

mix2<-mod2.mix$sample(data=stand2,chains=4,parallel_chains=2,refresh=100,
iter_warmup=1000,iter_sampling=1000,save_warmup=TRUE,max_treedepth=13,adapt_delta=0.95)
mix2$save_object(file="mix2.RDS")
#mix2<-readRDS("mix2.RDS")

draws2<-mix2$draws()
drawmat2<-as_draws_matrix(draws2)
dim(drawmat2)

grep("log_lik",colnames(drawmat2))[c(1,N)]
dwLL2<-drawmat2[,42:441]
dim(dwLL2)
colnames(dwLL2)[c(1,N)]

l_common_mix2 <- rowLogSumExps(-dwLL2)
log_weights2 <- -dwLL2 - l_common_mix2
elpd_mixis2 <- logSumExp(-l_common_mix2) - rowLogSumExps(t(log_weights2))
save(elpd_mixis2,file="elpd_mixis2.rds")

plot(elpd_mixis2,loopt2[,1])
range(elpd_mixis2-loopt2[,1])

mixdif<-elpd_mixis1-elpd_mixis2
sum(mixdif)
sum(diff)

mixV<-var(mixdif)
mixV
V
se<-(N*V)^0.5
mixse<-(N*mixV)^0.5
se
mixse

#in this example, MixIS and loo are giving very similar results, but that won’t be true in general.
#Question: is it valid to assess the significance of a model comparison by considering mixse as the standard error of sum(mixdif)?

thanks

Greg

It doesn’t matter how you compute the pointwise elpd values. Assuming the pointwise elpd values are computed relatively accurately, model comparison based on the difference and se follows the same approach. See also Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison for more details on when the elpd_diff and se_diff based normal approximation for the uncertainty in the difference can be expected to be well calibrated.


Hello Greg,

I had a brief look at the reference you provided; I assume that what they label as \hat{ELPD}_{loo} is the standard approximation using the posterior distribution, or PSIS, correct?

I am not an expert in the model comparison literature, but I think it’s a safe bet to assume that if you believe elpd_mixis to be more accurate for a given model than elpd_psis-loo, due to the presence of outliers or a high-dimensional setting, then any decision taken on those estimates will be more robust; hence, in this case, model comparison.

But again, this has to be assessed on a case-by-case basis, taking into account the things I mentioned in the comment above, and it cannot be seen as a silver-bullet solution.