bayes_R2, loo_R2 and hs() priors

I am exploring a fairly straightforward multiple regression model:

fit <- stan_glm(Photo ~ WD + Ks + KL_Units + BW_dry_avg + Thick_cm + Area_cm2 + LMA_g.cm2 + 
                LWR + LD_g.cm3 + Length_mm_Top + Density_mm2_Top + SPI_Top,
                data = dat,
                family = gaussian(),
                refresh = 0)

                Median MAD_SD
(Intercept)      0.000  0.080
WD               0.058  0.104
Ks               0.109  0.114
KL_Units        -0.048  0.114
BW_dry_avg      -0.078  0.181
Thick_cm        -0.407  0.622
Area_cm2         0.021  0.168
LMA_g.cm2        0.144  0.577
LWR             -0.337  0.172
LD_g.cm3         0.140  0.423
Length_mm_Top   -1.820  0.566
Density_mm2_Top -1.021  0.343
SPI_Top          0.930  0.327

Since the number of observations (151) is fairly small relative to the number of covariates, I also ran the model using regularized horseshoe priors:

n <- nrow(dat)                      # number of observations
D <- ncol(fit_default_df) - 2       # number of covariates in the model
p0 <- 3                             # prior guess for the number of relevant covariates
tau0 <- p0/(D - p0) * 1/sqrt(n)     # global scale for the regularized horseshoe
prior_coeff <- hs(global_scale = tau0, slab_scale = 1)
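
(For reference, with D = 12 covariates, p0 = 3, and n = 151, this works out to a global scale of roughly 0.027:)

3/(12 - 3) * 1/sqrt(151)   # approximately 0.027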

fit_hs <- stan_glm(Photo ~ WD + Ks + KL_Units + BW_dry_avg + Thick_cm + Area_cm2 + LMA_g.cm2 + 
                   LWR + LD_g.cm3 + Length_mm_Top + Density_mm2_Top + SPI_Top,
                   data = dat,
                   family = gaussian(),
                   refresh = 0,
                   prior = prior_coeff)

                Median MAD_SD
(Intercept)      0.000  0.079
WD               0.000  0.018
Ks               0.000  0.018
KL_Units         0.000  0.018
BW_dry_avg      -0.004  0.022
Thick_cm        -0.038  0.066
Area_cm2         0.000  0.020
LMA_g.cm2       -0.003  0.025
LWR             -0.097  0.144
LD_g.cm3         0.028  0.055
Length_mm_Top   -0.026  0.055
Density_mm2_Top  0.000  0.022
SPI_Top         -0.003  0.024

So far so good. I was curious to see how well (in an R^2 sense) the model with regularized horseshoe priors did compared to the one with default priors.

median(bayes_R2(fit))
[1] 0.237
median(bayes_R2(fit_hs))
[1] 0.065

(I truncated the output to 3 decimals.) Interestingly, loo_R2() shows a much smaller difference between the two models:

median(loo_R2(fit))
[1] 0.058
median(loo_R2(fit_hs))
[1] 0.027

Are differences as large as what I saw in bayes_R2() expected? Is there a reason why comparing bayes_R2 between models with different priors doesn’t make sense? Is there a reason why the difference in loo_R2() appears to be smaller?

Kent

For small datasets, perhaps.

It evaluates the models using the same data they were conditioned on, which distorts things a bit in small samples.

Because it estimates the error for the i-th observation using all but the i-th observation, it is much less prone to the problem just mentioned.

Thank you. I haven’t looked at the algebra behind loo_R2() yet, but I understand (at least in principle) elpd_loo and the like. My understanding is that elpd_loo is conceptually similar to (if not the same thing as) the average conditional predictive ordinate that Gelfand and Dey suggested 10 or 20 years ago. Does loo_R2() use CPOs (or an analog), or is it based on the predictions themselves, making it analogous to the Predictive Error Sum of Squares (PRESS) from the 70s or 80s?

The code for it is barely 50 lines long.

I think you would say that it is more similar to PRESS.
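
Not the actual rstanarm implementation, but here is a rough point-estimate sketch of that PRESS-like idea, using PSIS weights from the loo package to get leave-one-out predictive means without refitting (the real loo_R2() also propagates uncertainty, so treat this only as the conceptual core):

library(loo)
ll <- log_lik(fit)             # S x n matrix of pointwise log-likelihoods
mu <- posterior_linpred(fit)   # S x n draws of the conditional mean of each y_i (Gaussian, identity link)

# PSIS-smoothed importance weights for leaving out each observation in turn
# (r_eff omitted here for brevity; see ?psis)
psis_obj <- psis(-ll)

# leave-one-out predictive mean of each y_i, no refitting needed
mu_loo <- E_loo(mu, psis_obj, type = "mean", log_ratios = -ll)$value

# PRESS-style R^2 from the leave-one-out residuals
y <- dat$Photo
1 - sum((y - mu_loo)^2) / sum((y - mean(y))^2)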

Thank you. This is extremely helpful. (And thank you for your patience.)

They are believable. If you show the full output of loo(fit) and loo(fit_hs), I can say more.

Also, instead of the median, try plotting histograms, e.g.

library(ggplot2)
ggplot() + geom_histogram(aes(x=bayes_R2(fit)), breaks=seq(0,1,length.out=100)) +
  xlim(c(0,1)) +
  scale_y_continuous(breaks=NULL) +
  labs(x="Bayesian R^2", y="")

and

ggplot() + geom_histogram(aes(x=loo_R2(fit)), breaks=seq(0,1,length.out=100)) +
  xlim(c(0,1)) +
  scale_y_continuous(breaks=NULL) +
  labs(x="LOO R^2", y="")

It’s the same, but with more robust computation and a self-diagnostic (and it’s been almost 30 years).
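
In case it helps to see the correspondence: the classical CPO is the harmonic-mean estimate of the leave-one-out predictive density for each observation, and PSIS-LOO estimates the same pointwise quantity with smoothed weights plus the Pareto-k diagnostic. A rough sketch (the harmonic-mean version can be numerically unstable, which is part of why PSIS is preferable):

ll <- log_lik(fit)                    # S x n pointwise log-likelihoods

# classical CPO_i: harmonic mean of the likelihood over posterior draws
log_cpo <- -log(colMeans(exp(-ll)))

# PSIS-LOO pointwise elpd targets the same log p(y_i | y_-i)
elpd_i <- loo(fit)$pointwise[, "elpd_loo"]

plot(log_cpo, elpd_i,
     xlab = "log CPO (harmonic mean)", ylab = "PSIS-LOO pointwise elpd")
abline(0, 1)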

See also Bayesian R2 and LOO-R2

The histograms look quite different from one another.

Here’s bayes_R2(fit):

[histogram of bayes_R2(fit)]

and here’s loo_R2(fit):

[histogram of loo_R2(fit)]

> loo(fit)

Computed from 4000 by 151 log-likelihood matrix

         Estimate   SE
elpd_loo   -211.1 11.1
p_loo        14.2  2.3
looic       422.2 22.2
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     150   99.3%   478       
 (0.5, 0.7]   (ok)         0    0.0%   <NA>      
   (0.7, 1]   (bad)        1    0.7%   63        
   (1, Inf)   (very bad)   0    0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
Warning message:
Found 1 observation(s) with a pareto_k > 0.7. We recommend calling 'loo' again with argument 'k_threshold = 0.7' in order to calculate the ELPD without the assumption that these observations are negligible. This will refit the model 1 times to compute the ELPDs for the problematic observations directly.

Re-running with k_threshold = 0.7:

> loo(fit, k_threshold = 0.7)
1 problematic observation(s) found.
Model will be refit 1 times.

Fitting model 1 out of 1 (leaving out observation 102)

Computed from 4000 by 151 log-likelihood matrix

         Estimate   SE
elpd_loo   -211.1 11.1
p_loo        14.1  2.3
looic       422.1 22.2
------
Monte Carlo SE of elpd_loo is 0.2.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> loo(fit_hs)

Computed from 4000 by 151 log-likelihood matrix

         Estimate   SE
elpd_loo   -213.0 10.4
p_loo         8.1  1.5
looic       426.1 20.9
------
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

There are 14 parameters in the regression (12 covariates plus the intercept and sigma), which is close to p_loo in the model with default priors and quite a bit larger than p_loo in the model with regularized horseshoe priors.

It’s hard to believe that it’s been almost 30 years since Gelfand and Dey, but come to think of it, I remember Dipak telling me about CPOs about 15 years ago, and the paper was 10-15 years old then!

Thanks for your help. I’ve seen the GitHub page before. It’s very useful.

Kent

But not surprisingly different.

The loo diagnostics look good, so I’d say you can trust the LOO-R2 results.

Thank you for the very helpful advice. After reading through https://avehtari.github.io/bayes_R2/bayes_R2.html again, I think I have a good handle on why it’s not surprising that bayes_R2() and loo_R2() give rather different results. Let me see if I’ve got this right. For a linear regression:

  1. bayes_R2() uses $(\sigma^2)^{(s)}$, rather than $\frac{1}{n-1}\sum_{i=1}^n (y_i - \hat y_i^{(s)})^2$, as its estimate of the residual variance at iteration $s$, where $\hat y_i^{(s)}$ is the estimate of $y_i$ at iteration $s$.
  2. loo_R2() instead uses $\frac{1}{n-1}\sum_{i=1}^n (y_i - \hat y_{\mathrm{loo},i}^{(s)})^2$ as its estimate of the residual variance, where $\hat y_{\mathrm{loo},i}^{(s)}$ is the leave-one-out estimate of $y_i$ at iteration $s$.

We expect loo_R2() < bayes_R2(), because the residuals in loo_R2() are deviations from a prediction that doesn’t include the observation rather than from one that does, so their variance tends to be larger than what $(\sigma^2)^{(s)}$ reflects.
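
In symbols, as I read the notebook (the actual functions also propagate posterior and resampling uncertainty, so this is only the point-estimate skeleton):

$$
R^2_{\text{bayes}}{}^{(s)} \;=\; \frac{\operatorname{Var}_i\!\left(\hat y_i^{(s)}\right)}{\operatorname{Var}_i\!\left(\hat y_i^{(s)}\right) + (\sigma^2)^{(s)}},
\qquad
R^2_{\text{loo}} \;=\; 1 - \frac{\operatorname{Var}_i\!\left(y_i - \hat y_{\text{loo},i}\right)}{\operatorname{Var}_i\!\left(y_i\right)},
$$

where $\operatorname{Var}_i$ denotes the sample variance over the $n$ observations and $\hat y_{\text{loo},i}$ is the leave-one-out predictive mean of $y_i$.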