Marginalisation and efficiency

This is a conceptual question and very likely stems from my own ignorance. Suppose you have a (toy) model like this:
Y \sim \text{Poisson}(\lambda),
\lambda \sim \text{Exponential}(1/\theta),
\theta \sim \text{Inverse-Gamma}(\alpha, \beta).

One can specify a program like

data {
  int<lower=0> K;
  int<lower=0> y[K];
  real<lower=0> alpha;
  real<lower=0> beta;
}
parameters {
  real<lower=0> lambda;
  real<lower=0> theta;
}
model {
  target += poisson_lpmf(y | lambda);             // likelihood
  target += exponential_lpdf(lambda | 1 / theta); // lambda ~ Exponential(1/theta)
  target += inv_gamma_lpdf(theta | alpha, beta);  // theta ~ Inverse-Gamma(alpha, beta)
}

to sample from this model and it works really well, as expected.
One can try to be smart and integrate out \theta, specifying the program

data {
  int<lower=0> K;
  int<lower=0> y[K];
  real<lower=0> alpha;
  real<lower=0> beta;
}
parameters {
  real<lower=0> lambda;
}
model {
  target += poisson_lpmf(y | lambda); // likelihood
  // log of the marginal prior p(lambda) = alpha * beta^alpha / (lambda + beta)^(alpha + 1)
  target += log(alpha) + alpha * log(beta) - (alpha + 1) * log(lambda + beta);
}
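For reference, the closed form in the last target += line is what you get by integrating the exponential prior against the inverse-gamma (the marginal is a Lomax density):

p(\lambda) = \int_0^\infty \text{Exponential}(\lambda \mid 1/\theta)\, \text{Inverse-Gamma}(\theta \mid \alpha, \beta)\, d\theta = \frac{\alpha \beta^\alpha}{(\lambda + \beta)^{\alpha + 1}},

so \log p(\lambda) = \log(\alpha) + \alpha \log(\beta) - (\alpha + 1) \log(\lambda + \beta).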

This program also samples well, but achieves a much lower ESS. Here's the output for the full model and for the marginalised one:

Inference for Stan model: toy_full.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
lambda     1.05    0.00 0.03     0.99     1.03     1.05     1.07     1.11  2925    1
theta      5.45    0.11 4.71     1.51     2.77     4.10     6.46    17.90  1995    1
lp__   -1329.94    0.03 1.05 -1332.78 -1330.34 -1329.61 -1329.19 -1328.92  1582    1

Inference for Stan model: toy_marginalised.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
lambda     1.05    0.00 0.03     0.99     1.03     1.05     1.07     1.12  1253    1
lp__   -1329.69    0.02 0.71 -1331.67 -1329.85 -1329.42 -1329.24 -1329.19  1541    1

I naively expected the marginalisation to improve sampling efficiency, but I guess there’s no guarantee that marginalising leads to better geometry, eh?

Tagging @betanalpha @avehtari @andrewgelman @bgoodri @Bob_Carpenter and @martinmodrak and apologising in advance if this is a stupid/trivial question.


A quick thought: efficiency is usually measured in ESS per unit of time, not ESS per iteration. Is the decrease in ESS larger or smaller than the decrease in runtime? I'll leave discussion of the actual causes to others; I would just be speculating.
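Something like this rough R sketch (assuming the two fits are the stanfit objects full.samples and marginalised.samples used later in the thread) would give ESS per second for both versions:

library(rstan)

# ESS per second of post-warmup sampling, for a single parameter
ess_per_second <- function(fit, par = "lambda") {
  ess <- summary(fit)$summary[par, "n_eff"]
  # get_elapsed_time() reports warmup and sampling time per chain, in seconds
  sampling_time <- sum(get_elapsed_time(fit)[, "sample"])
  ess / sampling_time
}

ess_per_second(full.samples)
ess_per_second(marginalised.samples)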


As a rough guess, the difference might be due to the loss of analytical gradients when moving away from the built-in lpdf functions, since the auto-diffed gradients might not be as efficient (or as resistant to over-/under-flow) as the analytical ones.

  • in low-dimensional examples there can be just a small difference; marginalisation has a bigger effect when marginalising over many dimensions
  • can you show n_leapfrog__ (e.g. with mean(rstan::get_num_leapfrog_per_iteration(fit)))?
  • can you show Tail_ESS (e.g. with a recent RStan, rstan::monitor(fit))?

Yeah, sorry, I should've done that. Run times are usually larger for the marginalised version, so I didn't bother.

This is a very insightful comment. That might very well be it, although it'd be a bit of work to test: I guess I could write out the marginalised density of \lambda in C++ and equip it with an analytical gradient.
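For what it's worth, under the marginal prior derived above the gradient itself is simple:

\frac{\partial}{\partial \lambda} \log p(\lambda) = -\frac{\alpha + 1}{\lambda + \beta},

and the Poisson term contributes the usual \sum_{i=1}^{K} y_i / \lambda - K.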

Yeah, that’s the realisation I arrived at as well.

Yup, here it is:

Yes:

> rstan::monitor(full.samples)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 0):

           Q5    Q50    Q95   Mean  SD  Rhat Bulk_ESS Tail_ESS
lambda    1.4    1.5    1.7    1.5 0.1     1     2564     2393
theta     0.4    0.9    3.0    1.2 1.0     1     2500     2063
lp__   -315.0 -312.7 -312.1 -313.0 1.0     1     1940     2001

> rstan::monitor(marginalised.samples)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 0):

           Q5    Q50    Q95   Mean  SD  Rhat Bulk_ESS Tail_ESS
lambda    1.4    1.5    1.7    1.5 0.1     1     1504     1913
lp__   -313.7 -311.8 -311.6 -312.1 0.8     1     1808     2158

So we see the expected difference, i.e. the marginalised distribution is easier to sample from and requires fewer log-density evaluations.

You need to define the warmup argument.

Right, but how does that relate to ESS? Should we not have higher ESS then?

This is weird. I thought warmup was set to floor(iterations/2) by default. Is this not the case anymore?

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C               LC_TIME=pt_BR.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.19.2       ggplot2_3.2.1      StanHeaders_2.19.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3         pillar_1.4.2       compiler_3.6.1     prettyunits_1.0.2  tools_3.6.1       
 [6] digest_0.6.22      pkgbuild_1.0.6     tibble_2.1.3       gtable_0.3.0       pkgconfig_2.0.3   
[11] rlang_0.4.1        cli_1.1.0          rstudioapi_0.10    parallel_3.6.1     loo_2.1.0         
[16] gridExtra_2.3      withr_2.1.2        dplyr_0.8.3        stats4_3.6.1       grid_3.6.1        
[21] tidyselect_0.2.5   glue_1.3.1         inline_0.3.15      R6_2.4.0           processx_3.4.1    
[26] callr_3.3.2        purrr_0.3.3        magrittr_1.5       scales_1.0.0       ps_1.3.0          
[31] codetools_0.2-16   matrixStats_0.55.0 assertthat_0.2.1   colorspace_1.4-1   labeling_0.3      
[36] KernSmooth_2.23-15 lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4

I think this is a bug in the output of rstan::monitor, which sets the warmup to 0 for a stanfit object. I haven’t worked out yet if it’s only a display problem or if it affects computations (I doubt it).

ESS depends on how far the simulation moves in expectation. Here, in both cases, we end up about equally far, but in an easier distribution fewer steps are needed before a U-turn. In addition to ESS, it is useful to look at ESS per leapfrog step for efficiency. Sometimes we also care about ESS per second.

Not in monitor() (which assumes you didn’t save the warmup iterations).


Computing ESS/sum(leapfrog steps) does show that the marginalised model is more efficient, even if not by orders of magnitude. As discussed above, this is expected given the low dimensionality of the problem.
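For reference, a minimal sketch of that computation (assuming the same stanfit objects and the rstan helpers mentioned above):

library(rstan)

# Bulk ESS per leapfrog step, with leapfrog steps summed over all post-warmup iterations and chains
ess_per_leapfrog <- function(fit, par = "lambda") {
  bulk_ess <- monitor(fit, print = FALSE)[par, "Bulk_ESS"]
  total_leapfrogs <- sum(get_num_leapfrog_per_iteration(fit))
  bulk_ess / total_leapfrogs
}

ess_per_leapfrog(full.samples)
ess_per_leapfrog(marginalised.samples)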
