Marginalisation and efficiency

This is a conceptual question and very likely stems from my own ignorance. Suppose you have a (toy) model like this:
Y \sim \text{Poisson}(\lambda),
\lambda \sim \text{Exponential}(1/\theta),
\theta \sim \text{Inverse-Gamma}(\alpha, \beta).

One can specify a program like

data {
  int<lower=0> K;
  int<lower=0> y[K];
  real<lower=0> alpha;
  real<lower=0> beta;
}
parameters {
  real<lower=0> lambda;
  real<lower=0> theta;
}
model {
  target += poisson_lpmf(y | lambda);             // likelihood
  target += exponential_lpdf(lambda | 1 / theta); // lambda ~ Exponential(1/theta)
  target += inv_gamma_lpdf(theta | alpha, beta);  // theta ~ Inverse-Gamma(alpha, beta)
}

to sample from this model and it works really well, as expected.
One can try to be smart and integrate out \theta, specifying the program

data {
  int<lower=0> K;
  int<lower=0> y[K];
  real<lower=0> alpha;
  real<lower=0> beta;
}
parameters {
  real<lower=0> lambda;
}
model {
  target += poisson_lpmf(y | lambda); // likelihood
  // log of the marginal prior p(lambda) = alpha * beta^alpha / (lambda + beta)^(alpha + 1)
  target += log(alpha) + alpha * log(beta) - (alpha + 1) * log(lambda + beta);
}
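For reference, the closed form in the last target += line is what you get by integrating the exponential prior against the inverse-gamma (the marginal is a Lomax density):

p(\lambda) = \int_0^\infty \text{Exponential}(\lambda \mid 1/\theta)\, \text{Inverse-Gamma}(\theta \mid \alpha, \beta)\, d\theta = \frac{\alpha \beta^\alpha}{(\lambda + \beta)^{\alpha + 1}},

so \log p(\lambda) = \log(\alpha) + \alpha \log(\beta) - (\alpha + 1) \log(\lambda + \beta).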

This program also samples well, but achieves a much lower ESS. Here's the output for the full model and for the marginalised one:

Inference for Stan model: toy_full.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
lambda     1.05    0.00 0.03     0.99     1.03     1.05     1.07     1.11  2925    1
theta      5.45    0.11 4.71     1.51     2.77     4.10     6.46    17.90  1995    1
lp__   -1329.94    0.03 1.05 -1332.78 -1330.34 -1329.61 -1329.19 -1328.92  1582    1

Inference for Stan model: toy_marginalised.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

           mean se_mean   sd     2.5%      25%      50%      75%    97.5% n_eff Rhat
lambda     1.05    0.00 0.03     0.99     1.03     1.05     1.07     1.12  1253    1
lp__   -1329.69    0.02 0.71 -1331.67 -1329.85 -1329.42 -1329.24 -1329.19  1541    1

I naively expected the marginalisation to improve sampling efficiency, but I guess there’s no guarantee that marginalising leads to better geometry, eh?

Tagging @betanalpha @avehtari @andrewgelman @bgoodri @Bob_Carpenter and @martinmodrak and apologising in advance if this is a stupid/trivial question.


A quick thought: efficiency is usually measured in ESS per unit of time, not ESS per iteration. Is the decrease in ESS larger or smaller than the decrease in runtime? I'll leave discussion of the actual causes to others; I would just be speculating.
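Something like this rough R sketch (assuming the two fits are the stanfit objects full.samples and marginalised.samples used later in the thread) would give ESS per second for both versions:

library(rstan)

# ESS per second of post-warmup sampling, for a single parameter
ess_per_second <- function(fit, par = "lambda") {
  ess <- summary(fit)$summary[par, "n_eff"]
  # get_elapsed_time() reports warmup and sampling time per chain, in seconds
  sampling_time <- sum(get_elapsed_time(fit)[, "sample"])
  ess / sampling_time
}

ess_per_second(full.samples)
ess_per_second(marginalised.samples)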


As a rough guess, the difference might be due to the loss of analytical gradients when moving away from the built-in lpdf functions, since the auto-diffed gradients might not be as efficient (or as resistant to over-/under-flow) as the analytical ones.

  • in low-dimensional examples there can be just a small difference; marginalisation has a bigger effect when marginalising over many dimensions
  • can you show n_leapfrog__ (e.g. with mean(rstan::get_num_leapfrog_per_iteration(fit)))?
  • can you show Tail_ESS (e.g. with a recent RStan, rstan::monitor(fit))?

Yeah, sorry, I should've done that. Run times are usually larger for the marginalised version, so I didn't bother.

This is a very insightful comment. That might very well be it, although it'd be a bit of work to test: I guess I could write out the marginalised density of \lambda in C++ and equip it with an analytical gradient.
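For what it's worth, under the marginal prior derived above the gradient itself is simple:

\frac{\partial}{\partial \lambda} \log p(\lambda) = -\frac{\alpha + 1}{\lambda + \beta},

and the Poisson term contributes the usual \sum_{i=1}^{K} y_i / \lambda - K.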

Yeah, that’s the realisation I arrived at as well.

Yup, here it is:

Yes:

> rstan::monitor(full.samples)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 0):

           Q5    Q50    Q95   Mean  SD  Rhat Bulk_ESS Tail_ESS
lambda    1.4    1.5    1.7    1.5 0.1     1     2564     2393
theta     0.4    0.9    3.0    1.2 1.0     1     2500     2063
lp__   -315.0 -312.7 -312.1 -313.0 1.0     1     1940     2001

> rstan::monitor(marginalised.samples)
Inference for the input samples (4 chains: each with iter = 2000; warmup = 0):

           Q5    Q50    Q95   Mean  SD  Rhat Bulk_ESS Tail_ESS
lambda    1.4    1.5    1.7    1.5 0.1     1     1504     1913
lp__   -313.7 -311.8 -311.6 -312.1 0.8     1     1808     2158

So we see the expected difference, i.e. the marginalised distribution is easier to sample from and requires fewer log-density evaluations.

You need to define the warmup argument.

Right, but how does that relate to ESS? Should we not have higher ESS then?

This is weird. I thought warmup was set to floor(iterations/2) by default. Is this not the case anymore?

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 19.10

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C               LC_TIME=pt_BR.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstan_2.19.2       ggplot2_3.2.1      StanHeaders_2.19.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3         pillar_1.4.2       compiler_3.6.1     prettyunits_1.0.2  tools_3.6.1       
 [6] digest_0.6.22      pkgbuild_1.0.6     tibble_2.1.3       gtable_0.3.0       pkgconfig_2.0.3   
[11] rlang_0.4.1        cli_1.1.0          rstudioapi_0.10    parallel_3.6.1     loo_2.1.0         
[16] gridExtra_2.3      withr_2.1.2        dplyr_0.8.3        stats4_3.6.1       grid_3.6.1        
[21] tidyselect_0.2.5   glue_1.3.1         inline_0.3.15      R6_2.4.0           processx_3.4.1    
[26] callr_3.3.2        purrr_0.3.3        magrittr_1.5       scales_1.0.0       ps_1.3.0          
[31] codetools_0.2-16   matrixStats_0.55.0 assertthat_0.2.1   colorspace_1.4-1   labeling_0.3      
[36] KernSmooth_2.23-15 lazyeval_0.2.2     munsell_0.5.0      crayon_1.3.4

I think this is a bug in the output of rstan::monitor, which sets the warmup to 0 for a stanfit object. I haven’t worked out yet if it’s only a display problem or if it affects computations (I doubt it).

ESS depends on how far the simulation moves in expectation. Here, in both cases, we end up about equally far, but in an easier distribution fewer steps are needed before a U-turn. In addition to ESS, it is useful to look at ESS per leapfrog step for efficiency. Sometimes we also care about ESS per second.

Not in monitor() (which assumes you didn’t save the warmup iterations).


Computing ESS/sum(leapfrog steps) does show that the marginalised model is more efficient, even if not by orders of magnitude. As discussed above, this is expected given the low dimensionality of the problem.
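For reference, a minimal sketch of that computation (assuming the same stanfit objects and the rstan helpers mentioned above):

library(rstan)

# Bulk ESS per leapfrog step, with leapfrog steps summed over all post-warmup iterations and chains
ess_per_leapfrog <- function(fit, par = "lambda") {
  bulk_ess <- monitor(fit, print = FALSE)[par, "Bulk_ESS"]
  total_leapfrogs <- sum(get_num_leapfrog_per_iteration(fit))
  bulk_ess / total_leapfrogs
}

ess_per_leapfrog(full.samples)
ess_per_leapfrog(marginalised.samples)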
