N_eff BDA3 vs. Stan

The usual notation, including in BDA3, is what you write, but here more specifically
N_eff = N / (1 + 2\sum_{t=1}^T \hat{\rho}_t),
where T is the last lag to include and \hat{\rho}_t is the empirical autocorrelation at lag t.
By definition \rho_0=1.

In the above example we have
\hat{\rho}_1 \approx -0.055
\hat{\rho}_2 \approx 0.038
\hat{\rho}_3 \approx -0.007
\hat{\rho}_4 \approx -0.011
\hat{\rho}_5 \approx 0.005

Geyer’s paired sums are
\rho_0+\hat{\rho}_1 \approx 0.945 (positive, include in the sum)
\hat{\rho}_2+\hat{\rho}_3 \approx 0.031 (positive, include in the sum)
\hat{\rho}_4+\hat{\rho}_5 \approx -0.005 (negative, stop, don’t include in the sum)

Thus we include lags 1, 2, and 3
\hat{\rho}_1 + \hat{\rho}_2 + \hat{\rho}_3 \approx -0.024

1 + 2\sum_{t=1}^T \hat{\rho}_t \approx 0.953

N_eff = N / (1 + 2\sum_{t=1}^T \hat{\rho}_t) \approx 4198 > 4000
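
For concreteness, here is a minimal sketch of that computation in Python/numpy (a hypothetical standalone function, not the chains.hpp implementation; it treats the draws as a single chain and ignores Stan’s split-chain handling):

import numpy as np

def geyer_n_eff(x, max_lag=None):
    # N_eff using Geyer's initial positive sequence criterion (single-chain sketch)
    x = np.asarray(x, dtype=float)
    n = len(x)
    if max_lag is None:
        max_lag = n - 1
    xc = x - x.mean()
    var = np.mean(xc * xc)
    # empirical autocorrelations rho_0, rho_1, ..., with rho_0 = 1 by definition
    rho = np.array([np.mean(xc[:n - t] * xc[t:]) / var for t in range(max_lag + 1)])
    # add paired sums (rho_0 + rho_1), (rho_2 + rho_3), ... while they stay positive
    rho_sum = 0.0
    t = 0
    while t + 1 < len(rho):
        pair = rho[t] + rho[t + 1]
        if pair <= 0:
            break
        rho_sum += pair
        t += 2
    rho_sum -= 1.0   # drop rho_0 = 1 so the sum runs over t = 1, ..., T
    return n / (1.0 + 2.0 * rho_sum)

With the rounded autocorrelations above the included paired sums are 0.945 and 0.031, so this gives 4000 / 0.952 ≈ 4200, matching the ≈ 4198 computed from the unrounded values; on iid draws it should hover around N, sometimes above and sometimes below.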

Geyer uses autocovariances instead of autocorrelations, but that’s the same up to scaling. Geyer also writes an alternative form
-1 + 2\sum_{t=0}^T \hat{\rho}_t,
where the summing starts from t=0 and the leading 1 becomes -1 (since \hat{\rho}_0 = 1, the two forms are equal).

We now have two different sums:

  1. \sum_{t=1}^T \hat{\rho}_t can be negative
  2. \sum_{t=0}^T \hat{\rho}_t is always positive, but can be smaller than 1

Starting the sum from lag 0 has the benefit that it is enough to store the paired-sum values during the computation and then add them together. Starting the sum from lag 1 has the benefit that it is the form more commonly used in articles and textbooks.

Ah, right. The 1 gets pulled out of the sum, breaking up the pairs that are used in the termination criterion. The form

(1 + 2\sum_{t=1}^T \hat{\rho}_t)

is definitely preferred for the reasons that you state.

If we change the code to use the paired-sum termination criterion, so that all of the included empirical pair sums are positive, then I would be okay with reporting N_eff > N.


I’ll double-check my changes in chains.hpp and make a pull request in a few days.
I’ll make it report n_eff without capping.

When calculating standard errors of means, should you use n_eff directly
when n_eff > N?

If we have anti-correlated MCMC, could we get more than one effective draw per iteration?

If I have a Markov chain with two states, emit 0 and emit 1 with:

Pr[1 | 1] = 0.5     Pr[0 | 0] = 0.5
Pr[0 | 1] = 0.5     Pr[1 | 0] = 0.5

then we get independent draws of a coin flip.

But if we have

Pr[1 | 1] = 0.01     Pr[0 | 0] = 0.01
Pr[0 | 1] = 0.99     Pr[1 | 0] = 0.99

then we are anti-correlated. Will this give me more than 1 effective draw per iteration?

Yes. Relevant keywords are super-efficient sampling, antithetic, and over-relaxation. Usually it’s really difficult to get n_eff > N, although some toy examples can be found in the literature (too late in Finland, I’ll find some references tomorrow). Mike says that in NUTS “sampling from the trajectory also biases to states away from the initial point”. This is similar to over-relaxation (see, e.g., http://www.cs.toronto.edu/~radford/over.abstract.html), which seems to lead to super-efficient behavior in many cases in Stan.


Thanks. I guess it’s pretty obvious for the totally anti-correlated case. The error goes down as 1/N rather than 1/sqrt(N).
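
A quick numerical check of that (a rough sketch, assuming numpy; it uses a 0.99 flip probability rather than the fully deterministic alternating chain, and compares the error of the chain mean to the error independent coin flips would give):

import numpy as np

rng = np.random.default_rng(1)

def run_chain(p_flip, n):
    # two-state 0/1 chain that flips state with probability p_flip at each step
    x = np.empty(n, dtype=int)
    x[0] = rng.integers(2)
    flips = (rng.random(n - 1) < p_flip).astype(int)
    for i in range(1, n):
        x[i] = x[i - 1] ^ flips[i - 1]
    return x

n, reps = 1000, 2000
for p_flip, label in [(0.5, "independent"), (0.99, "anti-correlated")]:
    means = np.array([run_chain(p_flip, n).mean() for _ in range(reps)])
    err = np.sqrt(np.mean((means - 0.5) ** 2))   # true mean is 0.5
    iid_err = 0.5 / np.sqrt(n)                   # se of the mean of n fair coin flips
    print(label, "error relative to iid:", err / iid_err)

The first case should print roughly 1 and the second roughly 0.1, i.e. the anti-correlated chain gives the accuracy of far more than one independent draw per iteration.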

Good point about the bias in NUTS—it’s a subtle end tweak that makes a big difference in efficiency.

Here are experimental results comparing the current N_eff computation to a version using Geyer’s initial positive sequence. For the first three examples the true parameter values are known, and for the 8 schools example the true values were estimated using 4 million draws. The reference result for sampling efficiency is obtained by repeating the sampling 100-1000 times. Each repeat used 4 chains with a total of 4000 draws saved after warm-up.

Normal distribution with diagonal covariance and increasing number of dimensions

data {
  int<lower=0> D;           // number of dimensions
}
parameters {
  vector[D] mu;
}
model {
  mu ~ normal(0, 1);
}

NUTS can be superefficient (antithetic Markov chain), but the current estimate caps N_eff<=N.

Normal distribution with diagonal covariance, constrained to positive values, and increasing number of dimensions. Due to the constraint the posterior distribution is not Gaussian.

data {
  int<lower=0> D;           // number of dimensions
}
parameters {
  vector<lower=0>[D] mu;
}
model {
  mu ~ normal(0, 1);
}

Again the current estimate caps N_eff<=N, while true N_eff can be larger than N.

Normal distribution with all diagonal values set to 1 and all off-diagonal values set to rho, for varying rho and number of dimensions D=16 or D=64.

data {
  int<lower=0> D;           // number of dimensions
  real<lower=0,upper=1> s;  // correlation (rho in the text)
}
transformed data {
  vector[D] zero = rep_vector(0.0, D);
  cov_matrix[D] S;
  cholesky_factor_cov[D] L;
  for (i in 1:D) {
    for (j in 1:(i-1)) {
      S[i, j] = s;
      S[j, i] = s;
    }
    S[i, i] = 1.0;
  }
  L = cholesky_decompose(S);
}
parameters {
  vector[D] mu;
}
model {
  mu ~ multi_normal_cholesky(zero, L);
}

Upper lines (with higher N_eff) are for D=16 and bottom lines (with lower N_eff) are for D=32. Again the current estimate caps N_eff <= N, while the true N_eff can be larger than N. In addition we see that the current N_eff estimate can also overestimate N_eff. This is likely due to truncating the autocorrelation series too early.

Finally, 8 schools with a non-centered parameterization

data {
  int<lower=0> J;           // number of schools
  vector[J] y;              // estimated treatment effects
  vector<lower=0>[J] sigma; // s.e. of effect estimates
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] eta;
}
transformed parameters {
  vector[J] theta;
  theta = mu + tau * eta;
}
model {
  eta ~ normal(0, 1);
  tau ~ normal(0, 5);
  y ~ normal(theta, sigma);
}

The difference between the estimate from repeated simulations and the estimate using Geyer’s initial positive sequence is mostly due to the finite number of repeated Stan runs.

In most practical cases N_eff is probably rarely much higher than N, but all of these results show that N_eff estimated using Geyer’s initial positive sequence is more accurate.

I also ran the same experiments with Geyer’s initial monotone sequence estimator. The results are really close to those with the initial positive sequence estimator (as in my old experiments years ago, too).

This is interesting. We can also look into non-normal distributions. For example, if a quantity of interest is Cauchy distributed, then R-hat is useless, but one could still have success with quantile-based coverage methods, or an R-hat-like quantity constructed from MAD_SD.

@avehtari, just as a sanity check, does the interpretation of the more optimistic n_eff make sense when using it for estimating standard error of the mean?

I think I know how to set up a simulation to check, but I won’t get to it until after StanCon. I didn’t see that sort of verification in the thread, but it wouldn’t surprise me if you’ve already done that. (I don’t know how to sample better than true iid, but I believe the math may work out. I’d just want a little bit of empirical validation before claiming that Stan is able to sample an iid Gaussian better [per iteration, not per unit time] than an RNG.)

Yes. In my post N_eff BDA3 vs. Stan, the four figures show the simulation check, with the realized true errors of the mean shown as a dashed line. As you can see, the version using Geyer’s initial positive sequence estimates that more accurately, and yes, NUTS can sometimes be super-efficient and get lower error than independent draws.

Thanks! I didn’t realize that’s what the dotted line was. How did you compute the dotted line? I just want to make sure I understand it. (I didn’t get it from the description in the post.)

Each simulation uses the default Stan settings with 4 chains and 1000 saved draws per chain. For each simulation I compute the empirical mean for each parameter. For the normal, truncated normal, and correlated normal I run 100 simulations, and the true mean is known exactly. For 8 schools I run 1000 simulations, and the true mean is approximated using 4 million draws from the posterior. I then compute the mean square error of the empirical means (compared to the true mean) and compare it to the theoretical expected mean square error of independent draws (which can be computed analytically for the normal and is estimated from the posterior variance for 8 schools). The ratio of these gives the relative efficiency of the NUTS sample. I didn’t post the code because it’s in Matlab, but I can post it if you want to check the details.

I’m starting simple and trying to understand the dashed line for the first example in this post, which is just the normal. I still don’t understand what the dashed line is. Maybe the code would help?


If you ran the same code on iid draws from normals, I’d expect Geyer’s initial positive sequence estimate to hover close to 1 (with noise)?

Here’s the code (I used MatlabStan, as recompiling the stansummary binary was easier for me than figuring out what to do in rstan). MatlabStan uses CmdStan, which leaves the csv files behind. The dashed line is computed using rmse; in the plotting line it’s squared to get mse, and 1./4000 is the analytic result for independent draws.

% repeat the sampling and summarize the draws with three variants of stansummary:
%   _ipse = Geyer's initial positive sequence estimator
%   _imse = Geyer's initial monotone sequence estimator
%   _old  = the current (capped) estimator
for di=1:11
    for i2=1:100
        dat = struct('D',D);
        fitn = stan('file',model,'data',dat,'sample_file','schools','file_overwrite',true,'verbose',false,'seed', i2, 'iter', 2000);
        ss=unix('rm -f foo.csv;/u/77/ave/unix/proj/cmdstan/bin/stansummary_ipse --sig_figs=8 schools-1 schools-2 schools-3 schools-4 --csv_file=foo.csv > /dev/null');
        unix('egrep -v "^#" foo.csv > bar.csv');
        warning off
        warning on
        ss=unix('rm -f foo.csv;/u/77/ave/unix/proj/cmdstan/bin/stansummary_imse --sig_figs=8 schools-1 schools-2 schools-3 schools-4 --csv_file=foo.csv > /dev/null');
        unix('egrep -v "^#" foo.csv > bar.csv');
        warning off
        warning on
        ss=unix('rm -f foo.csv;/u/77/ave/unix/proj/cmdstan/bin/stansummary_old --sig_figs=8 schools-1 schools-2 schools-3 schools-4 --csv_file=foo.csv > /dev/null');
        unix('egrep -v "^#" foo.csv > bar.csv');
        warning off
        warning on

hold on
line(xlim,[1 1],'color','k','linestyle',':')  % dotted reference line: independent draws (N_eff/N = 1)
xlim([0 12])
ylim([0 2.6])
xlabel('Dimensions D')
legend([h1 h2 h3],'Current','Geyer''s initial positive','Estimate from repeated simulations','location','southeast')

I generated 4x1000 iid draws, so that the split-chain part stays the same. Here are violin plots with Geyer’s initial positive (blue) and with the current estimator (red, with some jitter for the exact 4000 values). This gives an idea of the overall accuracy of the n_eff estimates.

I’m assuming the dotted line is the empirical effective error. That is, if you run 1000 simulations and know the true value of the posterior mean, you can get an empirical measurement of the mcmc standard error (mcmc_se). Then given that you also know the posterior standard deviation (sd), you can use the formula

mcmc_se = sd / sqrt(n_eff)

to solve for n_eff given the known sd and empirically calculated mcmc_se.
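
Something like this, I mean (a rough sketch with hypothetical names, assuming the per-run posterior means for one quantity are already collected; this is not Aki’s Matlab code):

import numpy as np

def empirical_n_eff(run_means, true_mean, true_sd, n_draws):
    # n_eff implied by the observed error of the MCMC mean over many repeated runs
    mcmc_se = np.sqrt(np.mean((np.asarray(run_means) - true_mean) ** 2))
    n_eff = (true_sd / mcmc_se) ** 2
    return n_eff, n_eff / n_draws   # n_eff and the relative efficiency n_eff / N

# e.g. for the standard normal example: true_mean = 0, true_sd = 1, n_draws = 4000;
# a relative efficiency above 1 would mean beating independent draws for that quantity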

Is that what’s going on here, Aki?

I’m not sure how well the different line types display. The dotted line at 1 is the theoretical performance of independent draws. The dashed line, which is close to Geyer’s initial positive etc., is the empirical effective error with 100 or 1000 simulations.

Yes. For the normal and truncated normal we know the true values, and for 8 schools they are estimated with 4 million draws from the posterior.

Yes. For the normal and truncated normal I also use the knowledge of the true posterior variance. For 8 schools I estimate sd.

Thanks, @Bob_Carpenter! That’s exactly what I was thinking about.

I think I’m looking for a different line. It doesn’t sound like the dashed line is actually the n_eff computed from this formula:

or going the other way:

n_eff = (sd / mcmc_se)^2

I didn’t see that sort of estimate happening in the Matlab code.

It’s these two lines


True mu=0 and sd=1. And since I plot N_eff/N, there’s also 1./4000. Not the clearest way to write it, but I was in a hurry.

Thanks, @avehtari! It’s been a long time since I read Matlab code. I’ll need a few minutes to decompress that line. (The first line; the second one is clear.)