N_eff BDA3 vs. Stan


#1

I’m trying to compare n_eff computation described in BDA3 and the implemented code, and I think there is a small difference. The comment in the beginning says the implementation matches BDA3. BDA3 (at least version 20150505) describes truncation of the autocorrelation series using Geyer’s initial positive sequence (Geyer, 1992) where truncation “T is the first odd positive integer for which $\widehat{\rho}{T+1}+\widehat{\rho}{T+2}$ is negative.”. If I understand the code (stan/src/stan/mcmc/chains.hpp) it’s using “T is the last positive integer for which \widehat{\rho}_{T} is positive.”, that is, it’s not using the sum of odd and even lags for which Geyer had a theoretical argument.

Ping @syclik (he wrote the code)

      /**
       * Returns the effective sample size for the specified parameter
       * across all kept samples.
       *
       * The implementation matches BDA3's effective size description.
       *
       * Current implementation takes the minimum number of samples
       * across chains as the number of samples per chain.
       *
       * @param VectorXd
       * @param Dynamic
       * @param samples
       *
       * @return
       */
      double effective_sample_size(const Eigen::Matrix<Eigen::VectorXd,
                                   Dynamic, 1> &samples) const {
        int chains = samples.size();

        // need to generalize to each jagged samples per chain
        int n_samples = samples(0).size();
        for (int chain = 1; chain < chains; chain++) {
          n_samples = std::min(n_samples,
                               static_cast<int>(samples(chain).size()));
        }

        Eigen::Matrix<Eigen::VectorXd, Dynamic, 1> acov(chains);
        for (int chain = 0; chain < chains; chain++) {
          acov(chain) = autocovariance(samples(chain));
        }

        Eigen::VectorXd chain_mean(chains);
        Eigen::VectorXd chain_var(chains);
        for (int chain = 0; chain < chains; chain++) {
          double n_kept_samples = num_kept_samples(chain);
          chain_mean(chain) = mean(samples(chain));
          chain_var(chain) = acov(chain)(0)*n_kept_samples/(n_kept_samples-1);
        }

        double mean_var = mean(chain_var);
        double var_plus = mean_var*(n_samples-1)/n_samples;
        if (chains > 1)
          var_plus += variance(chain_mean);
        Eigen::VectorXd rho_hat_t(n_samples);
        rho_hat_t.setZero();
        double rho_hat = 0;
        int max_t = 0;
        for (int t = 1; (t < n_samples && rho_hat >= 0); t++) {
          Eigen::VectorXd acov_t(chains);
          for (int chain = 0; chain < chains; chain++) {
            acov_t(chain) = acov(chain)(t);
          }
          rho_hat = 1 - (mean_var - mean(acov_t)) / var_plus;
          if (rho_hat >= 0)
            rho_hat_t(t) = rho_hat;
          max_t = t;
        }
        double ess = chains * n_samples;
        if (max_t > 1) {
          ess /= 1 + 2 * rho_hat_t.sum();
        }
        return ess;
      }

#2

Ah, found in the manual

Exact autocorrelations can happen only on odd lags (Geyer, 2011). By summing over
pairs, the paired autocorrelation is guaranteed to be positive modulo estimator noise.
This is the motivation behind the many termination criterion of Geyer (2011).
Stan does not (yet) do the paired expectations because NUTS almost by construction avoids
the negative autocorrelation regime. Thus terminating at the first negative autocorre-
lation is a reasonable approximation for stopping when the noise in the autocorrela-
tion estimator dominates.

The manual text matches the code, but the comment in the code doesn’t match with the code.


#3

I love it when people answer their own questions. This kind of thing is why I always rant about comments in code!

I filed a code issue:

https://github.com/stan-dev/math/issues/683


#4

I’m preparing a pull request to fix the code to follow the algorithm described in BDA3. Mostly this code gives similar results although on average slightly smaller n_eff’s as it has less noise in the stopping rule and the truncation time is then on average larger.
One thing is that, sometimes the chains can be superefficient, so that odd lags (usually only the first) can be negative and n_eff can be larger than the number of MCMC draws. The current code ignores this and returns then just the number of draws. Previously when I believed the code is doing the BDA3 thing, I thought it was just intentionally capped to the number of draws to avoid questions about superefficiency.
My question now is that should we report superefficient sampling (which with simple models seems to happen quite easily) or should keep capping the n_eff to the number of draws (as the superefficiency is anyway mostly about 10% so we would just be conservative)?


#5

I was trying to remember why the drifted from BDA; I think it was the
implementation inside R2WinBUGS that inspired it. If it’s at all
interesting, I can dig into our old stan-dev mailing list to see if I can
find out why it went that way.

I’d want to cap n_eff at the number of draws.


#6

@avehtari – the sum of neighboring pairs of autocorrelations should always be positive (even if the odd lags are negative due to super-efficiency), and if they end up negative due to noise then the stopping rule should trigger, leaving only the previous positive autocorrelation-pair-sums. Consequently any of the Geyer-based algorithms should always return N_eff < N, no?

We shouldn’t simply report min(N_eff, N). If we don’t want to report super efficient results then we should use one of the Geyer sum-pair methods.


#7

No.

Here’s an example figure of Geyer’s initial positive sequence
geyer_ipse

Upper plot shows the autocorrelation estimated as described in BDA3. The autocorrelation with lag 1 is negative (-0.082) and the current chains.hpp code then cuts at t=0. The bottom plot shows the sums of even and odd lag autorrelation pairs. Geyer writes that series of these are positive, monotone and convex, and thus proposes cutting when the first one of the pairs is negative. This still allows negative autocorrelations for odd lags which is possible to happen. In this example, the first negative pair is for lags 4 and 5, and thus the last autocorrelation to be included is lag 3. Sum of autocorrelations from lags 0, 1, 2 and 3 is 0.945 and the relative efficiency is 106%. Usually the autocorrelation for lag 0 is shown in equations directly as 1, and then the sum of autocorrelations from lags 1, 2 and 3 is -0.055. I did test with longer chains, and the negative autocorrelation at lag 1 persists.

Years ago (my code is from 2002-2003) I experimented with this a lot, and I found it to work well and I trust it so much that I put it in BDA3. I experimented also with Geyer’s monotone and convex sequence estimators, but there wasn’t much difference and initial positive sequence estimator is slightly more conservative.

I would prefer to use Geyer’s monotone sequence estimator as described in BDA3, and I would not cap n_eff, but I would try to inform widely to not be surprised that n_eff can be larger than N. Some people have also wondered how it’s possible to get exactly n_eff equal to N.


#8

Ah, I think you’re missing a subtle point here. Geyer’s claim is that the sum of the pairs of the true autocorrelations are positive, convex, and monotone hence the effective sample size constructed from the pairs of true autocorrelations should always bounded by N. Of course we don’t have access to the true autocorrelations but rather estimators thereof which are corrupted by noise. That noise can cause the sum of the pairs of estimates autocorrelations to drop below zero indicating that the autocorrelations are no longer significant. Geyer then uses this as motivation for his stopping rules (either using the raw summed pairs of estimated autocorrelations or smoothed version of those estimators).

In other words, when you see the sum of the pairs of estimated autocorrelations drop below zero then the summation over autocorrelations should immediately stop ensuring that N_eff < N.


#9

Geyer’s initial sequence estimator is used only for selecting the truncation. Geyer (1992) writed “The odd-lag autocovariances need not be positive (though for a reversible chain the even lag must be)” and see also Theorem 3.1 which makes it quite clear that Geyer writes about behavior of true autocovariances. Given the same truncation point, summing individual autocovariance estimates or paired sums of estimates doesn’t change the result. I thought my figure would explain this, too.

Of course n_eff estimate has variance and thus we may observe \hat{n_eff}>N just by chance, as well as we can observe \hat{n_eff}<N even if we are examining independent draws. Btw. Geyer provides proof that any of his initial sequence estimators asymptotically overestimates the autocorrelation time, and thus we would expect asymptotically to underestimate n_eff.


#10

Yes the summed pairs are use to compute the truncation point, but by construction that also means that the truncated sum will always be positive as any excess negative contribution is beyond the truncation point by definition. No?

Also should the pairs start at lag 0 (lag 0 + lag 1, lag 2 + lag 3, etc as you have in your plot) or at lag 1 (lag 0 is ignored because it’s always 1, lag 1 + lag 2, lag 3 + lag 4, etc)? For the latter convention in your case we’d truncate after lag 0 because lag 1 + lag 2 would be negative and we’d end up with N_eff = N.

Part of the bias towards lower N_eff estimates with these methods is the conservative nature of the truncation.


#11

Truncated sum starting from lag 1 can be negative and is negative in the above example (-0.055 as mentioned above).

Pairs start at lag 0 (lag 0 + lag 1, lag 2 + lag 3, etc as I have in my plot, and as in BDA3 and as in Geweke’s paper). This comes from the theory. Pairs of true values are positive. Estimated lag 1 can not be less than -1, and thus lag 0 + lag 1 is always estimated to positive and lag 1 is always included. Tru lag 2 + lag 3 is positive, but estimated lag 2 + lag 3 can be negative, and if it’s negative we know that it’s due to noise and we can ignore these and truncate. Same for further lags.


#12

Here’s the variation of n_eff (using my code for theta[4] which is the same variable as in the above autocorrelation figure) when using the default 4 chains with 1000 iterations each (100 repetitions).

neff_geyer_ipse

The variation is high. In 72% cases n_eff is larger than 4000.

If you want to analyse Markov chains yourself, here’s the non-centered 8 schools code I used

data {
  int<lower=0> J;           // number of schools 
  vector[J] y;              // estimated treatment effects
  vector<lower=0>[J] sigma; // s.e. of effect estimates 
}
parameters {
  real mu; 
  real<lower=0> tau;
  vector[J] eta;
}
transformed parameters {
  vector[J] theta;
  theta = mu + tau * eta;
}
model {
  eta ~ normal(0, 1);
  tau ~ normal(0, 5);
  y ~ normal(theta, sigma);
}
generated quantities {
  vector[J] log_lik;
  for (j in 1:J)
    log_lik[j] = normal_lpdf(y[j] | theta[j], sigma[j]);
}

#13

Right, but following this logic the empirical sums included in the effective sample size calculation will always be positive (the first negative sum you find terminates the sum and gets thrown away) so for N_eff = N / (1 + sum empirical partial autocross) we’ll always have the denomination greater than or equal to zero and hence N_eff capped at N.

Are you including the first empirical sum of paired empirical autocorrelations in the calculation?


#14

Usual notation including BDA3 is what you write, but here more specifically
N_eff = N / (1 + 2\sum_[t=1}^T \hat{\rho}_t),
where T is the last lag to include and \hat{\rho}_t is the empirical autocorrelation for lag t.
By definition \rho_0=1.

In the above example we have
\rho_0=1
\hat{\rho}_1 \approx -0.055
\hat{\rho}_2 \approx 0.038
\hat{\rho}_3 \approx -0.007
\hat{\rho}_4 \approx -0.011
\hat{\rho}_5 \approx 0.005

Geyer’s paired sums are
\rho_0+\hat{\rho}_1 \approx 0.945 (positive, include in the sum)
\hat{\rho}_2+\hat{\rho}_3 \approx 0.031 (positive, include in the sum)
\hat{\rho}_4+\hat{\rho}_5 \approx -0.005 (negative, stop, don’t include in the sum)

Thus we include lags 1,2,3
\hat{\rho}_1+\hat{\rho}_2+\hat{\rho}_3 \ approx -0.024

Then
1 + 2\sum_[t=1}^T \hat{\rho}_t \approx 0.953

and
N_eff = N / (1 + 2\sum_[t=1}^T \hat{\rho}_t) \ approx 4198 > 4000

Geyer uses autocovariances instead of autocorrelations, but that’s same up to scaling. Geyer also writes alternative form
-1 + 2\sum_[t=0}^T \hat{\rho}_t
where summing starts from t=0, but then 1 is changed to -1

We have now two different sums

  1. \sum_[t=1}^T \hat{\rho}_t can be negative
  2. \sum_[t=0}^T \hat{\rho}_t is always positive, but can be smaller than 1

Starting the summing from lag 0 has the benefit that it’s enough that we store the paired sum values during the computation and then sum them together. Starting the summing from lag 1 has the benefit that it’s the more commonly used form in articles and textbooks.


#15

Ah, right. The 1 gets pulled out of the sum, breaking up the pairs that are used in the termination criterion. The form

(1 + 2\sum_[t=1}^T \hat{\rho}_t)

is definitely preferred for the reasons that you state.

If we change the code to use the paired-sum termination criterion so that all of the included empirical pair sums are positive then I would be okay reporting N_eff > N.


#16

I’ll double check my changes in chains.hpp and make a pull request in a few days.
I’ll make it to report n_eff without capping.


#17

When calculating standard errors of means, should you use n_eff directly
when n_eff > N?


#18

If we have anti-correlated MCMC, could we get more than one effective draw per iteration?

If I have a Markov chain with two states, emit 0 and emit 1 with:

Pr[1 | 1] = 0.5     Pr[0 | 0] = 0.5
Pr[0 | 1] = 0.5     Pr[1 | 0] = 0.5

then we get independent draws of a coin flip.

But if we have

Pr[1 | 1] = 0.01     Pr[0 | 0] = 0.01
Pr[0 | 1] = 0.99     Pr[1 | 0] = 0.99

then we are anti-correlated. Will this give me more than 1 effective draw per iteration?


#19

Yes. Relevant keywords are super-efficient sampling, antithetic, and over-relaxation. Usually it’s really difficult to get n_eff>N although some toy examples can be found in the literature (too late in Finland, I’ll find some references tomorrow). Mike says that in NUTS “sampling from the trajectory also biases to states away from the initial point”. This is similar to over-relaxation (see, e.g., http://www.cs.toronto.edu/~radford/over.abstract.html) which seems to lead in super-efficient behavior in many cases in Stan.


#20

Thanks. I guess it’s pretty obvious for the totally anti-correlated case. The error goes down as 1/N rather than 1/sqrt(N).

Good point about the bias in NUTS—it’s a subtle end tweak that makes a big difference in efficiency.