Improved effective sample size estimation

In the Stan weekly roundups I read that some of the the Stan team members (@andrewgelman, @Bob_Carpenter, betanalpha) are working on improving the estimation of the effective sample size. I wanted to share some papers that describe an accurate method to estimate the effective sample size based on automatically selected time series models, and also offers a theoretical analysis (Cramer-Rao lower bound):

Estimation of the accuracy of mean and variance of correlated data

The described algorithm uses the ARMAsel algorithm that selects amongst an AR,MA and ARMA model (code available on matlabcentral). However, a simplified implementation using only Autoregressive (AR) models may well suffice and only requires the Burg AR estimator, which is available in many packages.

To estimate the effective sample size from multiple chains, I recommend the Burg algorithm for segments, see:

The Burg algorithm for segments

I am curious to hear your thoughts!

Thanks for the link.

We’re just trying to write up what’s already going on in Stan in a medium other than BDA and the Stan manual with some more motivation.

What’s the run time on those AR algorithms and how stable are they? It’s expensive enough to just run the FFTs we’re running now.

The paper is beyond me (at least without serious investment in learning a bunch of things I don’t know, like what a Cramer-Rao bound is). I can’t even tell if one of the methods there is similar to what we’re doing.

We’re very keen to include cross-chain information in these calculations so that we can adjust ESS estimates downward if the chains aren’t very similar in terms of mean and variance.

On a higher level I can summarize the paper as follows:

  1. The effective sample size can be calculated with a simple formula from an estimated parametric model with an automatically selected model order using fast and robust algorithms.
  2. The error in the estimated effective sample size is close to the theoretical lower bound on estimation error (= Cramer Rao lower bound).

This paper compares this method to summing over the sample autocovariance, which is closer to the current algorithm in Stan:

Here’s the algorithm in Python, starting from the sample autocorrelation function acor_sample, as is also used in the current n_eff computation. An AR model is selected using the order selection criterion GIC3 (similar to AIC, but with penalty factor 3):

from statsmodels.tsa.stattools import levinson_durbin, acovf
pmax =[100,np.floor(niter/2)]))
_,_,pacf,res,phi = levinson_durbin(acor_sample,pmax,isacov=True)
res[0] = 1
p = np.arange(pmax+1)
gic3 = niter_tot*np.log(res)+3*p
p_sel = np.argmin(gic3)
ar_sel = phi[:p_sel+1,p_sel]
sigma2_eps_norm = res[p_sel]

alpha = sigma2_eps_norm/(1-ar_sel.sum())**2
n_eff = niter*chains/alpha

Most of the computations are done in levinson_durbin, < 20 lines of Python code. The AR estimation algorithm, also known as the Yule-Walker algorithm, is also available in R.

For completeness, here is the Python code for computation of the autocorrelation starting from the samples s in an niter x chains numpy array:

import numpy as np
sd = s-s.mean()
acov_sample = np.array([acovf(x,demean=False,fft=True) for x in sd.T]).mean(axis=0)
acor_sample = acov_sample/acov_sample[0]

Even though levinson_durbin above is implemented in pure Python, execution time is not too bad: I clocked 0.6 ms for the autocorrelation computation, and 4.4 ms for Levinson-Durbin. I expect a 5-10x speedup when converting to C++. If pressed for computational resources, one may consider reducing the maximum order (100). For example, using 50 instead reduces computation time to 1 ms. Please note that the model order does not correspond to the correlation length; even an AR(1) model can have an arbitrary large correlation length.

What happens with multiple chains? Our approach is taking multiple Markov chains and discounting effective sample size when they don’t match.

What is the relationship between the AR order and the estimation? That is, what led you to choose 100 as the order rather than 50 to begin with?

This is taken care of because the overall mean of all chains is subtracted. If, for example, one of the chains has a different mean, this will automatically translate into a high autocorrelation, leading to a low n_eff estimate.

This number is the maximum considered model order, which is an indication of the model complexity. You can go at least as high as N/4, where N is the chain length. However, this comes at a higher computational cost and is not required for most applications. The model order used for the n_eff computation is determined using a statistical order selection criterion gic3. For example, the mu parameter in the 8 schools example, the automatically selected model order is typically around 10.