N_eff BDA3 vs. Stan

avehtari · December 29, 2017, 7:17pm

Here are experimental results comparing the current N_eff computation to a version using Geyer’s initial positive sequence. For three first examples the true parameter values are known and for 8 schools example true values were estimated using 4 million draws. The reference result for sampling efficiency is obtained by repeating sampling 100-1000 times. Each repeat used 4 chains with total of 4000 draws saved after warm-up.

Normal distribution with diagonal covariance and increasing number of dimensions

data {
  int<lower=0> D;           // number of dimensions
}
parameters {
  vector[D] mu; 
}
model {
  mu ~ normal(0, 1);
}

normal_neff_geyer_ipse
NUTS can be superefficient (antithetic Markov chain), but the current estimate caps N_eff<=N.

Normal distribution with diagonal covariance, constraint to all positive values and increasing number of dimensions. Due to the constraint the posterior distribution is not Gaussian.

data {
  int<lower=0> D;           // number of dimensions
}
parameters {
  vector<lower=0>[D] mu; 
}
model {
  mu ~ normal(0, 1);
}

normalplus_neff_geyer_ipse
Again the current estimate caps N_eff<=N, while true N_eff can be larger than N.

Normal distribution with diagonal values set to 1 and all off-diagonal values set to rho with varying rho, and the number of dimensions D=16 or D=64.

data {
  int<lower=0> D;           // number of dimensions
  real<lower=0,upper=1> s;  // correlation
}
transformed data {
  vector[D] zero = rep_vector(0.0, D);
  cov_matrix[D] S;
  cholesky_factor_cov[D] L;
  for (i in 1:D) {
    for (j in 1:(i-1)) {
      S[i,j]=s;
      S[j,i]=s;
    }
    S[i,i]=1.0;
  }
  L = cholesky_decompose(S);
}
parameters {
  vector[D] mu;
}
model {
  mu ~ multi_normal_cholesky(zero, L);
}

normalrho_neff_geyer_ipse
Upper lines (with higher N_eff’s) are for D=16 and bottom lines (with lower N_eff’s) are for D=32. Again the current estimate caps N_eff<=N, while true N_eff can be larger than N. In addition we see that the current N_eff estimate can also overestimate N_eff. This is likely due to truncating autocorrelation series too early.

FInally 8 schools with non-centered parametrization

data {
  int<lower=0> J;           // number of schools 
  vector[J] y;              // estimated treatment effects
  vector<lower=0>[J] sigma; // s.e. of effect estimates 
}
parameters {
  real mu; 
  real<lower=0> tau;
  vector[J] eta;
}
transformed parameters {
  vector[J] theta;
  theta = mu + tau * eta;
}
model {
  eta ~ normal(0, 1);
  tau ~ normal(0, 5);
  y ~ normal(theta, sigma);
}

8schools_neff_geyer_ipse
The difference between the estimate from repeated simulations and the estimate using Geyer’s initial positive is mostly due to finite number of repeated Stan runs used.

It’s likely that in most practical cases N_eff is rarely much higher than N, but all these results show that N_eff estimated using Geyer’s initial positive sequence is more accurate.

I also did run the same experiments with Geyer’s initial monotone sequence estimator. The results are really close to the results with initial positive sequence estimator (as in my old experiments years ago, too).

andrewgelman · December 30, 2017, 2:15am

This is interesting. We can also look into non-normal distributions. For example if a qoi is Cauchy distributed, then R-hat is useless, but one could still have success with quantile-based coverage methods, or an R-hat-like thing constructed based on MAD_SD.

syclik · January 4, 2018, 5:27am

@avehtari, just as a sanity check, does the interpretation of the more optimistic n_eff make sense when using it for estimating standard error of the mean?

I think I know how to set up a simulation to check, but I won’t get to it until after StanCon. I didn’t see that sort of verification in the thread, but it wouldn’t surprise me if you’ve already done that. (I don’t know how to sample better than true iid, but I believe that the math may work out. I’d just want a little bit of empirical validation before claiming that Stan is able to sample iid gaussian better [per iteration, not time] than an rng.)

avehtari · January 4, 2018, 1:11pm

Yes. In my post N_eff BDA3 vs. Stan - #21 by avehtari the four figures show the simulation check with realized true errors of the mean with dashed line, and as you can see the version using Geyer’s initial positive sequence is estimating that more accurately and yes NUTS can sometimes be super efficient and get lower error than independent draws.

syclik · January 4, 2018, 1:27pm

Thanks! I didn’t realize that’s what the dotted line was. How did you compute the dotted line? I just want to make sure I understand it. (I didn’t get it from the description in the post.)

avehtari · January 4, 2018, 2:14pm

Each simulation is using default Stan setting with 4 chains and 1000 draws per each chain saved. For each simulation I compute the empirical mean for each parameter. For normal, truncated normal and correlated normal I run 100 simulations, and the true mean is known exactly. For 8 schools I run 1000 simulations, and true mean is approximated by 4 million draws from the posterior. I then compute mean square error of empirical means (compared to true mean) and compare these to theoretical expected mean square error of independent draws (which can be computed analytically for normal and is estimated from posterior variance in 8 schools). Ratio of these tells the relative efficiency of NUTS sample. I didn’t post the code because it’s in Matlab, but I can post it if you want to check details.

syclik · January 5, 2018, 2:06pm

I’m starting simple and trying to understand the dashed line for the first example in this post which is just a the normal. I still don’t understand what the dashed line is. Maybe the code would help?

If you ran the same code on iid draws from normals, I’d expect Geyer’s initial positive hover close to 1 (with noise)?

avehtari · January 5, 2018, 8:18pm

Here’s the code (I used MatlabStan as recompiling summarystan binary was easier for me than figuring out what to do in rstan). MatlabStan uses CmdStan which leaves the csv files behind. Dashed line is computed using rmse and in the plotting line it’s squared to get mse and 1./4000 is the analytic result for independent draws.

model='normal.stan';
for di=1:11
    D=2^(di-1);
    for i2=1:100
        dat = struct('D',D);
        fitn = stan('file',model,'data',dat,'sample_file','schools','file_overwrite',true,'verbose',false,'seed', i2, 'iter', 2000);
        fitn.block()
        ss=unix('rm -f foo.csv;/u/77/ave/unix/proj/cmdstan/bin/stansummary_ipse --sig_figs=8 schools-1 schools-2 schools-3 schools-4 --csv_file=foo.csv > /dev/null');
        unix('egrep -v "^#" foo.csv > bar.csv');
        warning off
        q=dataset('File','bar.csv','delimiter',',');
        warning on
        neffss(di,i2)=mean(q.N_Eff(8:end,:)/4000);
        rmse(di,i2)=rmeansqr(q.Mean(8:end,:));
        ss=unix('rm -f foo.csv;/u/77/ave/unix/proj/cmdstan/bin/stansummary_imse --sig_figs=8 schools-1 schools-2 schools-3 schools-4 --csv_file=foo.csv > /dev/null');
        unix('egrep -v "^#" foo.csv > bar.csv');
        warning off
        q=dataset('File','bar.csv','delimiter',',');
        warning on
        mneffss(di,i2)=mean(q.N_Eff(8:end,:)/4000);
        ss=unix('rm -f foo.csv;/u/77/ave/unix/proj/cmdstan/bin/stansummary_old --sig_figs=8 schools-1 schools-2 schools-3 schools-4 --csv_file=foo.csv > /dev/null');
        unix('egrep -v "^#" foo.csv > bar.csv');
        warning off
        q=dataset('File','bar.csv','delimiter',',');
        warning on
        oneffss(di,i2)=mean(q.N_Eff(8:end,:)/4000);
    end
end

h1=plot(mean(oneffss,2));set(h1,'color',clrs(2,:))
hold on
line(xlim,[1 1],'color','k','linestyle',':')
h2=plot(mean(neffss,2));set(h2,'color',clrs(1,:))
h3=plot((1./4000)./(mean(rmse(:,:).^2,2))','k--');
xlim([0 12])
ylabel('$N_{\mathrm{eff}}/N$')
set(gca,'ytick',0:.5:3)
ylim([0 2.6])
set(gca,'xtick',1:11,'xticklabel',2.^(0:10))
xlabel('Dimensions D')
legend([h1 h2 h3],'Current','Geyer''s initial positive','Estimate from repeated simulations','location','southeast')

I generated 4x1000 iid draws, so that the split chain part stays the same. Here are violinplots with Geyer’s initial positive (blue) and with the current (red, with some jitter for exact 4000 values). This gives idea of overall accuracy for n_eff estimates
independent_neff

Bob_Carpenter · January 5, 2018, 9:12pm

I’m assuming the dotted line is the empirical effective error. That is, if you run 1000 simulations and know the true value of the posterior mean, you can get an empirical measurement of the mcmc standard error (mcmc_se). Then given that you also know the posterior standard deviation (sd), you can use the formula

mcmc_se = sd / sqrt(n_eff)

to solve for n_eff given the known sd and empirically calculated mcmc_se.

Is that what’s going on here, Aki?

avehtari · January 5, 2018, 9:19pm

I’m not sure how well the different line types display. Dotted line at 1 is the theoretical performance of independent draws. Dashed line which is close to Geyer’s initial positive etc. is the empirical effective error with 100 or 10000 simulations.

Yes. In normal and truncated normal we know true values and in 8 schools they are estimate with 4 million draws from the posterior.

Yes. For the normal and truncated normal I also use the knowledge of the true posterior variance. For 8 schools I estimate sd.

syclik · January 6, 2018, 3:05pm

Thanks, @Bob_Carpenter! That’s exactly what I was thinking about.

I think I’m looking for a different line. It doesn’t sound like the dashed line is actually the n_eff computed from this formula:

or going the other way:

n_eff = (sd / mcmc_se)^2

I didn’t see that sort of estimate happening in the Matlab code.

avehtari · January 6, 2018, 3:20pm

It’s these two lines

and

True mu=0 and sd=1. And since I plot N_eff/N, there’s also 1./4000. Not the most clear to write it, but I was in a hurry.

syclik · January 6, 2018, 3:26pm

Thanks, @avehtari! It’s been a long time since I read Matlab code. I’ll need a few to decompress that line. (the first line; the second one is clear)

avehtari · January 6, 2018, 3:30pm

No problem.

The first one is calling a function I wrote myself so no wonder it’s meaning is a bit unclear. I didn’t post the code in the first place as I knew it’s not that readable. It would be good to rewrite in R, but I’m still a novice and it takes so much more time for me to write code for experiments like this.

syclik · January 6, 2018, 3:30pm

What’s the rmeansqr function?

avehtari · January 6, 2018, 3:36pm

Computes RMSE, but ignores NANs. Some argiment checking plus

 sqrt(nansum(nansum(a.*a))/sum(~isnan(a(:))))

Not really needed, but it was faster to type (didn’t consider time needed to explain that line!)

syclik · January 18, 2018, 3:13pm

Here’s sort of what I was thinking… just to get a sense of whether the N_eff were truly higher than independent draws through simulation.

For this model, the N_eff is actually less than N, so that pic sense. (~ 366 when looking at 1000 draws)

Can we do something similar with something with N_eff > N? Have you seen any analytic models that have N_eff > N?

avehtari · January 18, 2018, 3:19pm

Yes, that what I was doing, but instead of plotting simulations, I just computed ratio of variances from simulations and analytic results for 4000 draws. You can pick one of the models in N_eff BDA3 vs. Stan - #21 by avehtari which have N_eff>N with some D or rho.

syclik · January 18, 2018, 3:46pm

Ok. I’m convinced.

avehtari · January 22, 2018, 1:17pm

Pull request has been merged.

Do RStan, ShinyStan and PyStan use also computation in chains.hpp?
Ping @bgoodri @jonah @ariddell

Topic		Replies	Views
Stan for Bayesian Hierarchical Models Publicity	6	1265	July 20, 2018
Stan's ESS (and monotone autocorrelations) Algorithms mcmc	5	796	March 13, 2020
Analysis API Developers	59	4420	October 22, 2018
Potential Scale Reduction and chains=1 General	7	724	November 6, 2017
Using truncated negative binom. - large time increase ~100x Modeling	21	1123	July 25, 2019

N_eff BDA3 vs. Stan

Related Topics