Control Variates for Reduction of Sample Variance in Stan

root.pdf (204.5 KB)
Following various discussions, the attached document describes our analysis of the potential utility of control variates in the context of Stan. We believe these results offer compelling evidence of the utility of the approach and recommend that a new component be added to Stan to enable Stan’s users to capitalise on it. While we would welcome the involvement of other Stan developers in any such integration, we are able and willing to devote effort to it ourselves.

Comments welcome!

PS All errors are mine. Anything that is correct should be attributed to Yifan and Phil.

  1. That’s awesome.

  2. Fewer decimal places in Table 1, please! Actually, a graph would be better; see here:


As I said in the meeting today, I can think of 4 application areas where this could be useful.

The background here is that in many applications, the improvement in computing E(theta|y) is not so relevant in practice. For example, suppose you fit a model, follow the current recommendations for mixing, and have an effective sample size of over 100. You might then have a parameter with estimate 2.41, posterior sd 0.85, and an MCSE of 0.07. Maybe the control variates method reduces the MCSE to 0.01. That’s great, but in practice it does not really affect our inferences: either way, the MCSE is much less than the posterior sd.

So, for the control variates method to make a difference, it has to be either a setting where for some reason we want extra accuracy in computing E(theta|y), or where we don’t have a large effective sample size.
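To make the quoted numbers concrete, here is a small R sketch (the draws and the 2.41 / 0.85 figures are purely illustrative, and the MCSE formula assumes roughly independent draws):

```r
# Hypothetical draws standing in for a parameter with posterior mean ~2.41
# and posterior sd ~0.85. With ~150 roughly independent draws, the crude
# MCSE is sd / sqrt(n), which is far below the posterior sd itself.
set.seed(1)
draws <- rnorm(150, mean = 2.41, sd = 0.85)
post_sd <- sd(draws)
mcse <- post_sd / sqrt(length(draws))
c(estimate = mean(draws), post_sd = post_sd, mcse = mcse)
```

Halving or even quartering that MCSE leaves the reported interval essentially unchanged, which is the point above.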

Below are 4 relevant application areas:

  1. Problems where the goal is to compute E(theta|y) rather than simply to summarize the posterior or have posterior draws. Examples could be physics problems and also various problems where we work out E(theta|y) for different settings of the data and model and so we want to minimize noise in understanding these relationships.

  2. Preset models such as in rstanarm and brms, where we know ahead of time how we are going to summarize the posterior, so that all this can be done automatically, hence the cost of this approach is approximately 0.

  3. Again for rstanarm, brms, and other user-facing tools, there could be an advantage in getting more stable point estimates, just because it could bother users that if they run Stan twice, they get different answers. Having variation in the 3rd decimal place, say, rather than in the 2nd, is an improvement in that it makes things look less noisy.

  4. For SHMC (shitty HMC) we are interested in inference before full mixing. I’m guessing that the control variates approach will give us much better and more stable estimates, compared to simple mean of the draws, in these pre-convergence settings.



  1. It looks like the only gradients you need are those of lp with respect to the unconstrained parameters. Since generated quantities are a function of the unconstrained parameters, they should fit in this framework fine (shouldn’t need any extra gradients)?

  2. Does the derivation in Eq. actually work for constrained variables? Or does it depend on the unconstrained scale? I worked through this once and decided it works on the unconstrained scale because there you actually have infinities, and you need the integral of p to be finite, so you must have p(-inf) = p(inf) = 0. For a constrained variable, that integral should be over the support of the random variable, in which case a beta distribution might have different values of p(0) and p(1).

Like here’s some sample code:

a = 1
b = 5
ddbeta = function(x) {
  # central finite difference of the log density, clamped to [0, 1]
  dx = 1e-8
  lx = pmax(0, x - dx)
  rx = pmin(1, x + dx)
  (dbeta(rx, a, b, log = TRUE) - dbeta(lx, a, b, log = TRUE)) / (rx - lx)
}
mean(ddbeta(rbeta(1000, a, b)))

dbeta(1, a, b) - dbeta(0, a, b)

With output:

> mean(ddbeta(rbeta(1000, a, b)))
[1] -5.084931
> dbeta(1, a, b) - dbeta(0, a, b)
[1] -5
  1. “The exception is “irt 2pl”, for which D = 144: 2500 samples are insufficient to estimate a non-singular instance of the 144×144 element covariance matrix”. There’s got to be more going on there.

@bbbales2: Q1: I’ll add a revised note on generated quantities (and address Andrew’s issues with Table 1), but I think you need the Jacobian of the transformation related to the generated quantities: since E[f(x)] \neq f(E[x]), you can’t use a good estimate of E[x] to get a good estimate of E[f(x)].

@bbbales2: Q2: yes. The physics community where this comes from assumes that the variables are unconstrained, but you are right that it’s fairly easy to see how to correct for the “bias”.

@bbbales2: Q3: We didn’t do anything complicated to get around the fact that we couldn’t calculate a non-singular covariance matrix. I don’t follow why you think there might be more going on. Can you explain?

Oooh gotcha

Well, if we have constrained variables, do we have to estimate that bias? Or can we get that from the unnormalized pdf?

I just had a look at the model and I’m confused as well.

There aren’t any unusual constraints on anything getting printed there. What I expected to see was a transformed parameter that was always zero or something.

It seems like the only way 2500 samples gives a 144x144 covariance matrix that isn’t full rank would be if the sampler froze up? But the way NUTS is written it doesn’t do that much. That’s really confusing to me. Maybe look at subsets of the covariance matrix to identify if it’s a piece of it that’s causing the rank issue, and then look at the timeseries.
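One way to act on the subset suggestion (a sketch; the draws matrix and the tolerance are illustrative, not taken from the actual run) is to look at the eigenvalues of the sample covariance and count how many are effectively zero:

```r
# With 2500 draws of 144 parameters, the sample covariance should be full
# rank unless some column is (nearly) a linear combination of the others,
# e.g. because the sampler froze on part of the parameter vector.
set.seed(1)
draws <- matrix(rnorm(2500 * 144), nrow = 2500)  # stand-in for the real draws
S <- cov(draws)
eig <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
n_null <- sum(eig < 1e-10 * max(eig))  # count of (near-)zero eigenvalues
n_null
```

A positive count flags rank-deficient directions; inspecting the corresponding eigenvectors would point at the offending parameters, whose traceplots could then be checked.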

Thank you for the write up!

In table 1, which parameters are you computing estimates for? For example, in the case of the “eight_school”, you report one value, which suggests you estimated the posterior mean of one parameter. There are however more parameters to examine (10 if I recall correctly).

Equation 14 looks fine and I think we’re looking at the right metric here. Per the second paper @andrewgelman lists, on \hat R and rank normalization, there are more sophisticated ways of calculating the variance of an estimator, or Monte Carlo error. I recommend using the R package posterior, on the Markov chains built for x^l, x^q, and x^\emptyset and looking at the estimated Monte Carlo error. I expect it will be consistent with your results.
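For a single chain, the quantity that posterior::mcse_mean reports can be sketched by hand (a simplification: the package divides by an ESS estimate rather than the raw draw count, so this matches only for roughly independent draws):

```r
# Crude Monte Carlo standard error of a posterior mean estimate,
# assuming approximately independent draws.
set.seed(1)
x <- rnorm(4000)                 # stand-in for draws of x^l, x^q, or x^0
mcse <- sd(x) / sqrt(length(x))  # posterior::mcse_mean(x) uses ESS instead of n
mcse
```

Comparing this quantity across the three chains is the consistency check being suggested.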

Suppose we want to generate \tilde y, based on observations y. My intuition is you need to compute

z(\tilde y) = \frac{d}{d \tilde y} \log p(\tilde y | y)

So it’s a different pdf than p(\theta | y).

@bbbales2: I think we can evaluate the bias exactly but need to check the maths. The matrix that I am referring to, and that the quadratic control variates need, has (about) 144×144 rows and 144×144 columns, not 144×144 elements (apologies for getting this wrong in my note). That’s why we run into issues with 2500 samples.

@charlesm93: the way we calculate the variance is basically the average over the parameters. We will happily calculate other metrics, but I don’t see why that would change the perception of the utility of the approach. Given that, I think I will park it under nice-to-do-later (unless you can explain why I shouldn’t)!

@charlesm93: that pdf is (I think) just the pdf for theta multiplied by the Jacobian of the function that maps from theta to tilde y. Can we get that Jacobian?

Oooo, that makes sense then lol.

@bbbales2: we could approximate the relevant covariance matrix (eg as diagonal or block diagonal etc) and see how well that works but better ways of progressing in such settings will be described in papers by the control variate community. I will ask around.
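As a minimal sketch of the diagonal approximation mentioned above (illustrative data only, not the actual control-variate machinery): keep just the diagonal of the sample covariance, which stays invertible whenever every variance is nonzero:

```r
# Diagonal approximation of a sample covariance matrix.
set.seed(1)
draws <- matrix(rnorm(2500 * 144), nrow = 2500)  # stand-in draws
S_full <- cov(draws)
S_diag <- diag(diag(S_full))   # keep only the variances
S_inv  <- solve(S_diag)        # trivially invertible, unlike a singular S_full
max(abs(S_inv %*% S_diag - diag(144)))  # should be ~0
```

A block-diagonal variant would follow the same pattern, inverting one block per group of correlated parameters.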

Hmm, I basically don’t know the control variate stuff haha. Just posting comments from reading in the meeting – I don’t really know what’s a good idea or not.

This was something @Bob_Carpenter was digging into before and so I’d expect he’d have good feedback, especially with respect to what we need to build to generalize this to whatever expectations someone wants to compute.

Also @avehtari will probably have good feedback when he goes through this.

Fair enough. The utility of these methods is that they are sensitive to potential issues with Markov chains, such as non-convergence. But given the model you’re looking at, it shouldn’t be an issue.

That’s true when \tilde y is a deterministic function of \theta. When \tilde y is a random variable with a distribution that depends on \theta, the Jacobian you want is ill-defined. You need to differentiate \pi(\tilde y | y) = \int \pi(\tilde y | \theta) \pi(\theta | y) d\theta. I’m not sure there is a slick way of doing this.

Cool. Thanks for writing this up so precisely and doing the evals on a small set of models we can understand.

It’s not a smooth function with a Jacobian. With generated quantities, we draw \theta^{(m)} \sim p(\theta \mid y) and then \tilde{y}^{(m)} \sim p(\tilde{y} \mid \theta) by \tilde{y}^{(m)} = \textrm{rng}_p(\theta). We don’t know how to autodiff through rngs.

If we were instead to code \tilde{y} as a parameter, it’d just be a joint posterior p(\tilde{y}, \theta \mid y) and we’d have all the derivatives we need.

I don’t see an easy solution for generated quantities.

I was just trying to catch up. Which is why I was working through the math on the board in the playroom with Andrew.

What I was saying is that the optimal control variate depends on the function being estimated. @s.maskell — was there a different control variate used for each parameter?

Did you do something simple like condition the resulting covariance matrix estimate?

That’s for estimating the error given the draws. Here we have a collection of estimates, so it’s a different problem.

Would it be possible to do a histogram per model instead? Or at least show the 5% and 95% or min and max? What we find in a lot of models is that we get low effective sample size for hierarchical parameters and high effective sample size for the lower level parameters.

I’ll post again with direct comments on the paper.

I don’t even see an easy solution for transformed parameters. It seems like we need the partial derivative of target with respect to the things declared in the transformed parameters block, and for that we would need to walk through part of the autodiff tree rather than all the way back to the parameters or the unconstrained parameters. I don’t think we currently expose anything that does that.

The transformed parameters are implemented as autodiff variables, so we’d have their adjoints w.r.t. the log density. Is that what we’d need? As soon as we start talking transforms, I get all turned around!

I think we just need to expose a way to get the intermediate gradient calculations. If

parameters {
  real mu;
  real<lower = 0> sigma;
  real theta_raw;
}
transformed parameters {
  real theta = mu + sigma * theta_raw;
}
model {
  target += std_normal_lpdf(theta_raw);
  target += ... ; // likelihood involving theta
}

we can call the gradient function to get the derivatives with respect to the three things in parameters (or their unconstrained parents). And that results from differentiating target wrt theta and then chain-ruling. But we don’t have a function exposed to get the derivative with respect to theta alone.
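The chain rule being described can be checked numerically in R with a toy target (the normal_lpdf(y | theta, 1) likelihood term here is a hypothetical stand-in for whatever the model’s likelihood is):

```r
# theta = mu + sigma * theta_raw
# target = std_normal_lpdf(theta_raw) + normal_lpdf(y | theta, 1)
mu <- 0.3; sigma <- 1.7; theta_raw <- -0.4; y <- 1.0
theta <- mu + sigma * theta_raw

d_target_d_theta     <- y - theta  # the "intermediate" gradient w.r.t. theta
d_target_d_theta_raw <- -theta_raw + sigma * d_target_d_theta  # chain rule

# What the exposed gradient function gives: d target / d theta_raw only.
target <- function(tr) {
  dnorm(tr, log = TRUE) + dnorm(y, mu + sigma * tr, 1, log = TRUE)
}
h <- 1e-6
fd <- (target(theta_raw + h) - target(theta_raw - h)) / (2 * h)
c(analytic = d_target_d_theta_raw, finite_diff = fd)
```

The finite difference recovers only the chained-through derivative; the d_target_d_theta factor is the intermediate quantity that is not currently exposed.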

We only have the log Jacobians for the inverse transforms (the ones that go from unconstrained to constrained). Those are all implemented in stan/math/prim/fun.

In (14), aside from summing, should we be using the sample mean or the actual mean for evaluating?

And it looks like you’re reporting average variance, but then labeling it \sigma^E, which is confusing: these aren’t standard errors but variances. That would bring the results more in line with what I was expecting from reading the control variate literature. It is neat that the low-dimensional Gaussian really does work so well, given that’s how the method’s designed.

The calculations can be costly relative to sampling in high dimensions as the covariance matrix gets big. But then I see the topic of diagonal or low-rank approximations has already come up.

And I still don’t see why IRT 2PL didn’t have positive definite sample covariances. It really is only 144 parameters according to the table. It does have banana-shaped posterior components because there are weak multiplicative non-identifiabilities (in the ability and difficulty w.r.t. the discrimination parameter). There’s a good and a bad way to parameterize that model and I’m not sure which one Michael went with for testing. The difficult term is a[i] * (b[j] - c[i]); the better version takes a[i] * b[j] and lets ac[i] be a parameter instead of having an a[i] * c[i] term, which is really bad.

The vector of derivatives is the same for each parameter to be estimated, but the coefficients differ: that comes out of how we define the value of what was called \alpha in the note.

No, we didn’t, but we could and should try that. I don’t know which approximations will work best though we could find out.

We certainly could, but I’m keen to decouple interesting things to understand from those things that will alter our collective perception that this is useful. I’m not currently clear what we need to do and what would be interesting to do. I’m happy to do both eventually, but want to prioritise the things that people feel need to be done first.

@charlesm93 and @Bob_Carpenter: Thanks for explaining how generated quantities work. It seems to me that we could use a suitably defined joint posterior, but I’m far less familiar with Stan than you and would happily be corrected.

Sorry. My bad. What notation would you have preferred us to have used?


The model does “only” have 144 parameters, but for the quadratic control variates you calculate quantities that relate to pairs of parameters. You then need to work out the covariance of those quantities. Does that make more sense?
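To put rough numbers on that (a sketch; the exact count of pairwise quantities depends on the construction in the note):

```r
# D parameters -> roughly one quantity per ordered pair of parameters for the
# quadratic control variates, and a covariance over all of those quantities.
D <- 144
n_quantities  <- D^2             # (about) D^2 pairwise quantities
n_cov_entries <- n_quantities^2  # entries in their covariance matrix
c(quantities = n_quantities, cov_entries = n_cov_entries)
```

With ~20k quantities, 2500 samples cannot give a non-singular sample covariance, which is consistent with the issue reported for irt 2pl.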