Control Variates for Reduction of Sample Variance in Stan

We compute that internally because we’re using reverse mode autodiff and get the derivatives of the log density w.r.t. every subexpression involved in its computation. But we don’t expose it anywhere.

The bigger issue is generated quantities, which is where a large number of the quantities of interest are defined. Those can be defined using random-number generators and there’s no easy way to get gradients of anything there.

Yeah I understand that the gradients that we’d want for control variates won’t always be available. Unfortunately there are going to be lots of situations in Stan where we can’t use these control variate ideas, but also lots of cases where they will be helpful.

I think it’s clear how we can develop an augmentation of CmdStan (assuming that’s where the C++ should go) that uses the diagnostic outputs that are already produced (we think) to generate the control variates and so improve the ability to estimate constrained parameters (relative to just averaging the sampled constrained parameters). In response to previous comments (and modulo slow progress due to current (sudden) demands on our team’s time), we are working up an example of how to do that, complete with options a user could set to avoid calculating 10K parameter estimates via a method that involves 10Kx10K control variates: it does seem unlikely anyone would really want those estimates. As I understand it, for situations where the generated quantities could be part of the parameters, we can advocate doing so. That might not be pretty and we might want to do something better downstream, but it seems like a pragmatic way to go.

What I don’t (but would like to) understand is what use cases exist (and are relevant to large numbers of users) where that would not be possible. I also don’t know whether control variates are applicable in those settings. My hunch is that if the generated quantities can’t be described as parameters (eg because they are realisations of a discrete variable), then you can’t use control variates to improve those estimates. Is that right?

Sorry if I’ve misinterpreted your post @s.maskell, but I’ve included below my thoughts on applications to steer clear of (with direct quotes to avoid misquoting people):

  1. Generated quantities:
     - Expectations of discrete functions of our random variables (performance of the regression will be terrible and will probably make things worse)

  2. I think it’s best to focus on examples where the state space is R^d. If we have constrained variables then we can work directly in the constrained space, but the MSE reduction will probably be less significant and the regularity conditions might not hold. The better alternative would be to work in the unconstrained space, but accessing the gradients for that in Stan is non-trivial.

@LeahSouth: I think we’ve got a bit of cross-talk going on here. As I see it, there isn’t one consistent conclusion that stems from the posts on this thread as to what we can do and how we should do it, not least because they were all written at different times (and because this stuff is hard). However, I think I can see a way to make a lot of it work. My plan was to have a go and report back. That motivated my most recent post. Maybe my query was repeating old ground, but I suppose I’m trying to understand what users do, rather than what Stan allows.

@Bob_Carpenter: From what you’ve said, I guess lots of users consider generated quantities. In any case, if we can estimate parameters better but couldn’t handle generated quantities, am I right to think it would still be useful to (use control variates to) do so?

Preferably one layer lower in stan-dev/stan (that is, not in stan-dev/cmdstan). If it goes in the cmdstan repo, none of the other interfaces will see it.

This is currently a mess in our code: what winds up happening with these things is that people implement them multiple times, once for each interface.

Pulling them up to the C++ level in stan-dev/stan would mean providing functions that all the interfaces could use.

Isn’t there just one control variate per parameter?

What does this mean? Generated quantities is a separate block in Stan that’s executed with double types using pseudo-RNGs, not sampling. Their values are output like parameters, but they’re not deterministic functions of the parameters like the transformed parameters are.

Where what wouldn’t be possible? That was the first sentence in the paragraph.

Maybe it’ll help to be specific. If we want to evaluate \textrm{Pr}[\beta > 0 \mid y] where \beta is a parameter and y the observed data, then we would have something like this:

data {
  int N;
  vector[N] x;  
  vector[N] y;
  int N_tilde;
  vector[N_tilde] x_tilde;
}
parameters {
  real alpha;  real beta;
  real<lower = 0> sigma;
}
model {
  y ~ normal(alpha + beta * x, sigma);
  { alpha, beta, sigma } ~ normal(0, 2);
}
generated quantities {
  int<lower = 0, upper = 1> beta_gt_0 = (beta > 0);  // event probability
  vector[N_tilde] y_tilde = to_vector(normal_rng(alpha + beta * x_tilde, sigma));  // predictions
}

The first generated quantity variable, beta_gt_0, is for the event probability: its posterior mean will be the event probability \textrm{Pr}[\beta > 0 \mid x, y]. The second is for posterior prediction. We are very interested in posterior means for beta_gt_0 and interested in estimates for the various y_tilde[n].

I don’t see how to use control variates to help with either estimate. But they’re very often the focus of estimation, especially in predictive settings.
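
For reference, the baseline these get without control variates is just a plain Monte Carlo average over draws. Here is a minimal Python sketch of that baseline (the draw arrays are hypothetical stand-ins for what CmdStan writes to its output CSV):

import numpy as np

rng = np.random.default_rng(0)
M, N_tilde = 4000, 5

# Stand-ins for posterior draws that would normally come from Stan's output.
beta_draws = rng.normal(0.3, 0.5, size=M)                 # draws of beta
y_tilde_draws = rng.normal(1.0, 1.0, size=(M, N_tilde))   # draws of y_tilde

# Pr[beta > 0 | y] is the Monte Carlo average of the 0/1 indicator draws.
pr_beta_gt_0 = np.mean(beta_draws > 0)

# Posterior predictive means are column averages of the y_tilde draws.
y_tilde_means = y_tilde_draws.mean(axis=0)

print(pr_beta_gt_0, y_tilde_means)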

Our users care about estimates of their parameters in the constrained space. The unconstrained space we use for sampling is just a convenience and nothing users ever see. All the transforms are implicit results of defining constrained variables. For example, the sigma parameter above is constrained so that sigma > 0. The user does not want an estimate of log sigma, they want an estimate of sigma.

Derivatives w.r.t. transformed parameters are available internally but, as @bgoodri notes, not yet exposed anywhere.

Probably. People often want good parameter estimates on the constrained scale. And that includes transformed parameters. Some generated quantities, like beta_gt_0 above, are really just transformed parameters (though they may be discrete), and others involve random number generation.

The existential question is the one @andrewgelman raised, which was when control variates would be useful given that we can’t even diagnose convergence until we have at least an ESS of 400 (100 in each of 4 chains, where 100 seems to be about the lower limit of ESS that we can reliably estimate—this last bit on reliable estimation is hearsay from the reliable source of @avehtari [but don’t blame Aki if I misremembered]).

So the real question is when do we need ESS >> 400? For most of our applied stats problems, an ESS of 400 is fine, as that implies a standard error of 1/20th of a posterior standard deviation.
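
To spell out the arithmetic behind that 1/20th figure (this is just the standard Monte Carlo standard error relation, nothing Stan-specific):

\textrm{MCSE} \approx \frac{\textrm{sd}_{\textrm{post}}}{\sqrt{\textrm{ESS}}}, \qquad \textrm{ESS} = 400 \;\Rightarrow\; \textrm{MCSE} \approx \textrm{sd}_{\textrm{post}} / 20.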

Sorry perhaps I should have been clearer. I was just suggesting that it might be good to focus initially on examples where the state space is already R^d. This would indeed cut out examples where you’re trying to estimate noise, so perhaps it’s too restrictive. As I mentioned earlier, there are ways to work around constraints by either using the gradients wrt the untransformed variables (with worse performance and assuming they’re exposed) or by exposing the gradients wrt the transformed variables.

As you can imagine, there’s no easy way to say strictly when they will and won’t help and by how much. However, I think it’s safe to say the specific control variates mentioned in the document here won’t be that useful prior to convergence (for reasons I mentioned earlier). There’s an alternative approach that might be helpful prior to convergence for smallish dimensions. I don’t think there’s any method that’s going to go very well if you haven’t got any samples near the important regions.

It’s not so easy to carve out a subset of parameters on which to work. I don’t think it’d make sense to release a solution that only worked for unconstrained parameters.

We also deal with constraints/transforms for probability and correlation parameters (lower and upper bounds), simplexes (stick breaking), covariance and correlation matrices (Cholesky factors and constrained Cholesky factors), and so on.

The only derivatives that are currently being exposed are those w.r.t. unconstrained parameters. The derivatives w.r.t. the constrained parameters are available internally as that’s what all the functions in the modeling block interface with.

Don’t worry—I’m not expecting magic.

We aimed correctly, but missed! Thanks for the course correction.

For linear control variates, yes. For quadratic, no: there are quadratically many. So we need an expert user to be able to select which control variates and estimates are produced: I imagine we’d have sensible defaults but the ability to override them if you really did want to estimate all the (constrained) parameters using quadratically many control variates.
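
To make the linear case concrete, here is a minimal sketch (not a proposed implementation) of a first-order zero-variance control variate estimate, assuming we already have draws on the unconstrained scale and the gradient of the log density at each draw; all names here are hypothetical:

import numpy as np

def linear_cv_estimate(f_draws, grads):
    # Each gradient component of the log density has zero posterior expectation
    # (under regularity conditions), so it is a valid control variate:
    # subtracting a fitted linear combination of the gradients leaves the
    # expectation unchanged but reduces the variance of the estimator.
    Z = grads - grads.mean(axis=0)                            # centred regressors
    coef, *_ = np.linalg.lstsq(Z, f_draws - f_draws.mean(), rcond=None)
    return np.mean(f_draws - grads @ coef)

# Toy check with a standard normal target, where grad log pi(theta) = -theta
# and the linear control variate is exact for estimating E[theta].
rng = np.random.default_rng(1)
draws = rng.normal(size=(2000, 3))
grads = -draws
print(linear_cv_estimate(draws[:, 0], grads))   # ~0 to machine precision

For a Gaussian target the linear control variates are exact for the mean, which is why the toy check returns (numerically) zero; in general the gain depends on how well the gradients predict the quantity of interest.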

Yes. Thank you for the patient and clear explanation of generated quantities. I don’t think it would be easy to use control variates to help with those quantities in the short term and suggest we focus (exclusively) on parameters (including those that are constrained and not only those that are unconstrained). I think that’s very doable, even if we only have easy access to the derivatives with respect to the unconstrained parameters.

I think people may think they do but whether they are wise to think that is an interesting question. I suspect a better reason to want to use control variates is to produce repeatable outputs. However, if people want something and we can help them have it, I do think it’s worth giving them what they want, even if it is a bit of a ghost of a benefit.

What does that mean? There’s only one parameter and 1^2 = 1. What’s quadratic in what?

This is a perennial source of contention among the devs and the statisticians. The pedagogical problem we face is that people believe multiple decimal places on posterior means are meaningful, when the uncertainty is dominated by the posterior standard deviation. When users see 5 decimal places of accuracy reliably in the mean, there’s the danger they’ll start believing those decimal places are meaningful. Letting them vary run to run reinforces the uncertainty.

I’m not saying we shouldn’t do this, but just that “what the users want” has never been a compelling argument among the devs in and of itself.

Sorry; should have been clearer. What I called the quadratic approach is what @LeahSouth has been more carefully referring to as the second-order approach. The number of control variates you need is \mathcal{O}(d^2), where d is the dimension of the state, ie the number of control variates is “quadratic” in d. You can approximate with a (small) subset and that will often result in negligible reduction in performance gain: for example, if you think of the variates as being elements in a (square) cross-covariance matrix, you could avoid considering all the off-diagonal terms and only look at those on the diagonal.

Interesting. Thanks for pointing that out. I assume the people that believe the posterior mean to 5 decimal places don’t look at the posterior variance at all.

Going back to this and slightly off-topic (sorry), I’m now pondering how generated quantities relate to the (now perfectly sensible in my view) argument that accuracy on the estimate of the mean isn’t likely to be helpful if the estimation error is small relative to the posterior variance. While it seems to me that we are converging on the view that control variates can’t help with generated quantities, am I right to think that accurate estimates of generated quantities, in some cases (eg \textrm{Pr}[\beta > 0 \mid y] for certain \beta), might be viewed as a sensible rationale for wanting ESS>>400?

Thanks for the clarification. I hadn’t realized that was introducing a quadratic number of control variates. d^2 is definitely too much for high-dimensional models. Also, Stan only implements second-order derivatives for a subset of functions. None of the ODE solvers, etc., have second-order derivatives implemented yet.

Maybe. But you still have the problem that there’s posterior uncertainty in that probability estimate.

It doesn’t matter, but it might help to explain that you don’t need those second-order derivatives for second-order control variates. Second-order control variates need to consider the product of the ith element of the state and the derivative of the log pdf with respect to the jth element of the state, for all i and j.
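
As a rough illustration (again, just a sketch under the same assumptions as the earlier snippet), here is how those first- and second-order control variates could be assembled using only first derivatives; the diag_only flag is the diagonal-only subset mentioned earlier, and the zero-expectation claims hold under the usual regularity conditions:

import numpy as np

def cv_design_matrix(draws, grads, diag_only=False):
    # First-order columns: grad_j log pi, which have zero expectation.
    # Second-order columns: theta_i * grad_j log pi + delta_ij, which also
    # have zero expectation (by integration by parts) under regularity conditions.
    M, d = draws.shape
    first_order = grads
    if diag_only:
        # Cheaper subset: only the d diagonal products theta_j * grad_j + 1.
        second_order = draws * grads + 1.0
    else:
        # All d^2 products theta_i * grad_j, centred by adding delta_ij.
        second_order = (draws[:, :, None] * grads[:, None, :]
                        + np.eye(d)).reshape(M, d * d)
    return np.hstack([first_order, second_order])

The estimator is then the same least-squares recipe as in the linear sketch, just with this wider matrix of control variates; keeping only the diagonal products costs some variance reduction but adds d rather than d^2 columns.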

Good point, but it does imply to me that ESS>>400 can be useful.

The posterior standard deviation doesn’t go down with higher ESS, only the Monte Carlo standard error of the estimate does. ESS = 100 is usually enough, because that makes the standard error 1/10th of the posterior standard deviation. One could go to ESS = 10,000 and get that to be 1/100th instead of 1/10th, but either way the posterior standard deviation dominates the uncertainty.

ESS = 400 derives from wanting to run 4 chains and it being challenging to reliably estimate ESS when it falls below about 100 per chain.

@Bob_Carpenter: Ok, but I actually meant (my apologies for being unclear) that for accurate estimates of some generated quantities, eg \textrm{Pr}[\beta > 0 \mid y] for some \beta, wouldn’t we want ESS >> 100?

That is what I meant. There’s uncertainty in the event probability estimate Pr[beta > 0 | y] arising from y only being a sample (assuming the model’s correct to start with!). You can estimate the probability to many digits, but given that y is random, those extra digits don’t mean anything.

Here’s a simple textbook example. We want to estimate the expected weight of a mouse in a clinical study. We have N = 200 mice with measured weights y[1:200]. We assume y ~ normal(mu, sigma) [perhaps with some priors on mu and sigma] and then estimate Pr[mu > t | y] for some fixed threshold t. The problem isn’t that we can’t estimate Pr[mu > t | y] to high precision, it’s that the estimate varies with our random choice of y. So while we can get a whole lot of precision in Pr[mu > t | y] for a given sample y, that’s not very useful because we expect different results with a different sample y. How different? Well, you can run a bootstrap like for any other estimator and figure that out. In these cases, you don’t want to report much more accuracy than your bootstrap interval supports.
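
For illustration only, here is one way that bootstrap check could look. It swaps in a flat-prior normal approximation for the posterior of mu so it doesn’t have to re-run Stan on every resample, and the data, threshold, and function names are all made up for the sketch:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.normal(25.0, 4.0, size=200)   # stand-in for the 200 measured weights
t = 24.0                              # hypothetical threshold

def pr_mu_gt_t(y_sample, t):
    # Flat-prior normal approximation to the posterior of mu, used so the
    # sketch runs without a full Stan refit on every resample:
    # mu | y approx normal(mean(y), sd(y) / sqrt(N)).
    se = y_sample.std(ddof=1) / np.sqrt(len(y_sample))
    return 1.0 - stats.norm.cdf(t, loc=y_sample.mean(), scale=se)

# How much does Pr[mu > t | y] move when the data themselves are resampled?
boot = [pr_mu_gt_t(rng.choice(y, size=len(y), replace=True), t)
        for _ in range(1000)]
print(np.percentile(boot, [2.5, 97.5]))   # spread due to the randomness of y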

@Bob_Carpenter thanks for explaining this and apologies for being confused and/or confusing.

In your helpful mouse example, if we want to know mu, the mean of the true weights of the 200 mice, would ESS >> 100 help? If we want to make a decision that only relates to those 200 mice (eg whether to feed them), that might be a useful thing to do.

If (as seems likely) we are actually interested in making inferences about generic mice (from observations of the 200) then I’d be tempted to define a hierarchical model that distinguishes the mean of the 200 mice from the mean of generic mice (which we might call mu’). If we then want to know Pr[mu’ > 0 | y], wouldn’t ESS >> 100 help in that model?

Let me start by pointing out that I’m not the final arbiter of what’s useful here.

To me, these cases all seem the same. We can keep reducing standard error, but we’re saddled with the posterior standard deviation. Beyond ESS = 100, when standard error is 1/10th of posterior standard deviation, it’s hard to motivate. At this point, I’m just repeating myself.

Even if we’re trying to estimate Pr[mu’ > 0 | y], the uncertainty in the result is going to be governed by the data sample y, which determines the posterior standard deviation. Another way of looking at this is with a bootstrap—that’d give you the standard error on your expectation estimate.