Control Variates for Reduction of Sample Variance in Stan

Thanks Simon and co for putting this together!

Sorry for being late to join the discussion. I’ve been working on control variates recently so I have a couple of thoughts and I’m also hoping to contribute other control variate methods to Stan.

Regarding the setting Andrew mentioned where we haven’t yet reached convergence:
The choice of coefficient that’s being implemented here is optimal in the setting when you have IID samples from the posterior. If you’ve achieved convergence and have reasonable mixing then it can be very useful in practice. However, if you haven’t reached convergence yet then I wouldn’t necessarily trust it to be more stable. There are alternative approaches that should be more stable though (see item 2 in the list below).

Regarding the setting where the dimension is high relative to the number of samples:
We have a paper looking at this setting (https://arxiv.org/abs/1811.05073). Essentially, the current coefficients of alpha = cov/var are the ordinary least squares estimates from a linear regression. We’ve looked into trying ridge regression and lasso instead to help in this setting. We’ve also looked at how we can do control variates when only a subset of the derivatives are available.
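For concreteness, here's a minimal sketch of that regression view in NumPy, assuming we already have the function values and the gradients of the log target at each draw (the names and structure are illustrative, not taken from the actual testing framework):

import numpy as np

def first_order_cv_estimate(f_vals, grads):
    """First-order (linear) control variate estimate of E[f].

    f_vals : (N,) values of the function of interest at each posterior draw
    grads  : (N, d) gradients of the log target at each draw
    The coefficients are the OLS estimates from regressing f on the gradients,
    which reduces to the cov/var choice when d = 1.
    """
    N = f_vals.shape[0]
    Z = np.column_stack([np.ones(N), grads])
    coef, *_ = np.linalg.lstsq(Z, f_vals, rcond=None)
    # Subtract the fitted linear combination of gradients, which has zero
    # expectation under the target given suitable regularity conditions.
    return np.mean(f_vals - grads @ coef[1:])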

Some additions that I would be keen to add if people are interested:

  1. The regularised approach which is helpful for higher dimensions
  2. A non-parametric alternative called control functionals which offers better convergence rates and perhaps more stability under non-convergence
  3. A new method we developed that generalises existing ones (https://arxiv.org/abs/2002.00033)
2 Likes

This is something I’ve been thinking about, but with limited time I never got around to doing anything about it, so it’s really cool that you have done this and have a framework to test it with many, many models.

I agree, and I’d add that with a small sample size near 100 it would be useful to get more accurate tail quantiles or probabilities, but tail quantiles are not direct expectations and probabilities are expectations of a non-differentiable step function.

Based on the results, additional constraints are that the posterior is not too high-dimensional and is relatively close to Gaussian. Often we see slow mixing and small effective sample size when the posterior is far from Gaussian, e.g. multimodal or banana-shaped. It would be useful to have more examples of that kind. The best setting would probably be a posterior which is not far from Gaussian, with good mixing, but with very expensive log density evaluations, e.g. ODE models. If the posterior is relatively close to Gaussian there would be possibilities to estimate tail properties with some smooth functions of x.

I agree with 1-3, but regarding (4): in the pre-convergence setting the expectation of the control variate is not 0.

So in the linear case we need more draws than the number of parameters? Or a higher effective sample size than the number of parameters? This would limit the usefulness to low-dimensional cases with very expensive log density evaluations, unless a low-rank or structured covariance matrix is used?

This is also one restriction I had in my mind.

Oh, when I started writing all above I didn’t notice @LeahSouth’s excellent reply, which provides additional ideas which might help!

1 Like

@avehtari: we have results for 100s of stan files but the stated exam question was about using an insignificant computational time to achieve variance reduction in the context of the models we pre-agreed we should consider. We chose to report results for two specific methods to highlight that different methods give rise to different benefits and different constraints. As illustrated by @LeahSouth, there are many more than two methods out there. My understanding is that the process of changing Stan’s functionality (understanding which is actually what motivates my current focus on this) demands we answer the question of whether or not control variates are worth accommodating in Stan.

To be slightly provocative: does anyone not see compelling evidence that we need control variates in Stan?

Regarding the difficulty mentioned by Aki of doing this in high dimensions: In many cases I think it could be fine to just do the linear and not the quadratic control variates. This would remove the K^2 cost.

I agree that the first order approach could be the best in some settings, e.g. where N is very small relative to dimension. However, the second order approach can offer huge improvements in a lot of settings, e.g. it will lead to exactness for quadratic functions rather than just linear functions under Gaussian targets with iid samples. I think it’s worth having an implementation of second order polynomials for higher dimensions for these reasons and because the additional cost of performing the second order approach will be small relative to the overall cost for many examples like in ODE models.

The main difference between the first and second order approaches is that the “covariates” in the first order polynomial are just the gradients of the log target with respect to each dimension, whereas the covariates in the second order polynomial also use the samples themselves.

I normally think of the cost as being K^3 + NK^2, where K is the number of covariates (d+1 in the first order approach), though I guess it’ll be lower with a good solver. There are other approaches with better than d^3 computational complexity, but their statistical efficiency often scales worse with dimension.
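To make the covariate counting concrete, here is a rough sketch of one common way to build the second-order covariate matrix from the draws and gradients (conventions differ slightly between papers, and the names here are illustrative):

import numpy as np

def second_order_covariates(x, grads):
    """Covariate matrix for second-order polynomial control variates.

    x     : (N, d) posterior draws
    grads : (N, d) gradients of the log target at the draws
    Returns an (N, K) matrix with K = (d + 2)(d + 1)/2 columns: an intercept,
    the d gradient components, and one column per pair (j, k) with j <= k that
    combines samples and gradients (constant terms are absorbed by the
    intercept in the regression).
    """
    N, d = x.shape
    cols = [np.ones(N)]
    cols.extend(grads[:, j] for j in range(d))          # first-order terms
    for j in range(d):
        for k in range(j, d):
            cols.append(x[:, j] * grads[:, k] + x[:, k] * grads[:, j])
    return np.column_stack(cols)

# Fitting the coefficients by least squares then costs roughly O(N K^2 + K^3):
# Z = second_order_covariates(x, grads)
# coef, *_ = np.linalg.lstsq(Z, f_vals, rcond=None)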

1 Like

Leah: Yes, this makes sense. I guess if it’s to be implemented in some general or automatic way, we’d like a setting to turn it off if it might be too expensive, or maybe a setting for quadratic, a setting for just linear, and an off setting?

1 Like

Andrew: This sounds like a good idea to me. I like the idea of being able to choose between running either approach, both or neither.

@s.maskell Here’s my read, take it with a grain of salt. It sounds like participants in this thread are mostly on board with implementing it. We need to work out the details, the scope and applications of the method, in part to document when users might want to use it – and also for our own sake. I have a few models in mind where I would like to try it!

(actually, one of them was the Ising model, but no luck there, since the gradients are not defined… but that’s my own non-Stan-specific work)

@charlesm93: Thank you. That’s very helpful. I also sense people are enthused and that’s really great. My understanding is that the process we should follow (or explicitly deviate from, I guess) is that defined here: https://github.com/stan-dev/stan/wiki/Proposing-Algorithms-for-Inclusion-Into-Stan. Are we now happy to migrate from collating the evidence to the design phase? If so, does a “design proposal” exist for something else that we can use as a basis for development and/or how do we identify contributors to that design proposal?

To clarify, while I’m also interested in the specific algorithm (and would love, in slower time, to follow up on some of the interesting queries that people on this thread, including you, have identified), my core focus here has to be on understanding how to navigate the process of augmenting Stan (because I have significant effort focused on other potential augmentations where I anticipate the decision making will be harder). I’m therefore keen to decouple activity that isn’t core to augmenting Stan from the things that we need to do to get control variates in there.

I’m not certain why you tagged me, as there was no question for me (or was there?). It seems we agree on most things and I’m excited about the possibilities.

I’m not in any role for making decisions, but I think it’s good to follow that process and keep doing what was planned before:

  • make the testing framework publicly available so that anyone can test with their favorite posterior without needing to ask you to test yet another case
  • publish results for more posteriors with additional information when this works well and when not

It’s great that the algorithms can be tested without changes to Stan. I’m convinced that it can reduce the computation time needed to reach the desired accuracy. After seeing these results, and seeing that there can be some constraints or limitations to generic use, I would add to the plan:

  • Demonstrate at least one practical example of how this can be used for the whole inference. Not just E[x] where x is in the unconstrained space, but other summary statistics of the posterior and posterior predictive distribution (e.g. arbitrary quantiles). It seems there were many questions related to what is required from the user or from code to compute some things using autodiff. Seeing an example of a complete workflow, with the points where control variates can help and the additional information they require, would help answer these questions and clarify the design needs for the software (including Stan core and interfaces).

It seems this would be feasible with the current testing framework you have? I think it would be useful to list, based on the discussion in this thread, the parts needed to support the usual complete analysis workflow which can’t be done with your current PyStan-based approach and would either require additional user-defined components or additional code inside Stan.

An additional thought for gaining more understanding of when this is useful would be a measure of how close to Gaussian a posterior is and how it deviates from Gaussian. This would be useful in other contexts, too, as it would then be possible to see in which cases the benefits are biggest and when they are not.

I feel like having stored gradients of the log target for each of the samples would be the main thing required for efficient implementation. @s.maskell, would you agree?

If the log target and its gradients are specified in terms of an unconstrained space then that’s probably preferable. You can use these control variates directly in the constrained space but I typically find the MSE reduction is less significant and also the regularity conditions required to have zero expectation of the control variates seem to hold less often in the constrained setting.

Regarding practical examples besides E[x]:
This approach can only be used for expectations of functions h under the target pi. The improvement depends on how smooth pi and h are. Posterior (co)variances would be easy to look at for all examples but there are lots of interesting example-specific functions too. Unfortunately you can’t use this approach directly for quantiles and it won’t work very well for estimating tail probabilities because of the non-smooth h.

1 Like

The samples and gradients for the unconstrained parameters are available via the diagnostic_file option in cmdstan or the save_extra_diagnostics argument in cmdstanr (and probably cmdstanpy has this too). This does not include transformed parameters.

Right now it’s awkward to convert things in the unconstrained space to constrained parameters. I think Rstan and Pystan have utilities for that.

1 Like

During sampling the gradients are computed at each leapfrog step, and each iteration usually makes 5-1000 leapfrog steps. Re-computing the gradients at the stored draws after sampling therefore costs only a small fraction of the gradient work already done, so storing the gradients for all iterations would give at most a 0.1%-20% speed-up. I don’t think it’s worth the additional memory, disk space usage, and I/O.

I understand this, and that’s why I’m hoping to see a practical case example with interesting example-specific functions. I’m certain there are such functions, but I don’t know what they are.

Thanks @bbbales2 and @avehtari! Apologies for my lack of understanding on how Stan works.

That makes total sense that it’s not worth storing the gradients, given that each iteration already involves many gradient computations (one per leapfrog step), so re-computing them at the draws afterwards is relatively cheap.

The best example-specific function I can think of at the moment is for logistic regression: estimating the probability of y=1 for a new X* in logistic regression.
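For concreteness (notation just for illustration): with a new covariate vector x_* that includes the intercept term, the quantity of interest is the smooth posterior expectation Pr[y_* = 1 | x_*, data] = E[logit^{-1}(x_*^T beta)], so the control variates would be applied to h(beta) = logit^{-1}(x_*^T beta) evaluated at each posterior draw of beta.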

1 Like

If every RNG corresponded to a differentiable log density, then yes, it could be done in theory. Our current C++ code generation isn’t up to the task, though.

There are also discrete quantities getting generated here, primarily for categorical predictions and indicator functions for event probabilities. This is what Aki’s getting at with step functions for probabilities, like \textrm{Pr}[\theta > 0.025], which will have a generated quantity that’s declared as an integer.

It’s just that a lot of the parameters mix very well, but we wind up waiting because of the slow one. So min is better than median or mean if you want a single number out.

For the plots and reports of improvement, standard error is much easier to interpret than variance because it’s on the same scale as the quantity of interest. And it’s what Mira used in the paper you referenced. But then reducing standard error requires quadratic numbers of iterations whereas reducing variance takes linear numbers of iterations, so it’s more on the scale of how much longer you’d need to run Stan. But in the end, what we really care about most of the time is getting to an effective sample size of roughly 100 per chain reliably, so that’s standard errors around 1/10th of the size of the posterior standard deviation.
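In symbols, the Monte Carlo standard error of a posterior mean is roughly MCSE ≈ posterior sd / sqrt(ESS), so an ESS of 100 gives a standard error around a tenth of the posterior standard deviation; halving the standard error then takes roughly four times as many iterations, while halving the variance takes roughly twice as many.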

Yup. Got it. Just for reference, most of our models of interest have more than 144 parameters, so all these test cases would be considered fairly low dimensional. It would be nice to see if this works when there are 10,000 parameters, which is not unusual in something like an IRT model or a gene expression model or a social science model with lots of interactions.

Thanks for joining in and thanks for the relevant references.

Is the main problem the one Aki brings up that the expectation of the control variates won’t be 0 until convergence?

That’s for the parameter-specific coefficient that goes into this particular kind of control variate? We’ll be within 10 or 20% of that long before any of our convergence diagnostics are satisfied. HMC gets to the typical set very fast. We don’t trust anything before convergence, but if it helped get reasonable answers faster, that may be the biggest win for it as @andrewgelman was saying. Usually we don’t need the extra decimal place of precision you get from moving from an ESS of 400 to an ESS of 40,000.

What’s the multiplier? We’d like to be able to get away with a few hundred draws for models with a thousand parameters because that’s all we need for most of the inference we do.

Keeping in mind that we don’t need high decimal-place precision, can it work with a few hundred or at most a few thousand iterations?

I’ve been asking about that, too.

I think everyone’s on board with this. It’s not 100% clear that they’re going to be super useful for the reasons Andrew brings up and because we don’t know how to help with generated quantities (where 100% of our predictive inference happens) and transformed parameters (less of an issue, I think).

Also, can we handle transformed parameters by just transforming the control variates, too?

Are there specific ODE models people have in mind here? Most of the ones I’ve been involved with have patient-specific effects that lead to hundreds or thousands of parameters. Not too many parameters per ODE, because each ODE solve happens for a single patient w.r.t. their parameters.

Our dense metric adaptation, if turned on, does cubic work in the number of parameters. Isn’t the first-order case already quadratic in computation? We just can’t make anything that’s worse than linear in the parameters a baseline setting anywhere or we’ll wind up with models that don’t run.

I think it’s getting there, but would emphasize Aki’s point about demonstrating a useful end-to-end example.

The core part of the design has to be about how we deal with only getting expectations rather than draws out and how we report things that way. We currently provide posterior mean, posterior median, estimated std error, posterior std deviation, and whatever quantiles the user wants. It sounds like this will give us a more precise mean for parameters (and maybe transformed parameters) but not generated quantities. So it could be a drop-in replacement for what we have now in summary.

One thing that’s going to get weird is if I have a parameter alpha and a generated quantities block that just copies it:

parameters {
  real alpha;
  ...
}
generated quantities {
  real alpha_copy = alpha;
  ...
}

Our posterior summary will look different for alpha and alpha_copy.

There’s a design-docs repo in stan-dev where proposals like this can be made or you can start a draft for feedback.

If the proposal’s to do this all without touching the core services in a way that it’s either a post-process like standalone generated quantities or at least optional, then it’s going to go down a whole lot easier.

The balance here is that the better the feature is in terms of utility, the more pain people will be willing to put up with to have it.

Yes, and that’s already there for diagnostic purposes. It’s stored gradients on the unconstrained scale, though, and our results are all reported on the constrained scale w.r.t. which the user writes the model.

The Stan interfaces only report results on the constrained scale. Many of our users are probably not aware there’s an unconstrained scale we transform to in order to simplify our algorithms. So we only really care about results back on the constrained scale. Many of our parameters, like regression coefficients, are unconstrained, but many are things like covariance matrices or simplexes or even probabilities or scales.

write_array in the model class does that, but it outputs them all as a single sequence, not with structure.

That’s a really good point. We only need the gradients at the draws, not all the intermediate points. Maybe there’d be an easy way to just store that?

Ok. We will report min, median and mean next time.

Thanks. That’s particularly helpful.

We did think of that. There are test cases in the list of 100s we’ve looked at which are higher dimensional (though I don’t know if any have 10,000 parameters). Maybe we should prioritise migrating some of those into the curated set.

I’m pretty certain that we can do this.

Hooray!

Yes. That’s what we are working up now.

Sorry in advance for the long message. To preface my comments below, I realise that I keep talking about new methods which are probably beyond the scope of what we want to do right now. I think there are solutions to many of the potential issues that are being raised, but there’s not really a single solution that fixes each of them. I feel like it might be a good idea to start with first and potentially second order approaches as an option for people to use and then I’d like to try adding other approaches that help for specific cases later. In other words, I agree with @s.maskell.

To me the issue is:
Our estimates are never going to be unbiased (because it’s MCMC) but control variates may have better MSE than the vanilla approach if the coefficients of the control variates are chosen well. The difficulty in the pre-convergence setting is that the coefficients won’t be chosen well for this type of control variate because they’re designed for iid samples from pi. The possibly more stable approach that I mentioned (control functionals) is based on minimising an RKHS norm and it doesn’t rely on the samples used to estimate the control variates being drawn from pi. It has its own downsides though, like typically worse performance in high dimensions.

The zero expectation assumption is that the control variates have zero expectation wrt pi so it doesn’t depend on how the samples are obtained. If it holds then you get unbiasedness under iid samples (ignoring another condition that you can easily enforce). In our setting with a single set of MCMC samples, the interpretation is slightly less clear because you’ve got bias anyway. I guess the intuition is that if the zero expectation condition doesn’t hold then you may be introducing more bias than what you had with the vanilla estimate.

Yeah, sorry I’ve been a bit unclear when talking about coefficients but that’s what I mean.

Without using any kind of regularisation, I guess the ratio has to be >1 (i.e. N > d). It’s hard to say in general. I typically prefer N to be an order of magnitude larger than d, but that’s definitely not a requirement in order to get improvements.

More on this and alternatives below…

Yes, it should be feasible to run control functionals for a few hundred or a few thousand iterations. The main cost is inverting an N by N matrix.
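As a very rough sketch of where that N by N cost comes from, assuming an RBF base kernel and the simplified form of the estimator (the kernel choice, lengthscale, and regularisation below are placeholder assumptions, not a recommended configuration):

import numpy as np

def stein_kernel_matrix(x, score, lengthscale=1.0):
    """K0[i, j]: Stein (Langevin) kernel between draws i and j, built from an
    RBF base kernel k(x, y) = exp(-||x - y||^2 / (2 * lengthscale^2))."""
    n, d = x.shape
    diff = x[:, None, :] - x[None, :, :]                    # (n, n, d)
    sq = np.sum(diff ** 2, axis=-1)                         # (n, n)
    k = np.exp(-sq / (2 * lengthscale ** 2))                # base kernel
    grad_x_k = -diff / lengthscale ** 2 * k[..., None]      # d k / d x
    grad_y_k = -grad_x_k                                    # d k / d y
    trace_term = (d / lengthscale ** 2 - sq / lengthscale ** 4) * k
    return (trace_term
            + np.einsum('ijd,jd->ij', grad_x_k, score)      # grad_x k . score(y)
            + np.einsum('ijd,id->ij', grad_y_k, score)      # grad_y k . score(x)
            + k * (score @ score.T))                        # k * score(x).score(y)

def control_functional_mean(x, score, f_vals, lengthscale=1.0, reg=1e-3):
    """Simplified control functional estimate of E[f]; the dominant cost is the
    solve with the N x N matrix K0 + reg * I."""
    n = x.shape[0]
    K0 = stein_kernel_matrix(x, score, lengthscale) + reg * np.eye(n)
    ones = np.ones(n)
    return ones @ np.linalg.solve(K0, f_vals) / (ones @ np.linalg.solve(K0, ones))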

I agree, we can apply control variates using the samples and the gradients of the log posterior in the constrained space. It’s just that the improvements might not be great and we’re more likely to be introducing bias through the zero expectation condition not holding.

I agree that this is probably better as an option rather than as a replacement given how high-dimensional many examples are in Stan. The complexity is cubic in the number of coefficients to estimate for this type of control variates. The number of coefficients is d+1 for the standard first-order approach and (d+2)*(d+1)/2 for the second-order approach.

Penalised regression approaches can help with statistical efficiency but the ones I’ve looked at won’t help with computational complexity.

Having said that, you can easily reduce the cost by using only a subset of dimensions in creating your control variates. If you have a reasonable idea of which variables are important for predicting your function of interest then you can just use those. For example, if you’re interested in just the posterior expectation of X_1^2 then you could just build control variates using X_1 so you have a computational cost that doesn’t depend on dimension. This also means that you can potentially get some improvement in the setting where N<<d.
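In the OLS sketch from earlier in the thread, that just means passing a single gradient column (again with illustrative names):

# x: (N, d) posterior draws; grads: (N, d) gradients of the log target at the draws
# first_order_cv_estimate is the OLS-based sketch from earlier in the thread
e_x1_sq = first_order_cv_estimate(x[:, 0] ** 2, grads[:, [0]])  # cost independent of d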

Instead of the terms biased vs. unbiased, it might be useful to use the terms big bias (early part of MCMC) and negligible bias (late part of MCMC, if everything goes fine).

1 Like

I don’t know how I can work in this field and so regularly forget that the expectation of a transform isn’t the transform of the expectation! Given this, why would we expect transforming the control variates along with the variables to work?

We can’t even afford cubic in the number of dimensions in general. We regularly fit models with 10K or 100K parameters.

That’s a good point. We almost never care about 10K parameters even in models that have them. There are particular expectations of interest that we will care about. But we do want everything to converge to make sure we’re in the right part of the solution space.

1 Like

I think I misinterpreted what you meant by transforming the control variates, sorry. I was thinking about building the control variates based on transformed versions of the variables, rather than transforming the control variates after the fact. I’m not sure if this helps, but say you’d done a transformation phi = log x. You could estimate E[phi] = E[phi - alpha * cv(phi)], which of course wouldn’t be very helpful, but you could also estimate E[x] = E[exp(phi) - alpha * cv(phi)]. You don’t need to transform the control variates, but the estimate of alpha will be different and you need the gradients of the transformed log posterior with respect to phi (I know this can be an issue in Stan, so this may not help anyway).
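In code form, reusing the earlier OLS-based sketch (with phi being the unconstrained draws and grads the gradients of the log posterior with respect to phi; names are illustrative):

import numpy as np
# phi: (N, d) unconstrained draws; grads: (N, d) gradients of log posterior wrt phi
# first_order_cv_estimate is the OLS-based sketch from earlier in the thread
e_phi = first_order_cv_estimate(phi[:, 0], grads)            # E[phi_1]
e_x = first_order_cv_estimate(np.exp(phi[:, 0]), grads)      # E[x_1] = E[exp(phi_1)]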