Intermediate between mean-field and full-rank ADVI



I’m new to using the VB algorithm in Stan. I’m wondering if it’s sensible (and possible) to use an intermediate between the mean-field approximation and full-rank VB. I’m working with a latent variable model with a large number of parameters (10k+) and I’m getting promising results with the mean-field approximation. I’d like to use something closer to the full-rank approximation; however, as expected, it isn’t feasible with this many parameters due to memory constraints. I have prior knowledge about which subsets of parameters I expect to exhibit correlated posteriors. Is it possible to express this somehow in my Stan model? In case it matters, I usually use PyStan.



A low-rank covariance matrix approximation would be sensible, and there are a few papers demonstrating useful results. Currently this is not implemented in Stan. It would not be a huge amount of work to add by modifying the current full-rank algorithm, but it is enough work that I don't know who would have time to do it in the near future. This could be a good project for someone new who is interested in VB.


Thanks for the answer. Where would one start if they were interested in contributing a low-rank algorithm? I just read over some of this code here, so based on this I guess it basically amounts to mostly implementing this interface, and properly managing the various lower Cholesky factors in block-diagonal form? I guess maybe deciding on the API (for describing the desired blocks of variables) would be tricky too. My C++ skills are rusty but I’m considering working on it in my spare time. Where does hypothetical feature discussion normally take place - is opening a GitHub issue appropriate?


I was thinking of a simpler form: diagonal plus reduced-rank covariance. See, e.g., Ong et al.
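For reference, the "diagonal plus reduced-rank" family referred to here parameterizes the covariance with a factor structure (sketched here as a standard formulation, not code from Stan):

```latex
\Sigma \;=\; B B^\top + D^2,
\qquad B \in \mathbb{R}^{n \times r},\quad
D = \operatorname{diag}(d_1, \dots, d_n),\quad r \ll n
```

so storage and per-iteration cost scale like O(n(r+1)) rather than the O(n^2) required by a full-rank Cholesky factor, while still capturing the strongest posterior correlations through the r columns of B.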

It’s better to continue discussion here until there is a clear plan what to propose in an issue.


Got it, thanks for the refs and advice. Ong et al. seems like a reasonable approach. Equations (9) and (10) are easily implemented within the current framework (using the pre-computed gradient wrt mu by stan::model::gradient).

My proposal would be this (directly mirroring Ong et al.):

  1. Implement a new variational family normal_lowrank, with a mean parameter mu of dimension n, a low-rank factor matrix B of dimension n x r, and a standard-deviation vector d of dimension n.
  2. Implement the gradients with respect to mu, B, and d. The gradient with respect to mu is just the model gradient (as in both the fullrank and meanfield implementations). The gradients for B and d are given in the paper and rely on the precomputed mu gradient. There is a computationally efficient form for the inverse in the second terms of the gradients with respect to B and d (the Woodbury formula turns it into the inverse of an r x r matrix).
  3. Implement the entropy computation (for computing the ELBO). The matrix determinant lemma turns it into an r x r determinant.
  4. Implement the transformation. The transformation relies on an (r + n)-dimensional sample (z, eps) from N(0, I) and has the closed form given in paragraph 1 of Section 3 (mu + B z + d ∘ eps, with ∘ elementwise).

This covers most of the math, I think. It seems that all of this can be done by mirroring the current fullrank and meanfield implementations in `src/stan/variational/families`, plus some new boilerplate (e.g. confirming the rank is strictly less than the dimensionality).

The part I’m not sure about is outlining a plan for how to link the algorithm to the interfaces/compiler. Hopefully this is less intensive than implementing the math. Any guidance here?

Thanks again.


Looks good.


Which interface are you using? It might be easiest to work with CmdStan when adding this kind of new feature. See cmdstan/src/cmdstan/command.hpp for how the variational algorithm is called and cmdstan/src/cmdstan/arguments.hpp for the argument definitions. By modifying these you can add a new argument for the rank. You also need to add lowrank to stan/services/experimental/advi; see the tests in test/unit/variational and test/unit/services/experimental/advi. The output is the same for meanfield and fullrank, so no changes are needed there when adding lowrank.


Cool, I’ve implemented a first pass of the math and some of the boilerplate. CmdStan seems like the most straightforward approach for a first pass at the interface. I’ll open an issue now and start the pull request. I’ll tag you in the issue and PR. Thanks for all the guidance here, too.


Wanted to add to the guidance:


  Unit tests are built through make by specifying the executable as the target
  to make. For a test in src/test/*_test.cpp, the executable is test/*.

  Header tests
  - test-headers  : tests all source headers to ensure they are compilable and
                     include enough header files.

  To run a single header test, add "-test" to the end of the file name.
  Example: make src/stan/math/constants.hpp-test

  - cpplint       : runs on source files. requires python 2.7.
                    cpplint is called using the CPPLINT variable:
                      CPPLINT = lib/stan_math/lib/cpplint_4.45
                    To set the version of python 2, set the PYTHON2 variable:
                      PYTHON2 = python

Knowing these has saved me a lot of time.


This would be great.

Whatever you do, talk to @bbbales2, who is working on similar things for our metric (mass matrix) adaptation for HMC.

There’s also a pile of work by Tamara Broderick et al. on adjusting mean-field posteriors to account for covariance.

Both that and a structured (block-diagonal) approximation would be helpful. The block-diagonal version is a little harder to imagine how it would be specified.

The place to add it is as a service function in the Stan repo. Things get added there, then wrapped into CmdStan, RStan, and PyStan. We can help with the latter once the service function is implemented.


I’ve got a draft of the service implementation and the CmdStan addition, and everything seems to be working on that side. But partial template overrides not being possible for classes means there’s a lot of code duplication. I also need to debug a lot of the Eigen code in the math, which I’m not familiar with, so it will probably be a week or two before I have a working version, and another couple of weeks to sort out the code duplication issue.

I agree a block-diagonal form would be useful. It might also pave the way for a hybrid ADVI/point-estimate procedure. I’ve implemented such models in TensorFlow and the gradients are slow to compute there, so I think they might make a good addition to Stan.


C++ lets you partially specialize template classes.

Is that not what you mean?


Ah, yes, that is what I mean. I did that, but because of the additional rank argument the internal methods require some modification to avoid duplication, and I haven’t worked out how best to do that yet. Still warming up to C++.