Scaling Bayesian updating (old posterior to new prior)

In an industrial setting, I have a process where tools T are dedicated to a repetitive task, and when a tool is worn out it is replaced by a new one.

The evolution of a tool can be described by a model that outputs the state of the tool y (almost never measured directly) given the available information x, that is,

P(y \mid x, \theta_T),

where \theta_T are the parameters of the tool. Unfortunately, much of the information needed to fit the model is missing, so I include it as additional parameters, which considerably increases the dimension of the sampling space; Stan still samples reasonably fast.

To forecast the performance of a tool T with the model, I use the posterior P(\theta \mid D), where D denotes the data. The model is hierarchical: the tool parameters \theta_T are drawn from another distribution that depends on hyperparameters \bar{\theta}. Ideally, \theta_T would be independent of the tool, but some variability has been observed, and characterizing that variation experimentally is infeasible, so my plan is to learn it from factory data.

To deploy the model successfully, I would like to update my posterior with the data collected after each tool is discarded, assuming the tools are drawn independently, so that

\pi_{\mathrm{new}}(\bar{\theta}) \propto f(T_n \mid \theta_{T_n}, \bar{\theta})\pi_{\mathrm{old}}(\bar{\theta}).

The problem is that Stan takes smooth density functions as input and produces samples as output, so I have to convert the posterior samples back into a smooth probability density.

A solution I have heard of is to fit some parametric density, like a Gaussian, to the samples and adjust it, but I do not think this hack would do the job. Another solution is to refit the whole dataset every time, which is unfeasible. I found two other threads, this and this, but I feel they do not fully address my problem.
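
To make the hack concrete, moment matching a Gaussian to the old draws would look something like this (a minimal sketch; the draws array is just a placeholder for posterior draws of \bar{\theta} from the previous fit):

```python
import numpy as np
from scipy import stats

# Placeholder: posterior draws of the hyperparameters theta_bar from the
# previous fit, shape (n_draws, n_dims)
draws = np.random.default_rng(0).normal(size=(4000, 3))

# Moment-match a multivariate normal to the old posterior
mu = draws.mean(axis=0)
Sigma = np.cov(draws, rowvar=False)

# This density would act as the "new prior": pass mu and Sigma as data to the
# next Stan fit and replace the original prior on theta_bar with
# multi_normal(mu, Sigma).
new_prior = stats.multivariate_normal(mean=mu, cov=Sigma)
print(new_prior.logpdf(mu))
```

My worry is that a single Gaussian like this discards any skewness or multimodality in the old posterior.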

I came across a paper on image processing where the prior density function is learned from samples (great!). One of the methods is Normalizing Flows (NF), which I realized belong to the same family of distribution-learning methods as GANs and VAEs, but NFs seem better suited to me because they make it easy to evaluate the probability density, which is what Stan needs.
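
As a sketch of what I have in mind (not a working recipe), a flow could be fitted to the old posterior draws so that its log-density is available in closed form. This follows the pattern in Pyro's normalizing-flow tutorials; the helpers I use (spline_coupling, clear_cache) are how I understand that API and may differ across versions:

```python
import torch
import pyro.distributions as dist
import pyro.distributions.transforms as T

dim = 3
# Placeholder: posterior draws of theta_bar from the previous Stan fit
draws = torch.randn(4000, dim)

# A spline coupling flow on top of a standard normal base distribution
base = dist.Normal(torch.zeros(dim), torch.ones(dim))
transform = T.spline_coupling(dim, count_bins=16)
flow = dist.TransformedDistribution(base, [transform])

# Fit the flow to the draws by maximum likelihood
optimizer = torch.optim.Adam(transform.parameters(), lr=1e-3)
for step in range(2000):
    optimizer.zero_grad()
    loss = -flow.log_prob(draws).mean()
    loss.backward()
    optimizer.step()
    flow.clear_cache()  # transforms cache intermediates between calls

# flow.log_prob(theta_bar) is now a smooth, differentiable density that could
# serve as the prior term in the next fit.
```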

I still have many doubts (excuse me if I ask nonsense), but I feel sort of lost.

  • How much is known in the Bayesian community about the use of normalizing flows for online learning (in my case, samples (tools) are independent, so there is no sequential learning)?
  • How much is known about the use of distribution learning in general (not only NFs) to address the problem of posterior re-use?
  • Would Variational Inference suit me better than MCMC? Is the output of VI a smooth function I can reuse? I have heard that VI is not robust, and I fear some loss of information at each update.
  • I have seen that Pyro, an alternative to Stan, has some modules for NFs. That hints that my problem, and a solution, may already be well known and studied within the community. What do people know?

I do not have any advice for fponce, but just wanted to share that I have a very similar problem. I have an ever-expanding pool of data, a model that needs to see all the data at once, and I need to produce estimates from new data in a relatively short period of time (10 min as an upper limit before people start complaining, I’d guess). Eventually, there will be too much data, and I’ll run out of tricks to improve performance and money to buy more compute.

I’m not really sure when that day would come, but it would be nice to have some approach that can produce pretty good estimates in the short term, using previously computed posteriors as priors, and then compute the full posterior from all the data at a later time.

HMC scales very well in dimension. It has a harder time with correlated posteriors because we only do diagonal mass matrix scaling by default, and it's hard to extend to dense mass matrices in high dimensions or where there are varying scales, as in a hierarchical model.

This sounds like a good motivation for a hierarchical model.

Right—there’s not a general way to update a prior into an analytic posterior unless the model is fully conjugate. The easiest thing to do is to just re-run the model each time with more data. If that’s infeasible, then you probably want a solution that doesn’t involve HMC, such as sequential Monte Carlo. You can warm start draws, step size, and mass matrix.
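
A warm start with CmdStanPy would look roughly like this; the file names are placeholders and the step_size/metric argument formats are from memory, so check the sample() docs:

```python
from cmdstanpy import CmdStanModel

# Placeholders: the Stan program and data files are illustrative names
model = CmdStanModel(stan_file="tool_model.stan")
fit = model.sample(data="data_batch_1.json")

# Warm start a later fit on the enlarged dataset: reuse the adapted step size
# and inverse metric from the previous run and shorten re-adaptation. Inits
# could likewise be built from the previous fit's final draws.
fit2 = model.sample(
    data="data_batches_1_and_2.json",
    step_size=list(fit.step_size),                   # one value per chain
    metric=[{"inv_metric": m} for m in fit.metric],  # one dict per chain
    iter_warmup=200,
)
```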

You could do what you're describing: it goes by the name of “emulators” or “simulation-based inference”. Or you could do what the ML folks call “amortized inference”, which is to train a neural net to map from data to parameter estimates directly. Any of these things might be reasonable to do in cases where either (a) MCMC doesn’t scale, or (b) you’re willing to trade accuracy for speed in production.
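
A toy version of the amortized idea, just to make it concrete (everything here is illustrative; the simulator stands in for your tool model): simulate parameter/data pairs from the prior predictive and train a network to regress the parameters from the data.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy simulator standing in for the tool model: draw theta from the prior,
# then simulate a data summary y given theta.
def simulate(n):
    theta = torch.randn(n, 2)
    y = theta @ torch.tensor([[1.0, 0.5], [0.2, 1.0]]) + 0.1 * torch.randn(n, 2)
    return y, theta

# Network that maps data summaries directly to parameter estimates
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    y, theta = simulate(256)
    opt.zero_grad()
    loss = ((net(y) - theta) ** 2).mean()  # point estimates only; richer
    loss.backward()                        # variants predict a full posterior
    opt.step()

# At deployment, net(new_y) returns an estimate immediately, with no MCMC.
```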

I’d say it’s just making inroads. I know a lot about them, as we’re doing research on them here, just not as part of Stan.

It’s not very common. The general area you’re looking for is kernel density estimation. It’s more common to turn to sequential Monte Carlo or just stick to conjugate models.
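
For completeness, a KDE over the old draws is short with SciPy (placeholder draws below), though it degrades quickly as the dimension grows:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Placeholder posterior draws, shape (n_draws, n_dims)
draws = np.random.default_rng(1).normal(size=(4000, 3))

# gaussian_kde expects variables in rows and observations in columns
kde = gaussian_kde(draws.T)

# A smooth density over the old posterior, evaluated here at a few points
print(kde.logpdf(draws[:5].T))
```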

There are a lot of VI methods. The ones in Stan are not very accurate, but you could use normalizing flows as the approximating family.

I would suggest asking on the Pyro forums. I don’t know enough about it to comment on where they’re at. You probably want NumPyro, which outputs JAX code, but I’m just guessing here.


We’ve seen questions like this several times on the forum (including in the posts linked by the OP), and of course there’s no fully general solution. However, it seems to me that in many cases it should be possible to do these updates via PSIS (Pareto-smoothed importance sampling). In particular, when the new data:

  • are small
  • are not too influential
  • don’t introduce new parameters (like new random effect levels in models where they cannot be readily integrated out)

then you could do these little updates repeatedly, and once you accumulate enough of them that the data are no longer small/uninfluential you could at that point re-fit the model and begin a new cycle of fast PSIS updates as further data come in. By running unusually large numbers of posterior iterations, this process could be made quite robust.
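
For what it’s worth, the core of such an update is just importance weighting the old posterior draws by the likelihood of the new data and Pareto-smoothing the weights. A rough sketch, where az.psislw does the smoothing and log_lik_new is a hypothetical stand-in for the model-specific likelihood:

```python
import numpy as np
import arviz as az

# Placeholder: posterior draws of the hyperparameters from the last full fit
draws = np.random.default_rng(2).normal(size=(4000, 3))

def log_lik_new(theta):
    # Hypothetical stand-in for the model-specific log-likelihood of the
    # newly collected data given the parameters
    return -0.5 * np.sum((theta - 0.1) ** 2, axis=-1)

log_w = log_lik_new(draws)              # raw importance log-weights
log_w_smooth, khat = az.psislw(log_w)   # Pareto-smoothed, self-normalized

# khat diagnoses reliability: values above ~0.7 mean the new data are too
# influential and a full re-fit is warranted.
weights = np.exp(log_w_smooth)
updated_mean = np.average(draws, axis=0, weights=weights)
print(khat, updated_mean)
```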

Does anyone know if there’s already a package in the Stan universe that readily achieves this? Or some reason I’m overlooking why it doesn’t work? @avehtari