Any work on (or pointers to) Bayesian updating with "forgetting"?

I’ve developed some code that does periodic Bayesian updating: fitting a beta distribution to [0,1]-bounded time series data. I implemented “forgetting” at each update by expanding the posterior’s (phi and lambda’s) SDs as the posterior becomes the prior for the subsequent iteration. The goal is to adapt to any drift, but also to minimize potential overfitting to fluctuations in the data. I think it works great!

My post isn’t really “publicity”; rather, it’s a request for pointers to any similar work.

My questions:

  • Any pointers to work on Bayesian updating with “forgetting” by manipulating distributions?
  • Any other approaches that accomplish similar results? Perhaps somehow “de-weighting” older data’s contributions to the posterior distributions? There’s a ton of time series work on de-weighting older observations when computing means (e.g., exponential smoothing). But are there ways to “de-weight” older data when computing a full posterior distribution?
  • Perhaps there are particular application areas (presumably that deal with time series data) where I might focus my literature search?
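To illustrate the sort of de-weighting I have in mind in the second bullet: discount each observation’s log-likelihood contribution by its age (I gather this is sometimes called likelihood tempering or a “power prior”). A minimal Python sketch – the discount factor and the toy log-likelihood are purely illustrative:

```python
import math

def discounted_loglik(data, loglik, gamma=0.9):
    """Sum of per-observation log-likelihoods, where each observation's
    contribution is discounted by gamma for every step of its age:
    the newest observation gets weight 1, the oldest gamma**(T-1)."""
    T = len(data)
    return sum(gamma ** (T - 1 - t) * loglik(x) for t, x in enumerate(data))

# Toy example: standard-normal log-likelihood, three observations.
ll = lambda x: -0.5 * (x * x + math.log(2 * math.pi))
total = discounted_loglik([0.0, 1.0, 2.0], ll, gamma=0.5)
```

The question is whether there is principled work on doing this (or something like it) for full posteriors rather than point summaries.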

Many thanks in advance for any pointers!

(So far, all I’ve found is this post on non-parametric Bayesian updating by @robertgrant, which I confess is a bit over my head. It looks super-interesting, though – and I’m working through it!)

If you’re interested, a brief summary of my approach – and a few plots – follow. The steps:

  1. Generate fake [0,1]-bounded time series data drawn from a time-varying beta distribution.
  2. Divide the data up into sequential “chunks.”
  3. Fit a beta distribution to the first chunk (using Stan). Extract the posterior mean and SD of the two parameters, phi and lambda.
  4. For update #1 (with the 2nd chunk of data), use the previous posterior (as summarized by phi and lambda’s means and SDs) as the new prior, but with phi and lambda’s SDs multiplied by, e.g., 1.4. (Which diffuses the new prior, relative to the previous posterior.)
  5. Repeat steps 3 and 4 for each subsequent update on each subsequent chunk of data…
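The prior-diffusion part of step 4 is essentially a one-liner. A minimal Python sketch – the posterior summary numbers below are made up, and in my actual code the summaries come from Stan’s posterior draws:

```python
def inflate_prior(mean, sd, factor=1.4):
    """Step 4: reuse the previous posterior summary as the new prior,
    but widen the SD by `factor` to induce "forgetting"."""
    return mean, sd * factor

# Hypothetical posterior summaries from the previous Stan fit:
phi_mean, phi_sd = 0.42, 0.03    # phi = the beta's mean
lam_mean, lam_sd = 12.5, 1.8     # lambda = the beta's concentration

phi_prior = inflate_prior(phi_mean, phi_sd)   # SD widens to ~0.042
lam_prior = inflate_prior(lam_mean, lam_sd)   # SD widens to ~2.52
```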

(Many thanks to @LucC for suggesting I use the phi/lambda parameterization; big help!)

And some plots…

In each plot I show three time series: the raw data, a sliding window approach, and “post”: the posterior using my “forgetting” approach. The top frame in each plot just shows the series’ means, to provide a rough idea of how the three approaches compare. Below are separate frames for the three individual series which show the distributions’ 80% Percentile Intervals – to give at least a general feel for the posteriors’ distributions at each update. (For the Percentile Intervals for the data, I just used a non-Bayesian estimation for each data “chunk.”)

In the first plot I’ve implemented “forgetting” via phi and lambda SD multipliers of 1.4. (You can see the results in blue.) There’s a nice smoothness to the “forgetting” (and thereby in adapting to the new data). But also a bit of resistance to the “perturbation” at iterations 16-18. (By comparison, the sliding window approach, in green, of course “forgets” rather abruptly at iteration #8.)

Increasing the SD multipliers to 1.6 increases the degree of “forgetting.”

As an aside, here’s one phi and lambda posterior plot. The left figure is a posterior directly from Stan. The middle figure is 4000 points sampled from Stan’s marginal summary statistics of that posterior. Not exactly the same, but “close.” (There appears to be a slight parameter correlation in the “true” posterior on the left.) And the right figure is 4000 points sampled from an expanded posterior.

[Plot: phi and lambda posteriors – Stan draws (left), samples from the marginal summaries (middle), samples from the expanded posterior (right)]

For comparison, here’s a plot with no forgetting, where new data has less and less impact on the posterior, due to the decreasing relative contributions of each subsequent “chunk” of data to each update’s posterior. (“No forgetting” in my code via just multiplying phi and lambda’s SDs by 1.0.)

A final aside is that the above plot is close to – but not identical to – just computing the posteriors at each iteration with the cumulative data to that point. (So there’s a little bit of loss from summarizing the successive full posteriors by only four marginal parameter values each…)

5 Likes

Hi, it’s nice to see someone else exploring updating! A couple of things come to mind at a first glance that might be helpful. If they are old hat, please don’t be offended.
First, I’ve not worked with this beta distribution parameterisation. Are the phi and lambda posteriors correlated? If so, you need to account for that with a joint posterior, not just two marginals.
Second, are they really normal? How quickly do they go to being passably normal? What happens if variance inflation pushes them into impossible values?
Third, my experience was that scaling up the n of observations can cause trouble, as the likelihood gets very narrow but can be a bad fit to the (already high-n) prior. Worth checking out with big n.
Fourth, what happens when someone tries to do this with many correlated time series or lots of covariates, so that the number of parameters p is large? That’s where the loneliness of high-dimensional space kicks in. I would guess you don’t have to answer it yet, but it’s a question you will get asked for sure in due course.

Now, other work out there:
the idea of manually inflating a posterior SD to represent “forgetting” rings a bell, but I’m not sure where I’ve seen it. I’ll sleep on it and let you know if I remember. Maybe it’s from AI.
One thing I didn’t realise at first about old time series methods is that their roots are in analog signal processing. Exponentially weighted moving averages are a thing precisely because one could do it in the 50s with no more than capacitors and vacuum tubes. So, they are not necessarily the right ideas to emulate for Bayesian updating.

Anyway, interesting stuff, well done.

5 Likes

I believe this situation could potentially be modeled without introducing any ad-hoc mechanism of “forgetting” by reframing the problem as one where we’re trying to infer the latent (aka hidden) state of a time-varying process – at each timestep t the process is in some time-dependent latent state s(t) \in S and emits an observation x(t) according to a Beta distribution with parameters that are a function of the current latent state s(t). Concretely, think of s(t) for a fixed t as a vector of Beta distribution parameters, say (\alpha, \beta) \in \mathbb{R}_+^2.

In the case where the state space S that contains s(1), \ldots, s(T) is discrete, this kind of model is known as a hidden Markov model (HMM). In the case where the state space S is continuous, then I believe the model is known as a state space model (e.g. the Kalman filter is a particular instance of a state space model).

A complete definition of the model would need to include:

  • an observation model, which defines a probability distribution to sample x(t) \in [0, 1] from, as a function of the latent state s(t). I.e. \Pr( x(t) | s(t) ) . In this case we’d plug in the probability distribution function of the Beta distribution.
  • a transition model, which specifies how the probability of the latent state s(t) evolves in time, as a function of the latent state s(t-1) at the prior timestep. i.e. \Pr( s(t) | s(t-1) ). There are many options to define how the state evolves – and different choices could give you different families of models. The choice of transition model would govern how rapidly or slowly the latent system state s(t) is allowed to vary from the prior state s(t-1).
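These two pieces can be sketched as a tiny generative simulation. Every distribution and constant below is an arbitrary illustration, not a modelling recommendation – here the latent state is (log alpha, log beta) following a Gaussian random walk, which keeps alpha and beta positive:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(T=100, step_sd=0.05):
    """Generative sketch: s(t) = (log alpha, log beta) evolves by a
    Gaussian random walk (the transition model), and x(t) is drawn from
    Beta(alpha, beta) given s(t) (the observation model)."""
    s = np.array([np.log(2.0), np.log(5.0)])      # initial latent state
    xs = np.empty(T)
    for t in range(T):
        s = s + rng.normal(0.0, step_sd, size=2)  # Pr( s(t) | s(t-1) )
        alpha, beta = np.exp(s)
        xs[t] = rng.beta(alpha, beta)             # Pr( x(t) | s(t) )
    return xs

x = simulate()
```

Inference then runs in the opposite direction: given the observed x(1), ..., x(T), estimate a belief over the latent s(t).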

One side effect of doing this is that you’d define a complete generative probabilistic model for the situation, including the dynamics of how the process evolves over time, through the specific choice of transition model. Might not necessarily be something that is possible to compute efficiently, but might give more clarity about what the model or problem is.

In the HMM literature there are standard algorithms for estimation tasks such as estimating the current latent state from all prior historical observations, or estimating what the latent state at some previous timestep was, incorporating all historical observations, including the ones that were observed afterward. There are also standard approaches for estimating unknown parameters of the transition model or observation model given one or more sequences of observed data (e.g. the expectation maximisation algorithm).

One complication with trying to get some kind of (continuous) state space model working for this situation is that if we want each value of the state space s(t) to be Beta distribution parameters (\alpha, \beta) \in \mathbb{R}_+^2, then we need to figure out how to represent a probability distribution over the parameter space \mathbb{R}_+^2, so we can encode our uncertainty about the Beta distribution parameters at each time step. Ideally we could find some family of distributions that is closed under the operation of applying the transition model, and is also closed when conditioned on a new observation – perhaps a conjugate prior distribution for the Beta distribution, if there is such a thing.

I haven’t read much about (continuous) state space models so I can’t suggest any references that seem to be a good match for this exact situation. It might be the case that someone has already derived an elegant and efficiently computable state space model for this exact situation – maybe buried in electrical engineering literature.

A less elegant way to proceed could be to take a discrete approximation of the latent state space: e.g. if we arbitrarily fix a grid of n values of \alpha, \alpha_1, \ldots, \alpha_n and n values of \beta, \beta_1, \ldots, \beta_n, then we get a finite state space \{(\alpha_1, \beta_1), (\alpha_1, \beta_2), \ldots, (\alpha_n, \beta_n)\} containing n^2 elements, and then it falls into the framework of the HMM with a discrete state space. E.g. setting n=100 would give a finite state space with 10,000 possible choices of (\alpha, \beta), and an implementation of the HMM forward algorithm could update itself from hundreds of observations and spit out a posterior distribution in much less than a second. If there were also uncertain parameters in the transition model that also needed to be estimated, then perhaps this kind of naive grid approximation approach would prove infeasible due to the size of the discrete state space.
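A bare-bones sketch of that grid-based forward filter might look like the following (I use a smaller n=20 grid so it runs quickly; the grid ranges, the sticky-uniform transition kernel, and the fake data are all illustrative choices):

```python
import numpy as np
from scipy import stats

# Discrete approximation of the latent (alpha, beta) space.
n = 20
alphas = np.linspace(0.5, 10.0, n)
betas = np.linspace(0.5, 10.0, n)
A, B = np.meshgrid(alphas, betas, indexing="ij")
states = np.column_stack([A.ravel(), B.ravel()])   # n^2 grid points

# Transition model: stay put with prob 0.99, else jump uniformly.
# (A crude choice; a random-walk kernel on the grid would be smoother.)
stay = 0.99

def forward_step(belief, x):
    """One HMM forward-algorithm update: predict, then condition on x."""
    predicted = stay * belief + (1 - stay) / len(belief)
    lik = stats.beta.pdf(x, states[:, 0], states[:, 1])
    posterior = predicted * lik
    return posterior / posterior.sum()

belief = np.full(len(states), 1.0 / len(states))   # uniform initial belief
rng = np.random.default_rng(0)
for x in rng.beta(2.0, 5.0, size=200):             # fake observations
    belief = forward_step(belief, x)

post_mean = belief @ states   # posterior mean of (alpha, beta) on the grid
```

With data generated from a fixed beta(2, 5), the filtered belief concentrates on grid states whose implied mean is near 2/7.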

I spent a few months learning about HMMs last year, and two introductory texts I found helpful were Russell & Norvig’s AI textbook (specifically the chapter on temporal reasoning over time, which introduces Markov processes, HMMs and the Kalman filter), and also Rabiner’s HMM tutorial.

Not sure if this is a fruitful or practical suggestion, but perhaps it gives some potential connections to HMM or state space model literature.

8 Likes

Many thanks, @robertgrant! You’ve given me quite a bit to think about; very much appreciated! A few initial thoughts…

First, I’ve not worked with this beta distribution parameterisation. Are phi and lambda posteriors correlated?

In theory, I believe, the phi and lambda parameterization is uncorrelated. Whereas the alpha, beta parameterization is strongly correlated. And that’s actually where I got started on all this, writing an alpha, beta model that captured the alpha, beta correlation. (A good learning experience, as I’m new to Stan!) But then I couldn’t figure out how to pass along the posterior correlation to the new prior. My post about all this is here.

From a visual inspection, it looks like my phi and lambda posterior parameters are just slightly correlated, so I am getting a little information loss by converting the posteriors to marginal distributions only. This slight correlation may be due to the lingering effects of my initial prior (which I happen to specify, marginally, in alpha, beta space). I’ll do some testing to see if I can sort this out.

Second, are they really normal? How quickly do they go to being passably normal?

In general, probably not. And definitely not when the mean is close to either 0 or 1. This is something I certainly need to explore. Thanks! (Also, if both alpha and beta are < 1.0, the distribution is bimodal. But my “prior” for the data I’m considering is that that’s not going to happen.)

What happens if variance inflation pushes them into impossible values?

For now I just truncate. I need to figure out how and when this might cause my approach to misbehave.

Third, my experience was that scaling up the n of observations can cause trouble, as the likelihood gets very narrow but can be a bad fit to the (already high-n) prior. Worth checking out with big n.

I’ve done some very initial testing with large(r) n. But not much, and not with this in mind. I will!

Fourth, what happens when someone tries to do this with many correlated time series or lots of covariates, so there is a higher p number of parameters?

I haven’t thought about that at all! I will.

the idea of manually inflating a posterior SD to represent “forgetting” rings a bell, but I’m not sure where I’ve seen it. I’ll sleep on it and let you know if I remember. Maybe it’s from AI.

Thanks; appreciated!

@rfc, wow, thank you! I hadn’t thought at all about baking time variation into my model. An intriguing prospect! (I certainly agree with you that my approach is rather ad hoc; it would be good to somehow ground it within a more principled framework.)

I’m going to take some time to think through your suggestions, and will read through the HMM literature you’ve suggested. (I read a bit of Russell & Norvig long ago, but not with any of this in mind.)

Many thanks!

1 Like

You might find useful parallels in extended/unscented Kalman filters or particle filters. As was already mentioned, it seems easier to me to propagate Gaussian distributions across time and attach a measurement model.

4 Likes

I remembered, it was from Bayesian LSTMs. Look that up online and you will soon be in a confusing alternative universe of computer science experts, blended with hypemongers and bandwagon-jumpers. Which is which? That is a real classification challenge!

2 Likes

Thanks! And, LOL! From a quick search-and-skim, I think I can see how that might be. (I’ve never been particularly attracted to black box/mystery parameter models, myself…)

Many thanks! I appreciate the suggestions and encouragement.

+1 to @rfc’s suggestion. The advantage of building the variation into your model is that you can use standard Bayesian inference, which is well understood. What you’re trying to do is more similar to what’s known as sequential Monte Carlo (SMC), but SMC doesn’t tweak the priors manually; the updates just compose. SMC is more general than Kalman filtering, which relies on the conjugacy of multivariate normals to do stepwise inference.

I’d also recommend moving to either a log-odds scale (log(p / (1 - p))), which is unbounded, or a log scale (log(p)), which is constrained to be negative and can make sense for time series with exponential decay. The lack of constraints makes it much easier to express the time series than the two linked beta parameters (even under a reparameterization of beta(a, b) in terms of a/(a + b) and (a + b)).
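In code, the transforms in question look like this (hypothetical helper names):

```python
import math

def logit(p):
    """Map (0, 1) to the unbounded log-odds scale."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse of logit: map the reals back to (0, 1)."""
    return 1 / (1 + math.exp(-x))

def beta_to_mean_conc(a, b):
    """Reparameterize beta(a, b) as (mean, concentration)."""
    return a / (a + b), a + b

def mean_conc_to_beta(mu, kappa):
    """Inverse: recover (a, b) from (mean, concentration)."""
    return mu * kappa, (1 - mu) * kappa

# Round trip, up to floating-point rounding:
a, b = mean_conc_to_beta(*beta_to_mean_conc(2.0, 5.0))
```

A random walk (or any unconstrained time-series prior) on logit(mean) and log(concentration) then never produces impossible parameter values.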

HMMs are just a discrete mixture model in which the mixture-component selection follows a first-order Markov chain. There’s also a chapter in our user’s guide: 2.6 Hidden Markov models | Stan User’s Guide

4 Likes

Just in case anyone stumbles on this, I’m now following up on the suggestions to try out Sequential Monte Carlo (SMC)/particle filtering, and have bumped into some questions. In particular, my initial attempts at SMC aren’t set up to learn posterior observational variances from observed data. I’ve programmed in the observation likelihoods’ standard deviations – used for importance weighting – as fixed parameters, so they’re not at all affected by data. But this means that my posteriors’ variances are essentially “programmed in” as priors, vs. having any accounting of variance in the observations.

At this point, I’m not using Stan, so I didn’t want to create a new topic here. Instead, I created a Stack Overflow post, “Particle Filters: any way to have them ‘learn’ latent variances?”.

If anyone on this forum has any suggestions for me, I’d be most appreciative.

@s.maskell has done plenty of work in this area, also with integrating SMC sampling into Stan, so he might be able to help you here

1 Like

@ssickels: I’m not quite sure exactly what you want to do but there’s a relatively well-trodden model used in the context of particle filtering that considers time-varying variances by making that variance part of the “state”: I suggest you look for papers that mention “stochastic volatility” and “particle filters”. Note that if you want to analyse a batch of existing data, you can handle such models in Stan (by considering the variance at each time-step to be part of the parameters). Particle filters are going to be useful if you want something that could process a never-ending stream of data (we are working to have a variant of Stan that can handle such scenarios but it isn’t openly available quite yet). Does that help?
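To give a flavour of “variance as part of the state”: here is a minimal bootstrap particle filter sketch with Gaussian observations (deliberately not your beta setting, and every constant is illustrative). Each particle carries (mu, log_sigma), and the observation SD is propagated and re-weighted just like the mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# State per particle: (mu, log_sigma); both follow random walks,
# so the observation variance itself is learned from the data.
N = 2000
mu = rng.normal(0.0, 1.0, N)
log_sigma = rng.normal(0.0, 0.5, N)

def pf_step(mu, log_sigma, y, mu_step=0.05, vol_step=0.05):
    """One bootstrap-filter update: propagate, weight, resample."""
    # Propagate: random-walk transitions for both state components.
    mu = mu + rng.normal(0.0, mu_step, mu.shape)
    log_sigma = log_sigma + rng.normal(0.0, vol_step, log_sigma.shape)
    # Weight by the observation likelihood N(y | mu, exp(log_sigma)).
    sigma = np.exp(log_sigma)
    w = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / sigma
    w /= w.sum()
    # Multinomial resampling keeps the weights well-conditioned.
    idx = rng.choice(len(w), size=len(w), p=w)
    return mu[idx], log_sigma[idx]

for y in rng.normal(1.0, 0.5, size=100):   # fake observation stream
    mu, log_sigma = pf_step(mu, log_sigma, y)
```

After the 100 fake observations, the particle cloud concentrates around the true mean (1.0) and the true observation SD (0.5).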

Cheers
Simon

PS Thanks, @andrjohns, for tagging me.

4 Likes

@s.maskell: Many thanks for the input and pointers; appreciated! (And I second your thanks to @andrjohns for tagging you on this!)

One preliminary thing I’ll need to read up on is approaches for estimating variance in time series data – ideally based on some sort of moving window, I’d think. And I’ll delve into your suggestion to include it in my model’s “state”; makes good sense.

Not sure if you’re still working on this topic, but I just saw this post and think that our work (from ICLR 2020) might be interesting for you :)

https://openreview.net/pdf?id=SJlsFpVtDB

1 Like

@ssickels I’m very curious to know where this ended up. Let me know!