I’ve developed some code that does periodic Bayesian updating: fitting a beta distribution to [0,1]-bounded time series data. I implemented “forgetting” at each update by inflating the SDs of the posterior’s parameters (phi and lambda) when that posterior becomes the prior for the next iteration. The goal is to adapt to drift while also minimizing overfitting to short-term fluctuations in the data. I think it works great!
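In symbols, the “forgetting” step applied to each parameter’s marginal summaries (for both phi and lambda) is just

$$\mu_{\text{prior},\,t+1} = \mu_{\text{post},\,t}, \qquad \sigma_{\text{prior},\,t+1} = k\,\sigma_{\text{post},\,t},$$

with the multiplier $k$ a bit larger than 1 (e.g., 1.4).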
My post isn’t really “publicity”; rather, it’s a request for pointers to any similar work.
My questions:
- Any pointers to work on Bayesian updating with “forgetting” by manipulating distributions?
- Any other approaches that achieve similar results? Perhaps somehow “de-weighting” older data’s contributions to the posterior distributions? There’s a ton of time series work on de-weighting older observations when computing means (e.g., exponential smoothing; a quick formula follows these questions). But are there ways to “de-weight” older data when computing a full posterior distribution?
- Perhaps there are particular application areas (presumably that deal with time series data) where I might focus my literature search?
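(To be concrete about the second question: by “de-weighting for means” I have in mind something like simple exponential smoothing,

$$\hat{x}_t = \alpha\, x_t + (1-\alpha)\,\hat{x}_{t-1}, \qquad 0 < \alpha \le 1,$$

where older observations get geometrically decaying weight. What I’m after is an analogous idea for whole posterior distributions rather than point estimates.)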
Many thanks in advance for any pointers!
(So far, all I’ve found is this post on non-parametric Bayesian updating by @robertgrant, which I confess is a bit over my head. It looks super-interesting, though – and I’m working through it!)
If you’re interested, a brief summary of my approach – and a few plots – follow. The steps:
- Generate fake [0,1]-bounded time series data drawn from a time-varying beta distribution.
- Divide the data up into sequential “chunks.”
- Fit a beta distribution to the first chunk (using Stan). Extract the posterior mean and SD of the two parameters, phi and lambda.
- For update #1 (with the 2nd chunk of data), use the previous posterior (as summarized by phi’s and lambda’s means and SDs) as the new prior, but with phi’s and lambda’s SDs multiplied by, e.g., 1.4. (This diffuses the new prior relative to the previous posterior.)
- Repeat steps 3 and 4 for each subsequent update, on each subsequent chunk of data… (Rough sketches of these steps, in Python for concreteness, follow below.)
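Here’s a rough sketch of steps 1 and 2, just to make the setup concrete. (The drift path, chunk sizes, and seed here are illustrative, not my actual values.)

```python
import numpy as np

rng = np.random.default_rng(1234)

n_chunks = 30      # number of sequential updates (illustrative)
chunk_size = 50    # observations per chunk (illustrative)
lam = 20.0         # concentration (lambda), held constant in this sketch

# Let the mean (phi) drift slowly across the chunks, with a brief "perturbation"
# (loosely like the one at iterations 16-18 in my plots).
phi_path = 0.35 + 0.15 * np.linspace(0.0, 1.0, n_chunks)
phi_path[15:18] += 0.20

chunks = []
for phi in phi_path:
    alpha, beta = lam * phi, lam * (1.0 - phi)   # beta parameters from phi/lambda
    chunks.append(rng.beta(alpha, beta, size=chunk_size))
```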
(Many thanks to @LucC for suggesting I use the phi/lambda parameterization; big help!)
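And a rough sketch of steps 3–5, the update loop itself. I’m using cmdstanpy here for concreteness; the Stan model uses the phi/lambda (mean/concentration) reparameterization, and the starting prior, sampler settings, and variable names are illustrative rather than my exact code.

```python
import numpy as np
from cmdstanpy import CmdStanModel

# Beta likelihood with alpha = lambda*phi, beta = lambda*(1-phi),
# and normal priors on phi and lambda whose means/SDs are passed in as data.
STAN_MODEL = """
data {
  int<lower=1> N;
  vector<lower=0, upper=1>[N] y;
  real<lower=0, upper=1> phi_mean;
  real<lower=0> phi_sd;
  real<lower=0> lambda_mean;
  real<lower=0> lambda_sd;
}
parameters {
  real<lower=0, upper=1> phi;   // beta mean
  real<lower=0> lambda;         // beta concentration
}
model {
  phi ~ normal(phi_mean, phi_sd);
  lambda ~ normal(lambda_mean, lambda_sd);
  y ~ beta(lambda * phi, lambda * (1 - phi));
}
"""

with open("beta_phi_lambda.stan", "w") as f:
    f.write(STAN_MODEL)
model = CmdStanModel(stan_file="beta_phi_lambda.stan")

sd_mult = 1.4   # the "forgetting" multiplier (1.0 = no forgetting)
prior = dict(phi_mean=0.5, phi_sd=0.25,
             lambda_mean=10.0, lambda_sd=10.0)   # vague starting prior (illustrative)

posteriors = []
for chunk in chunks:   # `chunks` from the previous sketch
    data = dict(N=len(chunk), y=chunk, **prior)
    fit = model.sample(data=data, chains=4, iter_sampling=1000, show_progress=False)
    phi_draws = fit.stan_variable("phi")
    lam_draws = fit.stan_variable("lambda")
    post = dict(phi_mean=phi_draws.mean(), phi_sd=phi_draws.std(),
                lambda_mean=lam_draws.mean(), lambda_sd=lam_draws.std())
    posteriors.append(post)
    # The posterior becomes the next prior, with inflated SDs ("forgetting").
    prior = dict(phi_mean=post["phi_mean"], phi_sd=sd_mult * post["phi_sd"],
                 lambda_mean=post["lambda_mean"], lambda_sd=sd_mult * post["lambda_sd"])
```

(Setting `sd_mult = 1.0` gives the “no forgetting” variant I show further down.)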
And some plots…
In each plot I show three time series: the raw data, a sliding-window approach, and “post” – the posterior using my “forgetting” approach. The top frame in each plot just shows the series’ means, to give a rough idea of how the three approaches compare. Below that are separate frames for the three individual series, showing each distribution’s 80% percentile interval – to give at least a general feel for the posteriors at each update. (For the data’s percentile intervals, I just used a non-Bayesian estimate for each data “chunk.”)
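(For what it’s worth, the 80% percentile intervals are nothing fancy – just the 10th and 90th percentiles of whatever draws represent each series at that update, e.g.:)

```python
import numpy as np

def percentile_interval_80(draws):
    """Return the (10th, 90th) percentiles of a set of draws: an 80% percentile interval."""
    return np.percentile(draws, [10, 90])
```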
In the first plot I’ve implemented “forgetting” via phi and lambda SD multipliers of 1.4 (results in blue). There’s a nice smoothness to the “forgetting” (and thus to the adaptation to new data), but also a bit of resistance to the “perturbation” at iterations 16–18. (By comparison, the sliding-window approach, in green, of course “forgets” rather abruptly at iteration #8.)
Increasing the SD multipliers to 1.6 increases the degree of “forgetting.”
As an aside, here’s one phi and lambda posterior plot. The left figure is a posterior directly from Stan. The middle figure is 4000 points sampled from Stan’s marginal summary statistics of that posterior. Not exactly the same, but “close.” (There appears to be a slight parameter correlation in the “true” posterior on the left.) And the right figure is 4000 points sampled from an expanded posterior.
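(Concretely, by “sampled from the marginal summary statistics” I mean something like the sketch below – independent draws using each parameter’s marginal mean and SD, clipped to valid ranges. The particular family and clipping here are just illustrative, and the independence is also why the slight phi–lambda correlation in the left panel gets lost.)

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_summary(phi_mean, phi_sd, lambda_mean, lambda_sd, n=4000, sd_mult=1.0):
    """Independent draws of (phi, lambda) from their marginal means/SDs,
    optionally with the "forgetting" multiplier applied to the SDs."""
    phi = np.clip(rng.normal(phi_mean, sd_mult * phi_sd, size=n), 1e-6, 1 - 1e-6)
    lam = np.clip(rng.normal(lambda_mean, sd_mult * lambda_sd, size=n), 1e-6, None)
    return phi, lam
```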
For comparison, here’s a plot with no forgetting, where new data has less and less impact on the posterior, due to the decreasing relative contribution of each subsequent “chunk” of data to each update’s posterior. (“No forgetting” is implemented in my code by simply multiplying phi’s and lambda’s SDs by 1.0.)
A final aside: the above plot is close to – but not identical to – just computing the posteriors at each iteration from the cumulative data up to that point. (So there’s a little bit of information loss from summarizing each successive full posterior by only four marginal parameter values…)