Any work on (or pointers to) Bayesian updating with "forgetting"?

I believe this situation could potentially be modeled without introducing any ad-hoc mechanism of “forgetting” by reframing the problem as one where we’re trying to infer the latent (aka hidden) state of a time-varying process – at each timestep t the process is in some time-dependent latent state state s(t) \in S and emits an observation x(t) according to a Beta distribution with parameters that are a function of the current latent state s(t). In this case s(t) could be some parametrisation of a Beta distribution, say. Concretely, think of s(t) for a fixed t as being a vector of the Beta distribution parameters, say (\alpha, \beta) \in \mathbb{R+}^2.

In the case where the state space S that contains s(1), \ldots, s(T) is discrete, this kind of model is known as a hidden Markov model (HMM). In the case where the state space S is continuous, then I believe the model is known as a state space model (e.g. the Kalman filter is a particular instance of a state space model).

A complete definition of the model would need to include:

  • an observation model, which defines a probability distribution to sample x(t) \in [0, 1] from, as a function of the latent state s(t). I.e. \Pr( x(t) | s(t) ) . In this case we’d plug in the probability distribution function of the Beta distribution.
  • a transition model, which specifies how the probability of the latent state s(t) evolves in time, as a function of the latent state s(t-1) at the prior timestep. i.e. \Pr( s(t) | s(t-1) ). There are many options to define how the state evolves – and different choices could give you different families of models. The choice of transition model would govern how rapidly or slowly the latent system state s(t) is allowed to vary from the prior state s(t-1).

One side effect of doing this is that you’d define a complete generative probabilistic model for the situation, including the dynamics of how the process evolves over time, through the specific choice of transition model. Might not necessarily be something that is possible to compute efficiently, but might give more clarity about what the model or problem is.

In the HMM literature there are standard algorithms for estimation tasks such as estimating the current latent state from all prior historical observations, or estimating what the the latent state at some previous timestep incorporating all historical observations including the ones that were observed afterward. There are also standard approaches for estimating unknown parameters of the transition model or observation model given one or more sequences of observed data (e.g. the expectation maximisation algorithm).

One complication with trying to get some kind of (continuous) state space model working for this situation is that if we want each value of the state space s(t) to be Beta distribution parameters (\alpha, \beta) \in \mathbb{R+}^2, then we need to figure out how to represent a probability distribution over the parameter space \mathbb{R+}^2, so we can encode our uncertainty about the Beta distribution parameters at each time step. Ideally we could find some family of distributions that is closed under the operation of applying the transition model, and is also closed when conditioned on a new observation – perhaps a conjugate prior distribution for the Beta distribution, if there is such a thing.

I haven’t read much about (continuous) state space models so I can’t suggest any references that seem to be a good match for this exact situation. It might be the case that someone has already derived an elegant and efficiently computable state space model for this exact situation – maybe buried in electrical engineering literature.

A less elegant way to proceed could be to take a discrete approximation of the latent state space: e.g. if we arbitrarily fix a grid of n values of \alpha, \alpha_1, \ldots, \alpha_n and n values of \beta, \beta_1, \ldots, \beta_n, then we get a finite state space \{(\alpha_1, \beta_1), (\alpha_1, \beta_2), \ldots, (\alpha_n, \beta_n)\} containing n^2 elements, and then it falls into the framework of the HMM with a discrete state space. E.g. setting n=100 would give a finite state space with 10,000 possible choices of (\alpha, \beta), and an implementation of the HMM forward algorithm could update itself from hundreds of observations and spit out a posterior distribution in much less than a second. If there were also uncertain parameters in the transition model that also needed to be estimated, then perhaps this kind of naive grid approximation approach would prove infeasible due to the size of the discrete state space.

I spent a few months learning about HMMs last year, and two introductory texts I found helpful were Russell & Norvig’s AI textbook (specifically the chapter on temporal reasoning over time, which introduces Markov processes, HMMs and the Kalman filter), and also Rabiner’s HMM tutorial.

Not sure if this is a fruitful or practical suggestion, but perhaps it gives some potential connections to HMM or state space model literature.

8 Likes