An incoming sequence of natural language tokens from which I want to learn a language model, say `p(word[n] | word[n-1], ..., word[n-k])`. My life would be a lot easier now had I grown up with continuous examples rather than natural language :-)

The beta is nice and conjugate (and we could use it for a symbol-level Morse code example). We start with `beta(theta | alpha = 1, beta = 1)` as a prior, and for each observation `y[n]`, increment `alpha` if it's 1 and `beta` otherwise. This is an easy online calculation. You can even generalize to a Dirichlet and condition on the previous words to do n-gram language modeling online. And this can be a component in training an online classifier. That's what we did with LingPipe.
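To make the update concrete, here's a minimal sketch in Python, assuming binary observations coded as 0 or 1. The class and method names are made up for illustration; this is not LingPipe's API.

```python
from dataclasses import dataclass


@dataclass
class BetaBernoulli:
    """Beta posterior over theta, the chance that an observation is 1."""

    alpha: float = 1.0  # prior pseudo-count of ones; beta(1, 1) is uniform
    beta: float = 1.0   # prior pseudo-count of zeros

    def update(self, y: int) -> None:
        """Fold in one observation y in {0, 1} by incrementing a count."""
        if y == 1:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self) -> float:
        """Posterior mean of theta."""
        return self.alpha / (self.alpha + self.beta)


model = BetaBernoulli()
for y in [1, 0, 1, 1]:  # observations arrive one at a time
    model.update(y)
print(model.alpha, model.beta, model.mean())
```

The whole posterior is carried by two counts, which is why nothing about past observations needs to be stored.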

Getting back to washing out, note that in this beta model, the order of `y` doesn't matter. Observations are completely exchangeable. If I watched ten K-pop videos in 2005, it has exactly the same effect on the model as if I'd watched them today. They don't wash out so much as get diluted in the same way that every new observation gets diluted as `N` increases. So yes, they have overall less impact, but not less impact than the next observation.
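A quick way to check both claims is the sketch below, again assuming 0/1 observations and a `beta(1, 1)` prior. The helper names are hypothetical. It shuffles the data to show the posterior is unchanged, then compares how far one more observation moves the posterior mean at small and large `N`.

```python
import random


def beta_posterior(ys, alpha=1.0, beta=1.0):
    """Return the (alpha, beta) counts after folding in every y in ys."""
    for y in ys:
        if y == 1:
            alpha += 1
        else:
            beta += 1
    return alpha, beta


def mean_shift(alpha, beta):
    """How far the posterior mean moves if the next observation is a 1."""
    return (alpha + 1) / (alpha + beta + 1) - alpha / (alpha + beta)


ys = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0]
shuffled = ys[:]
random.Random(0).shuffle(shuffled)  # same data, different order

# Exchangeability: the posterior depends only on the counts, not the order.
same = beta_posterior(ys) == beta_posterior(shuffled)

# Dilution: one more observation moves the mean less as N grows.
early_shift = mean_shift(*beta_posterior(ys))        # N = 10
late_shift = mean_shift(*beta_posterior(ys * 100))   # N = 1000
print(same, early_shift, late_shift)
```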

Now that's a different kettle of fish(ing boats). With a Kalman filter, you get a proper time series. There's a latent state representing the position of the boat over time and each observation is a noisy measurement of that position. The Kalman filter is nice because it's conjugate and so it's possible to update the posterior online as new observations stream in.

But the point is that it's a time-series model with a latent parameter for each observation denoting the true position. That lets it more effectively forget the past because everything is done relative to the last latent position.
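Here's a rough sketch of the simplest case I can think of: a scalar position that follows a random walk, with known process noise variance `q` and measurement noise variance `r`. Those modeling choices and all the names are my assumptions for illustration, not a description of any particular boat tracker. Setting `q = 0` turns it back into a static model, which makes the contrast with the exchangeable case easy to see.

```python
def kalman_step(mean, var, z, q, r):
    """One predict-and-update step of a scalar random-walk Kalman filter."""
    # Predict: the latent position may have drifted, so uncertainty grows by q.
    var_pred = var + q
    # Update: blend the prediction with the noisy measurement z.
    gain = var_pred / (var_pred + r)
    new_mean = mean + gain * (z - mean)
    new_var = (1.0 - gain) * var_pred
    return new_mean, new_var, gain


def run(zs, q, r, mean=0.0, var=1.0):
    """Filter the measurements zs online; return the final mean, variance, gain."""
    gain = None
    for z in zs:
        mean, var, gain = kalman_step(mean, var, z, q, r)
    return mean, var, gain


# The boat sits at 0 for 50 measurements, then at 10 for 50 more.
zs = [0.0] * 50 + [10.0] * 50

drift_mean, _, drift_gain = run(zs, q=1.0, r=1.0)     # latent state can move
static_mean, _, static_gain = run(zs, q=0.0, r=1.0)   # latent state is fixed
print(drift_mean, drift_gain)
print(static_mean, static_gain)
```

With `q > 0`, the gain settles at a constant (about 0.62 for these settings) instead of shrinking toward zero, so the estimate ends up at the new position and the old measurements are effectively forgotten. With `q = 0`, the gain decays like `1 / N` and the estimate is stuck near the overall average, which is the same dilution behavior as the beta model.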

You could do that with something like a time-series of preferences. This would wind up looking something like the dynamic topic models that Lafferty and Blei did.