Does Stan support online learning? If so, could you point me to the place in the documentation that relates to this?
To clarify, online inference is when you use the posterior as the prior in further iterations, which allows you to learn as new data arrives. Examples of where this can be useful include matchmaking in online gaming (update the user skills immediately after each match they play) and recommendation engines (update the user preferences immediately after they buy an item / watch a movie / listen to a song, etc.).
Online learning is a cool concept!
In Stan, priors are coded using probability density functions (PDFs). Stan doesn’t return the functional form of the posterior distribution, only samples from the posterior distribution. That means if you wanted to take your posterior and then use it as a prior in another Stan model you’d have to somehow convert those samples to the functional form of some density. If for example your posterior is multivariate Gaussian, you can take your samples and fit a multivariate Gaussian to them and then use that multivariate Gaussian as a prior in a new model.
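For what it's worth, here is a minimal sketch of the "fit a multivariate Gaussian to the draws" step in plain Python. The draws below are simulated stand-ins rather than output from a real Stan fit, and the name fit_gaussian is made up for illustration. It is just moment matching (sample mean and sample covariance), so it is only sensible when the posterior really is close to Gaussian.

```python
import random

def fit_gaussian(draws):
    """Moment-match a multivariate Gaussian to draws (a list of equal-length rows)."""
    n = len(draws)
    d = len(draws[0])
    mean = [sum(row[j] for row in draws) / n for j in range(d)]
    # Unbiased sample covariance matrix
    cov = [
        [
            sum((row[i] - mean[i]) * (row[j] - mean[j]) for row in draws) / (n - 1)
            for j in range(d)
        ]
        for i in range(d)
    ]
    return mean, cov

# Hypothetical stand-in for posterior draws of two correlated parameters;
# in practice these rows would be extracted from the Stan fit.
rng = random.Random(0)
draws = []
for _ in range(4000):
    a = rng.gauss(1.0, 0.5)
    b = 0.8 * a + rng.gauss(0.0, 0.3)
    draws.append([a, b])

mu, Sigma = fit_gaussian(draws)
print("prior mean for next model:", mu)
print("prior covariance for next model:", Sigma)
```

The resulting mu and Sigma could then be passed as data to the next model and used in a multi_normal prior on the same parameters.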
Now that I think about it, I suppose there’s another way you can get an approximate functional form. You basically have the functional form up to a constant of proportionality. That constant is actually an integral that can be approximately computed using your MCMC samples, so there’s also that.
Thanks for the prompt response! Is there an example anywhere of how to implement this?
Also, how fast would the conversion of the distribution representation be compared to the time spent in inference (on a single data point - it’s online)? This would probably depend on the model, but roughly speaking.
It’s also beautiful when you see it illustrated in a textbook for conjugate models.
Almost always when I see people want to do this, it’s in a context where the effect of history should decay. For instance, an obsession with Gangnam Style a few years ago should have a diminishing effect on future music taste, whereas those last ten samba numbers should perhaps be up-weighted.
In the general case, it’s intractable. The only way to get the correct answer with a general model is to re-run. You can start with the old mass matrix, step size, and adaptation parameters, which should work for a while with new data until the posterior gets too concentrated.
@andrewgelman has been thinking about approximations.
It depends on how you approximate the functional form of your posterior, but as @Bob_Carpenter said, the only way to do it exactly (not as an approximation) is to refit the entire model with the new data.
Whenever I hear about online learning, the canonical example I think of is using a Kalman filter to track the position of a moving boat from noisy radar observations.
@Bob_Carpenter, doesn’t past history decay anyway when you get more data? Say a couple years ago I thought Gangnam Style was 9/10, we can represent that with a Beta(9,1) prior. Say that since then I’ve rated a bunch of new KPOP songs poorly, we can represent that with a Beta(10,90). Then my updated love of KPOP will be a now paltry Beta(19,91), and thus my taste from 2 years ago has been “washed out”.
Of course, my old observations weren’t weighted less in the Beta calculation, but I think the washout happens regardless of whether you down-weight those old samples.
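To make the arithmetic concrete, here is a small sketch of that update in Python. It reads the Beta(10, 90) as 10 favorable and 90 unfavorable ratings added to the Beta(9, 1) prior; the function names are just for illustration.

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate update: add observed counts to the Beta pseudo-counts."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (9, 1)  # "Gangnam Style was 9/10"
posterior = beta_update(*prior, successes=10, failures=90)

print(posterior)              # (19, 91)
print(beta_mean(*prior))      # 0.9
print(beta_mean(*posterior))  # about 0.17
```

The posterior mean drops from 0.9 to roughly 0.17, even though the old ratings carry exactly the same weight per observation as the new ones.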
An incoming sequence of natural language tokens from which I want to learn a language model, say p(word[n] | word[n-1], ..., word[n-k]). My life would be a lot easier now had I grown up with continuous examples rather than natural language :-)
The beta is nice and conjugate (and we could use it for a symbol-level Morse code example). We start with beta(theta | alpha = 1, beta = 1) as a prior, and for each observation y[n], we increment alpha if it’s one and beta otherwise. This is an easy online calculation. You can even generalize to a Dirichlet and condition on the previous words to do n-gram language modeling online. And this can be a component in training an online classifier. That’s what we did with LingPipe.
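Here is a rough sketch of the Dirichlet generalization for the bigram case (k = 1), in Python. This is not how LingPipe does it; the class name, the symmetric prior_count, and the toy sentence are all assumptions for illustration. Each previous word gets its own Dirichlet over the vocabulary, and observing a token just bumps one count.

```python
class OnlineBigramModel:
    """Dirichlet-multinomial bigram model updated one token at a time."""

    def __init__(self, vocab, prior_count=1.0):
        self.vocab = list(vocab)
        self.prior_count = prior_count  # symmetric Dirichlet pseudo-count (assumed)
        self.counts = {}                # (prev, word) -> observed count
        self.totals = {}                # prev -> total observed count

    def observe(self, prev, word):
        """Online update: add one to the count for word following prev."""
        self.counts[(prev, word)] = self.counts.get((prev, word), 0.0) + 1.0
        self.totals[prev] = self.totals.get(prev, 0.0) + 1.0

    def prob(self, word, prev):
        """Posterior predictive p(word | prev) under the Dirichlet prior."""
        num = self.counts.get((prev, word), 0.0) + self.prior_count
        den = self.totals.get(prev, 0.0) + self.prior_count * len(self.vocab)
        return num / den

tokens = "the cat sat on the mat the cat ran".split()
model = OnlineBigramModel(vocab=sorted(set(tokens)))

# Stream the tokens in, updating after each one.
for prev, word in zip(tokens, tokens[1:]):
    model.observe(prev, word)

p_cat = model.prob("cat", "the")
p_mat = model.prob("mat", "the")
print(p_cat, p_mat)
```

With a vocabulary of six words and "the" followed by "cat" twice and "mat" once, the predictive probabilities are (2 + 1) / (3 + 6) and (1 + 1) / (3 + 6).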
Getting back to washing out, note that in this beta model, the order of y doesn’t matter. Observations are completely exchangeable. If I watched ten K-pop videos in 2005, it has exactly the same effect on the model as if I’d watched them today. They don’t wash out so much as get diluted in the same way that every new observation gets diluted as N increases. So yes, they have overall less impact, but not less impact than the next observation.
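A quick way to convince yourself of the exchangeability point is to run the online beta update on the same observations in two different orders. This is only a sketch in Python with made-up data (10 ones and 40 zeros).

```python
import random

def online_beta(ys, alpha=1, beta=1):
    """Process binary observations one at a time, starting from Beta(1, 1)."""
    for y in ys:
        if y == 1:
            alpha += 1
        else:
            beta += 1
    return alpha, beta

# Hypothetical history: the ten "likes" all come first, i.e. long ago.
ys = [1] * 10 + [0] * 40

# Same observations, scrambled in time.
shuffled = ys[:]
random.Random(1).shuffle(shuffled)

in_order = online_beta(ys)
scrambled = online_beta(shuffled)
print(in_order, scrambled)  # both (11, 41)
```

Both orderings land on the same posterior, so the model has no notion of an observation being old. Getting real decay needs either explicit down-weighting or a model with time in it.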
Now that’s a different kettle of fish(ing boats). With a Kalman filter, you get a proper time series. There’s a latent state representing the position of the boat over time and each observation is a noisy measurement of that position. The Kalman filter is nice because it’s conjugate and so it’s possible to update the posterior online as new observations stream in.
But the point is that it’s a time-series model with a latent parameter for each observation denoting the true position. That lets it more effectively forget the past because everything is done relative to the last latent position.
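For comparison, here is a minimal one-dimensional Kalman filter sketch in Python, assuming a random-walk model for the latent position with known process and observation variances. The variances, the starting state, and the toy observations are all invented for illustration.

```python
def kalman_step(mean, var, y, process_var, obs_var):
    """One predict/update cycle for a scalar random-walk state."""
    # Predict: the position drifts, so uncertainty grows by process_var.
    pred_mean = mean
    pred_var = var + process_var
    # Update: blend the prediction with the noisy observation y.
    gain = pred_var / (pred_var + obs_var)
    new_mean = pred_mean + gain * (y - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var

def run_filter(ys, mean=0.0, var=1.0, process_var=1.0, obs_var=1.0):
    """Stream observations through the filter, keeping only the current state."""
    for y in ys:
        mean, var = kalman_step(mean, var, y, process_var, obs_var)
    return mean, var

# Toy data: the boat sits near 0 for a while, then near 10.
ys = [0.0] * 20 + [10.0] * 20

filtered_mean, filtered_var = run_filter(ys)
running_average = sum(ys) / len(ys)
print(filtered_mean, running_average)
```

The filtered mean ends up very close to 10, while the plain running average of the same data is 5. That is the forgetting behavior: each update is made relative to the most recent latent position, so old observations lose influence geometrically rather than just being diluted.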
You could do that with something like a time-series of preferences. This would wind up looking something like the dynamic topic models that Lafferty and Blei did.