Constraining Change in Estimates Over Time


I have a model that works for a response at a given timepoint. i’d like to generalize the model so that at the next timepoint, the amount of change in a parameter estimate is limited. I don’t think this is exactly the same as updating the posterior with new information, which I see has been asked before, because this approach weights new information more heavily and intentionally ignores that there may have been a cumulative million observations in the past that put a point estimate close to X and only 100 today putting it closer to Y. I want to let those 100 pull the point estimate closer to Y than they would if I treated them all as exchangeable observations.

Is there a way to do this, for example by automatically updating the center of the prior on the parameter? Are there other approaches that might be suggested?


Have a thumb through: and see if there’s anything you like.

That book will explain how you build models where you have an underlying latent process that is changing over time and you produce measurements off this. Something like:

y_n \sim \mathcal{N}(\mu_n, \sigma_y)
\mu_{n + 1} \sim \mathcal{N}(\mu_n, \sigma_{\mu})

The first line says you measure data y_n on each day and it is distributed around some mean value, \mu_n.

The second line says the mean at day n + 1 is normally distributed around the mean at the previous day.

Depending on how you build your model, you might want to estimate some or all of \mu, \sigma_y, or \sigma_\mu. The idea is the mean can change day to day (or whatever your time units are). That make sense?
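A minimal Stan sketch of that random-walk model might look like the following (variable names and the specific priors are my own choices, not from the book; the latent means are declared as parameters):

```stan
data {
  int<lower=1> N;          // number of timepoints
  vector[N] y;             // one observation per timepoint
}
parameters {
  vector[N] mu;            // latent mean at each timepoint
  real<lower=0> sigma_y;   // measurement noise
  real<lower=0> sigma_mu;  // day-to-day drift in the mean
}
model {
  sigma_y ~ normal(0, 1);   // example priors -- adjust to your scale
  sigma_mu ~ normal(0, 1);
  mu[1] ~ normal(0, 10);    // vague prior on the initial state
  mu[2:N] ~ normal(mu[1:(N - 1)], sigma_mu);  // random walk on the mean
  y ~ normal(mu, sigma_y);  // measurement model
}
```

With one observation per timepoint you may struggle to separate \sigma_y from \sigma_\mu, so in practice you'd want multiple observations per timepoint or an informative prior on one of them.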


Yes, that is very helpful. Thank you! This is definitely the right concept. I’m not quite sure how to implement it in Stan, though. I suppose I can store the point estimate from past runs in R and then pass that value to the Stan file next time it is run? I think that would work, but it doesn’t help someone who wants to call the Stan model from another interface. Is there a way to do this all within Stan?

So that book mixes up the models with the inference a bit, because if you’re careful about how to set up the problem there are analytical ways to get the answer (or approximations that are useful in different applications).

You’ll need to read the book and extract the probabilistic models and code them up yourself to use ’em in Stan.

Look at the AR time series models in the manual (starting on page 162 of the 2.17.0 manual). Look at how the model (the y_{n+1} \sim N(a + b y_n, \sigma) part) translates to code.
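For reference, that AR(1) formula translates to Stan code roughly like this (a sketch; the names a, b, and sigma follow the formula, and the vectorized statement covers every n):

```stan
data {
  int<lower=2> N;
  vector[N] y;
}
parameters {
  real a;               // intercept
  real b;               // autoregressive coefficient
  real<lower=0> sigma;  // innovation noise
}
model {
  // y_{n+1} ~ normal(a + b * y_n, sigma), written as one vectorized statement
  y[2:N] ~ normal(a + b * y[1:(N - 1)], sigma);
}
```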

If you do a latent state thing, the \mu values will need to be parameters as well unless you can do one of the filtering tricks in that book.

Do some experiments and take your time. This stuff can be really confusing at first.

You should be able to code these things up as single Stan models. Hopefully no need to do multiple model runs.

The multiple model runs are unavoidable, because I have new data coming in. That’s the part that makes it tricky (for me). It seems much more straightforward to say my estimate for 1987 is a function of my estimate from 1986, because then I could hardcode a certain number of years into the model, but I have to be able to run the model on 2017 and say when 2018 is over, “Here are my new estimates.” And be ready to do it again in 2019 and an indeterminate number of years going forward. I apologize if I didn’t explain that well before, but that’s the piece that’s causing me headaches.

Mathematically Bayes works that way, but practically you usually just rerun the inference with all the data. They’re the same thing, and it’s not easy to use samples from one inference in the next inference.

Do you need to do this for performance reasons?

That’s probably the main reason I can think of right now. Partially it’s just been the status quo to only run the (existing frequentist) model on the latest timepoint. Running the Stan model for a given timepoint already takes about 20 minutes, and I have ~10 timepoints already recorded, so running it on all of the timepoints seems prohibitively slow.

So how frequently would you be collecting new data and re-running your model?

It’s not a bad idea to just fit your model at all your different time points and look how things are changing (that’s the secret weapon in Gelblog-speak: ).

I think what you’re getting at is you can make plots like this but they seem super noisy so you want to smooth them out. You totally could take the separate posterior estimates and do some sort of time series model on those, but that’s totally a hack and you’d need to do some measurement error model stuff and make some assumptions to get this working.
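That hack could be sketched as its own small Stan model: treat each separate fit's posterior mean and sd as a noisy measurement of an underlying slowly-varying value, and put a random walk on that value. This assumes the per-fit posteriors are roughly normal, and all names here are illustrative:

```stan
data {
  int<lower=2> T;
  vector[T] est;           // posterior mean from each separate fit
  vector<lower=0>[T] se;   // posterior sd from each separate fit
}
parameters {
  vector[T] theta;         // smoothed underlying value
  real<lower=0> tau;       // how much theta can move per timepoint
}
model {
  tau ~ normal(0, 1);      // half-normal; pick a scale for your problem
  theta[1] ~ normal(0, 10);                    // vague prior on the start
  theta[2:T] ~ normal(theta[1:(T - 1)], tau);  // random walk
  est ~ normal(theta, se); // measurement-error model on the estimates
}
```

This is cheap because it only sees T summary numbers, not the raw data, which is exactly why it inherits all the assumptions just mentioned.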

But asking about how to limit how far the parameters change between fits makes me worry there is some subtlety to this issue. In any case with subtlety, it’s nice to just do the full model so you aren’t dealing with the complexity of stacking different algorithms together.

If it’s slow to run a full model, and impractical to just let the computer sit for a day and do its calculations, then maybe the best you can do is build the full model, see the results, build your approximation, see the results, and make sure they are roughly consistent with each other for a bunch of different simulated data?

Yeah, that might be the best I can do (parametrically approximating the previous run’s parameter estimates as a prior for subsequent runs).
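Concretely, that approximation could pass the previous run's posterior summary in as data and use it as this run's prior (a sketch; it assumes a normal approximation to the previous posterior is adequate, and all names are illustrative):

```stan
data {
  int<lower=1> N;
  vector[N] y;             // only the newest timepoint's observations
  real prior_mu;           // posterior mean of mu from the previous run
  real<lower=0> prior_sd;  // posterior sd of mu from the previous run
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(prior_mu, prior_sd);  // previous posterior as this run's prior
  sigma ~ normal(0, 1);             // example prior -- adjust to your scale
  y ~ normal(mu, sigma);
}
```

One practical knob: inflating prior_sd before passing it in lets the new data pull the estimate further than exact Bayesian updating would, which sounds like the original goal.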

Perhaps the missing subtlety is about external validity of findings. If I constrain the changes on parameter estimates (especially for rare events), the model as a whole remains useful to the client – and I truly believe more accurate, as well. If I don’t, they are prone to noticing wild fluctuations among these rare events, which can cause them to disregard the entire model.

Thank you very much for your help, by the way. I do appreciate it.

I was thinking about this again.

The easiest way to lump these models together is going to just be a hierarchical model. That gets rid of the ordering, but it’ll probably make everything easier to understand.


\mu_0 \sim N(...) <- Shared intercept
\mu_n \sim N(0, \tau) <- Per-timestep difference from intercept
\tau \sim N(0, 0.1) <- Hierarchical prior; constrain \tau > 0 and pick a suitable standard deviation
y_n \sim N(\mu_n, \sigma)

This will be easier to code up without bugs and understand, though you lose the time series component (you’re just lumping everything together).

I’m not sure I followed this last bit, though thank you again for helping me out. Maybe this is just a typo on your part or maybe I’m misreading it somehow, but haven’t you made the response centered around 0? Even if you replaced it with y_n \sim N(\mu_0 + \mu_n, \sigma), it’d still be the global plus that one timepoint’s change from the previous timepoint, wouldn’t it? It’d have to sum all the previous changes, and I’d think that gets to be a very expensive set of coefficients to estimate over time.

Perhaps a related question: does Stan run in O(parameters * observations) or is it something higher?

Whoops, you’re right.

y_n \sim N(\mu_0 + \mu_n, \sigma) is what I meant

Yeah, it’d be expensive.

It’d be a global value plus a per-timepoint offset.
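Putting the corrected version together as a Stan sketch (names are illustrative; a non-centered parameterization may help if \tau is small, but is omitted here for clarity):

```stan
data {
  int<lower=1> N;               // total observations across all timepoints
  int<lower=1> T;               // number of timepoints
  int<lower=1, upper=T> t[N];   // timepoint index of each observation
  vector[N] y;
}
parameters {
  real mu0;                // shared intercept
  vector[T] mu;            // per-timepoint offsets from the intercept
  real<lower=0> tau;       // spread of the offsets
  real<lower=0> sigma;     // observation noise
}
model {
  mu0 ~ normal(0, 10);     // adjust to your scale
  tau ~ normal(0, 0.1);    // half-normal, as in the equations above
  mu ~ normal(0, tau);
  sigma ~ normal(0, 1);
  y ~ normal(mu0 + mu[t], sigma);  // global value plus per-timepoint offset
}
```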

The computational complexity will be controlled by the complexity of evaluating the target log density of your model and the number of samples you want to generate. The automatic differentiation adds time, but I don’t think it changes the complexity.

That said, more complicated models usually need more MCMC samples, and each of those samples may be more costly to get (each sample requires integration of an underlying Hamiltonian system – by default it may integrate up to 1024 timesteps).

It’s hard to get a good sense of how the inference would scale between types of models. It comes down to “do I have enough samples?” which is just a complicated question.