Dear forum,
I am still quite new to Stan and I am trying to model time dynamics in longitudinal data using Gaussian Process. In my data, I have N subjects that were sampled T times and for each time point I have P measures. For now I don’t have any missing data (and I know this will be a whole separate issue). I looked at Stan documentation (10.1), but still couldn’t figure out how to specify the model.
Any point to the right direction will be greatly appreciated.
Thank you!
Are you looking for a full Stan file of a GP model to use as a template? Did you check out section 10.3 of the user guide? Looks like section 10.3 of the user guide has a couple of those.
I have seen it, thanks! (I have spent many hours on this page) But I am having trouble relating my data to the examples…
So what you have is what I call a hierarchical GP scenario where you can do a GP for each of your N subjects and likely want to model said GPs as deviations from a mean-across-subjects GP. Before looking at a full hierarchical model, you should first get a feel for what it would be like to do a GP for a single subject. I have a demo here for that. Then, for the hierarchical case, take a look here.
Note that in those examples I deviate from the Stan User Guide’s recommendation to parameterize the GP in terms of lengthscale, instead using inverse-lengthscale (a.k.a. “volatility”) but it should be straightfoward to re-parameterize if you prefer lengthscale. I should probably do this myself at some point as I understand that more expert folks than I have worked out what priors one needs for lengthscale to make GPs behave well.
Thanks! In your gp_regression.stan code, would it be right to say that rows_z_unique could represent time?
It’s best if you use the “download zip” button, unpack on your computer, open the gp_regression_example.R
file and step through that, as the comments there will explain how the model is being set up. It’s actually structured to accommodate more complicated designs than you’ve described so far where there is some set of conditions in which each subject is measured, and that’s what’s the z
matrix business is all about. For your case, you’d just have an intercept-only contrast matrix, generated as:
z = model.matrix(
data = dat
, object = ~ 1
)
In both models, x
would correspond to time in your data.
Ty! It’s going to be a long day of code digging. Thanks again.
Oh, and you mentioned missing data in your original post. Note that the way I have things set up, it automatically accommodates missing data as it estimates a latent noiseless GP that is then sampled with Gaussian noise; when you’re missing data for a timepoint, you still get updating on it’s latent value thanks to it’s surrounding timepoints.
Cool, that’s super helpful! Do you have a paper describing your model that I can refer to?
I developed this approach myself while advising my friend, who wrote up his work in a Masters thesis here, and it looks like they also published here, though that seems to be just a conference abstract.
You might also find https://jtimonen.github.io/lgpr-usage/index.html useful if you don’t want to code things up in Stan yourself
Cool, I meant to look at lgpr when it first came out but never found the time. It does hierarchical scenarios? (a mean function with multiple individual deviation functions?)