State-space, GP best parameterization & recentering

I’ve been looking at some of the examples in the manual and on the forums for time-series models and was wondering what the preferred way to parameterize these models is.

It seems that in some cases people use:
f ~ MultiNormal(0, K(x|alpha,rho))
and others use:

f[1] ~ normal(0,alpha^2)
for(i in 2:length(f))
   f[i] ~ normal(f[i-1], c(x[i]-x[i-1],alpha,rho))

It seems that these would work out to be the same, but I was wondering whether there is a difference in computational efficiency in Stan. Also, for some covariance kernels, e.g. Matern 1/2, there are sparse representations of the precision matrix, so one could use MultiNormPrecision.
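To make the "work out to be the same" claim concrete, here is a minimal numerical check in plain Python (not Stan). It assumes the Matern 1/2 kernel is written as k(x, x') = alpha^2 exp(-|x - x'| / rho), that x is sorted in increasing order, and that the conditional mean is phi * f[i-1] with phi = exp(-(x[i] - x[i-1]) / rho). The function names are made up for illustration. Note that, like Stan's normal, normal_lpdf below takes a standard deviation, so the first point uses alpha rather than alpha^2.

```python
import math

def exp_kernel(x, alpha, rho):
    """Matern 1/2 (exponential) covariance matrix as a list of lists."""
    return [[alpha**2 * math.exp(-abs(xi - xj) / rho) for xj in x] for xi in x]

def cholesky(K):
    """Lower-triangular factor L with L L^T = K."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def normal_lpdf(y, mu, sd):
    """Univariate normal log density, parameterized by standard deviation."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((y - mu) / sd) ** 2

def joint_lpdf(f, x, alpha, rho):
    """log N(f | 0, K), evaluated through the Cholesky factor of K."""
    L = cholesky(exp_kernel(x, alpha, rho))
    n = len(f)
    y = []  # forward substitution: solve L y = f
    for i in range(n):
        y.append((f[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    log_det = 2 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * (n * math.log(2 * math.pi) + log_det + sum(v * v for v in y))

def conditional_lpdf(f, x, alpha, rho):
    """Same density written as a chain of one-step conditionals (x sorted)."""
    lp = normal_lpdf(f[0], 0.0, alpha)
    for i in range(1, len(f)):
        phi = math.exp(-(x[i] - x[i - 1]) / rho)
        lp += normal_lpdf(f[i], phi * f[i - 1], alpha * math.sqrt(1 - phi**2))
    return lp

# Arbitrary illustrative inputs
x = [0.0, 0.3, 1.1, 1.5, 2.7]
f = [0.2, -0.1, 0.4, 0.9, -0.3]
print(joint_lpdf(f, x, 1.3, 0.8), conditional_lpdf(f, x, 1.3, 0.8))
```

The two numbers should agree up to floating-point error. The joint version costs O(N^3) for the factorization, while the conditional version is O(N), though it only applies to kernels with this Markov structure.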

Additionally, using the non-centered parameterization you could rewrite the second using f’ such that f’ ~ N(0,1), and then recover f by shifting and scaling f’. This seems very similar (equivalent?) to using the Cholesky decomposition of the covariance matrix.
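For the Matern 1/2 case, the question of whether the non-centered conditional form matches the Cholesky form can be checked numerically. The sketch below is plain Python rather than Stan, assumes the kernel alpha^2 exp(-|x - x'| / rho) with sorted x, and uses a hypothetical helper name. It reads off the matrix of the linear map from the standard-normal vector z to f and compares it with the covariance matrix.

```python
import math

def noncentered_ou(z, x, alpha, rho):
    """Map standard-normal z to f via the one-step conditional recursion."""
    f = [alpha * z[0]]
    for i in range(1, len(z)):
        phi = math.exp(-(x[i] - x[i - 1]) / rho)
        f.append(phi * f[i - 1] + alpha * math.sqrt(1 - phi**2) * z[i])
    return f

# Arbitrary illustrative inputs
x = [0.0, 0.3, 1.1, 1.5, 2.7]
alpha, rho = 1.3, 0.8
n = len(x)

# The map z -> f is linear, so feeding in unit vectors gives its columns.
cols = [noncentered_ou([1.0 if i == j else 0.0 for i in range(n)], x, alpha, rho)
        for j in range(n)]
L = [[cols[j][i] for j in range(n)] for i in range(n)]  # L[i][j], f = L z

# Compare L L^T with the kernel matrix K.
K = [[alpha**2 * math.exp(-abs(xi - xj) / rho) for xj in x] for xi in x]
LLt = [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)] for i in range(n)]
max_err = max(abs(LLt[i][j] - K[i][j]) for i in range(n) for j in range(n))
print(max_err)
```

Here L is lower triangular with a positive diagonal and L L^T reproduces K, so by uniqueness of the Cholesky factor the recursion applies the same transform as multiplying z by the Cholesky factor of K. The difference is that the recursion does it in O(N) without forming K.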

I don’t have a good intuition for what would be best in terms of keeping parameters on unit scales, vectorization, and depth of the autodiff graph. I was going to start exploring these options for a model I am working on, but wanted to ask here first whether people have experience or recommendations on:

Multinormal vs. Conditional Specification
Centered vs. Non-Centered Parameterization

I just finished writing up a post to get some feedback on something I was working on for time series: Approximate GPs with Spectral Stuff

I really don’t have much experience with this stuff. I’m just trying to get a grip on what other people are doing, so I am curious as well.

There is no absolute answer – each of those model implementations can drastically change your posterior geometry, and those changes will depend on the size and structure of your data. Ultimately you start with one then check for speed and, most importantly, all of the diagnostics (especially Rhat and divergences). If there are issues then you can try another implementation.

Do you have any suggestions/intuition about how the size/structure affects things? I’m guessing considerations in the hierarchical GP would be something about length scale of the data, amount of sampling noise, number of GPs, and number of data points per process?

I switched from the conditional non-centered to the multinormal centered parameterization, taking advantage of the sparse precision matrix. For a small dataset it slowed things down by about 75%, with no noticeable increase in sample efficiency. I was a bit surprised, since with the precision matrix the derivatives don’t have to propagate as far, but there are around twice as many multiplication operations, which I guess slowed things down. However, without running this for the 24 hours or so on the full simulated data, I’m not really sure how things hold up…
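For reference, the sparsity mentioned here can be seen directly in the Matern 1/2 case. The sketch below is plain Python, not the poster's model. It assumes the kernel alpha^2 exp(-|x - x'| / rho) with sorted x, and ou_precision is a made-up name. It assembles the precision matrix from the one-step conditionals and checks that it inverts the dense covariance matrix.

```python
import math

def ou_precision(x, alpha, rho):
    """Tridiagonal precision matrix built from the one-step conditionals."""
    n = len(x)
    Q = [[0.0] * n for _ in range(n)]
    Q[0][0] = 1.0 / alpha**2  # contribution of f[0] ~ N(0, alpha^2)
    for i in range(1, n):
        phi = math.exp(-(x[i] - x[i - 1]) / rho)
        w = 1.0 / (alpha**2 * (1 - phi**2))  # inverse conditional variance
        # Expand (f[i] - phi * f[i-1])^2 * w into its quadratic-form entries.
        Q[i][i] += w
        Q[i - 1][i - 1] += phi**2 * w
        Q[i][i - 1] -= phi * w
        Q[i - 1][i] -= phi * w
    return Q

# Arbitrary illustrative inputs
x = [0.0, 0.3, 1.1, 1.5, 2.7]
alpha, rho = 1.3, 0.8
n = len(x)

Q = ou_precision(x, alpha, rho)
K = [[alpha**2 * math.exp(-abs(xi - xj) / rho) for xj in x] for xi in x]

# Q K should be the identity if Q really is the inverse of K.
QK = [[sum(Q[i][k] * K[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
max_dev = max(abs(QK[i][j] - (1.0 if i == j else 0.0))
              for i in range(n) for j in range(n))
print(max_dev)
```

Only the diagonal and the first off-diagonals of Q are nonzero, so the quadratic form touches about 3N entries instead of N^2. Whether that translates into faster sampling depends on how the density is implemented and on the posterior geometry, as the timing above suggests.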

Try to fit without any data. Try to fit with lots of data. The centered and non-centered parameterizations trade off pathologies between the latent Gaussian structure and the top-level measurement model, so you can understand when to apply each by studying those two extremes.

Thanks! I’ll give that a shot.