Modelling dependent time series with Gaussian Processes


I am trying to figure out whether the ‘Multiple-output Gaussian processes’ described in section 18 (p 265) of the Stan reference manual can be used to model coupled time series with unbalanced sampling as well (i.e., different number of observation and different spacing). Is anybody aware of a more detailed description / applications of the method described in the Stan reference?



It is possible, although it may be annoying not to be able to structure your observed data in a nice MxN array, and therefore (if I understand what the Stan reference is describing) it wouldn’t be possible to write the correlation hypermatrix as the f ~ MatrixNormal(...). So the description below is somewhat different from the one in the Stan reference manual.

Instead you would first concatenate your observations into a vector with length equal to the number of data points y = [y_1, y_2, ..., y_M] (and t = [t_1, t_2, ..., t_M]), and your K matrix would have size \sum_{i=1}^MN_i squared, instead of simply MN when N is the same for all tasks. Also there with this formulation there are (MN)^2 correlation terms in this large (2D) matrix in instead of MN^2 terms in the (3D) hypermatrix, it seems that this is because the MatrixNormal only takes into account task correlation at the same time points.

There’s a somewhat detailed description of this concatenated structure in this presentation; it includes a description of the matrix describing intertask signal covariance, but it may not include everything you need.
This other paper, discusses how to combine tasks with different kernels (in the simpler cases just different hyperparameter values for each task), which you may need if the kernels of the different processes can’t be assumed to have the same hyperparameters.

But with that done you could simply (i.e. simply conditioned on all that stuff before) model [y_1, y_2, ..., y_M] \sim MultiNormal(\mu, K).

Would it work perhaps to model this as 2 latent GPs, one reflecting the mean of the two series, and one reflecting the difference between the series, with a prior on the amplitude of the latter peaked at zero to reflect your feeling that they are coupled? If so, this gist has an example on how to code this. It should automatically handle the two series being sampled at different times.

1 Like

Thank you very much for the suggestion - I’ll dig into this!

It just occurred to me: by coupled do you mean that one signal (A) influences the other (B) with an unknown time lag? That would be a different scenario where you would want to model A as a GP and B as a GP plus an influence of A with an unknown lag.

Some of this structure may prevent you from using Stan’s cov_exp_quad function and instead have to actually write the loops over the matrix entries to get the right hyperparameters for each task (or maybe compute the submatrices and the join the blocks, but I didn’t try that). Defining the loop indices can be annoying for those concatenated vectors, but it’s doable with some patience. Good luck!

I’m not sure it makes sense to model the dependence of the time series as an explicit lag here, with gaussian processes each observation are correlated with all others as joint, multivariate normal samples, so that could be addressed with different hyperparameters or kernel types without having to explicitly add other effects.
Of course that depends on which physical process is actually being modeled, and whether a lag or any other parameter are of interest.

Okay, so it seems that this might quickly become a performance nightmare. I am now looking into doing something like because this nicely handles imbalanced data in the different dimensions.
The linear model for coregionalization does involve computing a Kronecker product though and this seems to be quite inefficient as not hard-coded in STAN (A homemade kronecker product function).

Any ideas on how to cope with this efficiently?

I’d tend to think that any performance issues are related to the actual size of the covariance matrix, not necessarily how it’s computed. The Stan reference manual first shows the covariance matrix being set up with a for loop and only later introduces the cov_exp_quad() function – I don’t know if there’s further optimization of that function, but it being C++ it’s not the worst to set it up as a loop in this case too without compromising performance.

You can definitely code the Kronecker product as a loop, you just need to add a submatrix index to keep track of the size of the previous block that was set up. That’s a simple way of doing it, but there may be smarter, faster, or maybe something hard-coded into Stan that would do the trick, I’m not sure.
Here’s something like it, although this assumes N is the same for all tasks and you’ll probably have to account for different N for each task, like you said.

matrix K(vector x, real[] hyperpars, matrix signalCovarianceMatrix, int N, int M, real sigma) {
matrix[M*N, M*N] K;
int sub_l;
int sub_k;

for (l in 1:M) {
    // possibly use specific hyperparameters for task l
    sub_l = (l-1)*N;
    for (k in 1:M) {
         // possibly use specific hyperparameters for task k
        sub_k = (k-1)*N;
        for (i in 1:N) {
            for (j in i:N) {
                K[sub_l+i, sub_k+j] = signalCovarianceMatrix[l,k] * cov_func();
                K[sub_k+j, sub_l+i] = K[sub_l+i, sub_k+j];

That should not be the performance bottleneck compared to the single-task approach, but if your data set is very big in either case it may be too intensive and some sparse approach may help.
I don’t know about the one you mentioned specifically, but there are approaches that use “control points” to constrain the posterior, so you only have to decompose a matrix of the size you would like based on the number of control points you choose (as long as it’s enough to constrain the posterior), but then it also involves computing the prediction for value where you have the actual observations, not the control points. It’s probably less costly than cholesky decomposing, but whether there’s a real gain in performance is probably more dependent on the size, again.

Until we get MPI and/or GPU-accelerated cholesky decomposition, GPs involving large (>100x100) covariance matrices will be a headache. In these cases I’ve been reverting to splines as an approximation. Demo here: Spline fitting demo inc comparison of sparse vs non-sparse

The last time I ran comparisons, cov_exp_quad was indeed faster than hand-coded computation of the covariance matrix. If you were dealing with GPs on a grid, you could get nearly the same performance as cov_exp_quad by using tricks to eliminate redundant computations, but still not quite as fast.

Well, that’s clearly a reasonable prediction, but it will also depend on other factors, like how much can you constrain parameter space with prior knowledge, and especially the amount of and how informative is your data.
I guess this is one of the promises of HMC, leveraging the gradient of the likelihood to propose more efficient transitions and increase the effective sample size and potentially reduced the total number of iterations needed.
That said I’m not sure where things are for actually using GPs with large matrices in practices, anyone else with previous experience care to weigh in on that one?