I am currently looking at a set of time series that are clearly more or less (but not equally) correlated.
Call them A, B, C, D, [up to 12 of these, each of 50-60 points, but I can make a principled retreat to 6]
I am trying to build a GP to model this set of time series.
I can define the basic input vector as (T, V) where V is an indicator vector for the time series, but this does not work. Seems obvious that this is because the various time series are differently correlated - A[t+1] is obviously not the same distance away from A[t] as is B[t], which is at a different distance than C[t), but, alas, cov_exp_quad assumes an equally weighted metric.
My idea (the obvious one) is to put a covariance matrix Q into the model, and replace the indicator vectors with the projections from the cholesky decomposition of Q. Then cov_exp_quad should work as before, but with a weighted distance function, and I can fit the covariance along with everything else.
Now I assume that I’m not the first person to encounter this problem, so here are my questions:
Is this the best way to do this? If not, what is the best way? (Is there another way? Am I missing something obvious?)
What is the best way to implement this - I can see at least two, and this is a big enough model that the computations cost is a material issue, so I thought I would mine the experience of people who have more of it than I do.
Is it that you have four processes that you’ve measured and you think they are all perturbations away from one big underlying GP? @rtrangucci has a big hierarchical election GP model here: GitHub - stan-dev/stancon_talks: Materials from Stan conferences (Hierarchical Gaussian Processes in Stan). I could be wrong but I vaguely recall it did something like that.
Could you describe a little more what V looks like?
V is just an indicator vector the length of the number of time series, with the indicator for the particular time series set to 1, and everything else set to 0
thus, e.g.,. we define X = <t,0,0,0,1,0,0> for the location of the value for the fourth time series from a set of six at time t.
I just saw multiple time-series and GPs so my first thought was multi-output GPs. You may want to look in to that. Alvarez has a good review on the topic.
In case you want to try with a stochastic differential equation approach, ctsem in R will make a stan model for it. Would be pretty close to the default model, so can post the code if you want.
install.packages("devtools")
library(devtools)
install_github("cdriveraus/ctsem") #cran version is due for an update...
library(ctsem)
n.latent <- 6 #number of processes
n.manifest<- n.latent #number of process indicators
varnames <- c(paste0('Y',1:n.manifest)) #n.manifest column names from data
dat <- longdatafile
time <- 1:nrow(dat) #unless time is already in data file
id <- 1 #modelling 1 subject
model <- ctModel(n.manifest=n.manifest,n.latent=n.latent,type='stanct',
LAMBDA=diag(n.latent), #fixed latent to manifest loading matrix
manifestNames = varnames, latentNames=varnames)
dat[,varnames] <- scale(dat[,varnames]) #otherwise priors may need to be adjusted...
model$pars[model$pars$matrix=='DRIFT' & model$pars$row != model$pars$col,'value'] <- 0 #not estimating directional effects between processes
model$pars$indvarying <- FALSE #because only 1 subject
fit <- ctStanFit(datalong = dat, ctstanmodel = model,iter = 500, chains=3,cores=3)
summary(fit)
cat(fit$stanmodeltext)
plot(fit)