Time series and Gaussian process

In stan user’s guide a latent variable GP is defined as:
F\sim multivariate\_normal(0,K(X|\alpha,\rho))
Y\sim normal(F,\sigma)

In my problem I have data for multiple years: t=0,1,…
So my model is:
YT_t\sim multivariate\_normal(YT_{t-1},K(X_t|\alpha,\rho))
Y_t\sim normal(YT_t,\sigma)
where YT is a vector of true values and Y is a vector of observed values.

This looks like GP decomposed for every year. Does such a model make sense? Are there any references I can look at?

In addition, is there a way to sample GP using MPI?

Thanks.

1 Like

Just clarifying: you have observations at multiple location on the variable X, but these are also made in each of several years, and you want the mean of the previous year to affect the mean of the subsequent years?

Well, kind off. There is longitudinal data of visually observed canopy for multiple trees for multiple years. I would guess that for each tree the true value of canopy at t=1 depends on predictors such as tree height but also on true value of canopy at t=0.

@linas Hey! So don’t take the “Stan Developer” tag so seriously. Not an expert. Don’t know much about trees, haven’t taken any serious statistics classes either.

I think it really depends on asking “what would I like to learn?

Common way to model time series with a GP, might not be useful in your case:

But in time series data, you could generate a covariance matrix k(x', x) using x =[1,2,3,...,T]', but if you use a squared exponential covariance function, and use for example, tree height as y, the GP time series model y \sim GP(0, k(x',x)) would just give you a smooth basis function that intersects the tree height at each point in time, which would make a pretty plot, but might not be particularly informative (you could plot the data and learn just as much).

Potentially more useful:

One thing that you could do… Say we want to learn something about what influences tree height. One simple model we could consider is an AR model, which would give us an average rate of change over time about how some covariate, say water (again, don’t know what canopy means without wikipedia), influences tree growth.

But hey, there’s over watering, and under-watering, right? I’ve killed plants by both over watering and underwatering them. Possibly a non-linear relationship. Could be a good place for a gaussian process prior on the AR regression coefficient. This way, we could capture that non-linear relationship that over watering, or underwatering, at time point T-1 has at time T.

So the “pseudo code” would look something like this.

For t in (2,3,…T):

  1. Generate the water covariance matrix for all trees at time T-1.
  2. Sample from the GP (multivariate normal), where f_{water} is one of the AR regression coefficients at time t-1, i.e. y_t = \beta_{water,t-1} + \eta, \beta \sim GP(0, k(x',x))

I’m definitely omitting/fudging some computational and statistical details, but the principal is there. Real model would require more work and thought. Also, this could potentially not be possible. I’m abusing notation too.

I hope this helps. It would help me if I could have more information. If the dataset is “open source,” I think it would be a good time to have a back and forth and write a model. Other people looking around could see how to do it them selves. Do you have a comprehensive list of what’s in the dataset? What would you like to learn?

Disclaimer: my suggestions could be awful!

It’s really on me that there’s no good examples on how to apply this tool. I’ve been slacking.

Definitely possible but I remember it being lots of work. I think we have GPU released to users for Cholesky decomposition, which will do a lot of the heavy lifting for GP models. The big bottle neck here is using Cholesky decomp to sample from the posterior. I remember it taking up something like 2/3 of iteration time for a GP regression. Also, I think the GPU was made easier for users because it abstracts all the computational details away. It’s really not a big deal though unless you have a giant covariance matrix, or a bunch of covariance matrices though. The GPU devs are mainly Rok, Steve and t4c1. They would know.

But yeah, the biggest question is what do you want to know? Also, what’s possible with the information you have.

Andre

2 Likes

You guys are very creative. The setting is very simple. We have ~ 100 trees and each year make observation how well each tree is covered by leaves visually. There are insects which leave larvae which eat the tree and destroy it. We treat some trees but not all. Objective is to generate predictive posterior for NEW untreated trees.

I have used GP as in my previous post and predictive posterior for EXISTING untreated trees does very well. Green vertical lines correspond to 95% CR, red x correspond to actual observations. Thin=100%-coverage by leaves. I am looking for even better models. In the latent GP example in Stan’s User Guide there is a small nudge in transformed data block delta=1e-9. I made it as a parameter to estimate - don’t know if this is OK. I can share the data if interested.

Sounds like a counterfactual prediction.

Can you be more specific? Always share the stan code or at least write out the probability model. This is a bit vague or unclear to me.

Really, I have no idea. Not my area of expertise and I’m still unclear as to what the model is. Can you specify all covariates and the exact latent GP model? This could mean non Gaussian likelihood, or could mean you specify a GP prior on a parameter. I’m still only guessing.

Sounds like you want to generate the cross covariance matrix, and generate the posterior predictive for out of sample data. So, perturb some covariates, and compute cross covariance between x and \tilde{x}_*, using the specified covariance matrix, and the prediction equations you can find in GPML.

Not sure, can you share the model?

Andre

EDIT: modeling spatial dependence may help here? Also deleted “but yes”

1 Like

Many thanks for helping me out.

GP.stan (4.2 KB)

Can you specify all covariates

For each untreated tree covariares are: indicator of tree height, density of treated trees in the 100m circle, density of untreated trees in the same area, distance to nearest treated tree, distance to nearest untreated tree.

Thanks for sharing. I’ll set a reminder and take a look tomorrow. It’s late Saturday night where I am and I’m not patient enough to look through this in detail right now.

Andre

1 Like

Can you also post the code for the plots? This will help.

Andre

Thanks for your help.

The code is below. This function draws the graph for each year. y_center=0, y_scale=1, la - posterior, PROBS=c(0.025, 0.5, 0.975), Ntree - # of trees for the year. As trees loose all canopy(leaves) they are declared dead and taken out of count.

plotFit <- function(Ntree, PROBS, la, year, y_scale, y_center, Y)
{
  q <- array(NA, dim = c(Ntree, length(PROBS)))
  for (tree in 1:Ntree) 
    q[tree, ] <- quantile(la[, 1, paste0("YT", year, "[", tree, "]")], 
                          probs = PROBS)
  xlim <- range(1:Ntree)
  ylim <- range(q, Y) * y_scale + y_center
  idx <- order(Y)
  plot(0, 0, col = "white", xlim = xlim, ylim = ylim, main = paste("Year", year),
       xlab = "Observation", ylab = "Predicted thin (%)")
  for (i in 1:length(idx)) {
    colCov <- "blue"
    if (q[idx[i], 1] <= Y[idx[i]] & Y[idx[i]] <= q[idx[i], 3]) colCov <- "green"
    lines(rep(i, 2), c(q[idx[i], 1], q[idx[i], 3]) * y_scale + y_center, 
          col = colCov)
    points(i, Y[idx[i]] * y_scale + y_center, pch = 'x', col = "red")
  }
}
1 Like

Yeah, I wrote some stuff down, but this is really out of my domain. All I can say is that since you’ve generated the all the covariance matrices separately, I believe you’re modeling them time independent. You might want to consider modeling the time dependency explicity, For the mathematical formulation, I think you could find answers in the Spatial Stats literature.

Also - I believe for plots you’ve produced, you’re using YT as the predicted values. You’re using the training samples predicted values. Obviously this performance will be good. One simple way to evaluate performance is to split up your data and test on values against true values, using generated predictions using the cross covariance from training and test data. I explain this is in a failed case study, you can find the equations and code there.

Scratch the other model.

Let’s try this. For simplicity, we first omit the time component. We can add this later. So our outcome of interests is canopy destroyed, or whatever be it. We should be able to handle this, omitting any time component with a plain-old, Gaussian Process Regression, with one covariance matrix. First, take a subset of your data and train the model on that. Forget MPI. The model was probably running slowly because it was ill specified anyway. Your observations are not so much that you need to use it any way (500x500 covariance matrix is fine, should be able to run in minutes).

You should be able to handle this with the code I’m uploading below, with no modifications.

After we get this working, we can figure out a way to include the time component, sound good? I really haven’t done this, but it’s a good learning experience.

gp_regression_ard.stan (1.7 KB)

And so we’re learning. What I think you’ve plotted, is the training error, which will generally be pretty high. The generated quantities block shows us how we can check to see how well our model works on, as you say, NEW untreated trees.

Thanks a lot for your help. I will give it a whirl.

Intuitively the canopy lost each year should depend on the covariates such as tree height, distance to treated and untreated trees (treated trees help to kill insects, untreated trees increase the probability of getting infected) which are constant year after year but also on initial conditions - how much canopy was left in previous year.

BTW, if interested, GP.data.r (57.1 KB)

I’m just happy to be useful. If the GP Stan code is too much I think BRMS has GP regression now.

Yeah I think the the former is straightforward to model. The later part (time dependency of canopy) I have no idea, but I’m happy to look into it.

Thanks, I was going to ask. Can you please just share the plain csv/txt file, not the R data file, with out all the “shard” stuff? I can’t promise anything immediately, because I have projects due, but I’m happy to help.

I’ll post all the code/plots of everything I produce, so that way you have the resources to produce it on your own.

Also might be useful is the document I wrote up in the first post here: Applied Gaussian Processes in Stan, Part 1. A Case Study

2 Likes

I am lurking here but we (as in me and work) are building out a similar model for canopy and cover. I hadn’t thought about using GP’s for this. So I am playing catch up. Hopefully I can contribute to this as well.

Are you working on EAB or something else?

data.csv (75.6 KB)
Note: just ignore first column. The second column is a time component so hopefully I will be able to use your code without modifications. Still I would like to compare my model with your model. How would you suggest?

Broadly we are working in riparian system with tanking groundwater and cm scale vegetation cover data. But we have a focused data set on saltcedar and tamarisk leaf beetle.

I’m not an expert. I would suggestion checking out the GP regression in brms, so that you don’t have to deal with the Stan code, writing these models is confusing and can end up erroneous if you’re not very sure what you’re doing. I really don’t think your model is implemented correctly. I’ve written some C++ code for this project, and I’m familiar with the basics of GP models. I don’t know anything about model validation, just basics. So I can’t really advise.

I think it’s best if I just show you.

A few questions:
Can you please describe in detail what all of the columns mean? relativeYear, desnU,… etc. Is this all in meters?

How do you measure how canopy is left, thin? And is minDistance the distance to a treated tree?

1 Like

I haven’t read the entire thread, but saw a few points I can comment on.

I guess you’re talking about the 1e-9 that is added to the diagonal of the covariance matrix. That is just to prevent numerical problems, it has nothing to do with the content or interpretation of the model. Therefore it should not be made estimable, but hard-coded.

If the value at t=1 depends on the value at t=0 and so forth, this is called an autoregressive model (because the value is a predictor for itself at a later timestep). The manual has a section about this model type. This is an different way to incorporate temporal autocorrelation into a model. Whether this fits your use case better than using a gaussian process, I don’t know.

3 Likes