Time series and Gaussian process

It’s because the model isn’t specified correctly. I was being nice earlier, but the model doesn’t make sense.

I’ve defined how to use GP in time series data above, with time data being one of the X dimensions. I’m happy to show you how to fit an informative GP regression model, but it would be helpful to me if you’d describe your data for me, as I’d like to know more about it before I estimate a model.

Thanks, this is why I wanted to get opinion from experts. And I appreciate your help. What is your opinion about GP with errors in variables?

Relative year - is the year from when experiment starts: 0-5
DBH - proxy for tree height in inches
There are two types of treatment: T and X.
DistU - distance to nearest untreated tree (in m)
DistT - same to treated tree with T
DensU - # of untreated trees in the 100m radius
DensT - same with treated trees with T
DistX and DensX is as above to trees treated with T.
EBYear - relative year of treatment with X.

I am really happy that with the progress.

EDIT: I am trying to run the gp_regression_ard.stan and stuck with gp_exp_quad_cov. I don’t have it in rstan 2.19.2. It is not in cmdstan 2.21.0 either. Replaced with GP1.stan (944 Bytes) Will post results.

Thanks I’ll get back with you in a week or so.

If you have a separate term for the observation error, than it probably doesn’t make sense to have a GP with nugget, because the nugget does the same thing as the observation error. At least if you have a linear link function and gaussian error distribution. In models where this isn’t the case, such as a poisson model , the effects are still somewhat similar, but not identical. Therefore an error term and nugget together can make the model unidentifyable or very difficult to sample from depending on the model structure.

Not sure what you mean.

1 Like

I meant that parameters in AR(1) may be modeled as GP. Just a thought…

Why ?

Well, for each tree the change from year to year is governed by covariates that don’t change between years so the \alpha, \beta in AR(1) should be constant year to year. But the change is too random to be explainable just by normal noise. So I thought to model the change as GP. In fact I tried AR(1) without a lot of success. Frankly I am looking for a model that makes sense, describes the data well, and makes good predictions for urban forest. Therefore I am open to various models. Unfortunately I am unable to try all of them…

Really strange. Can you upload your version, in command Stan of ‘/stan/src/stan/lang/function_signatures.h’?

Should be there, I’ve added this to the language many months ago. It should be there unless someone’s deleted it.

Hi @linas can you please also tell me if you remember whether you’ve cloned or installed from tar for command Stan?

@linas

Earlier, you mentioned that when a trees canopy starts to die, you say that the tree is dying and you stop taking measurements.

  1. Do you have a measurement of canopy lost?

  2. Do you have a unique identifier for all of the trees? Typically in this case we can use a time varying survival model, which would lend it self well to use a GP prior. But usually, we have identifiers for the trees (or patients). If you don’t, it makes it hard. I could use an ad-hoc metric, such as tree height, to determine which tree is the same, but this is not ideal, since the trees probably grow each year, and due to measurement error, etc. So I’m a bit crippled without this information.

In summary, canopy lost would be nice, but not crucial if we have unique ID’s for the trees. These facts change how I need to model the data.

I think it is my fault. I usually check for syntax errors with stanc_builder in R and my rstan is 2.19.1. Sorry for the false alarm.

1 Like

@linas

No worries! Just have to push to get my software out. More important is the question about the trees data, above.

Also - I notice you’ve scaled the data (years and tree heights are in inches). It would be beneficial for me if I could have the data on the original scale (I’ll rescale and un-rescale accordingly).

We call it thin=canopy lost. It is a visual observation. Thin is 0, 10, …, 100.

Yes. Survival model would be awesome.

Relative year = 1 through 5
dbh is in inches
distance in m
density is in 1/m^2
EByear is 0 or 1. Large trees are treated every other year. The insecticide is very strong and insects die instantly. Trees < 15" are treated each year with different insecticide which is not so strong and insects just get drunk/lost. EByear is the year strong treatment is done.
Data is attached. Thanks for looking into it.
data.csv (35.0 KB)

EDIT: Can you please critique why this model is miss specified?
YT_t\sim multivariate\_normal(YT_{t-1},K(X_t|\alpha,\rho))
Y_t\sim normal(YT_t,\sigma)
where YT_t is the vector of true values of canopy loss for each tree while Y_t is the vector of observed values of canopy loss for each tree, t=1,…,5, YT_0 is an estimatable parameter vector, \sigma observation error, \alpha is the magnitude of sq. exponential kernel K, and \rho is a vector of length scales for each covariate. X_t is a design matrix for each year t.

1 Like

Great, thank you for all of this additional info. It’s really helping me understand the problem. Especially the weaker insecticide, anything is helpful.

I know this is not the dataset, but it would help if we had the relative day, you can excluded identifying information but it would be more realistic if we’re modeling time dependency and we have the day/month. I’ve made the mistake of assumed even spacing when it wasn’t, and it made some silly inferences, we better include this.

Also - I’m at a phone right now. Haven’t looked at the csv yet, but 5 is 5 years ago, 4 is for years ago…etc, correct?

Yeah so a few things. There’s a lot of small details that can go wrong if you’re not careful. So the written model has a few weird things. Usually capital Y is a matrix. And YT would be matrix multiplication. So that’s a notation mistake. Also, generating your covariance matrix, you have K(X_t) as covariance matrix and (correcting the subscript mistakes, I think you mean) Y_{t-1}, so we’re modeling the past with the covariance of the future. Which won’t be possible until we figure out time travel. So that doesn’t make sense.

But an AR model with the covariance of the past
I don’t think is unheard of. I think I flipped through INLA to check out some models and (I think) there was something like y_t = f_{t-1} + \eta, so I don’t think that’s unheard of, if that’s what you meant to write down.

But yeah the the second like just doesn’t make sense. Like, we have matrix multiplication with T as the mean function? And what’s the different of all Y and Y_t? No idea. Not to say your idea isn’t possible. Just the math isn’t accurate to me. And how it interacts with the other part doesn’t really make sense.

So - Sometimes it helps to just write it in words and then we write the model accordingly.

Ok… and then, if that’s what you meant, and we have the code you posted above, the code totally doesn’t at all correspond to your model. After the code corresponds to your model, we can can use the loveliness that is Statistical Computation to realize that our model is not unified with the data generating process. Indicators that your model is not unified with the generating process are: a) prior predictive checks, b) posterior predictive checks, c) the folk theorem, d) and of course divergences/HMC diagnostics (convergence, effective sample size, etc).

And I can elaborate on 1-4 if you need, I find the vocab confusing sometimes.

So in summary for all of this so be fruitful we need:

  1. specific research question and understanding of the data, and collecting, and process
  2. a model that answers the research question and is mathematically accurate
  3. to satisfy the Bayesian computation
  4. model checking

(3,4 are a,b and c).

Which is cool because 3-4 can be an indicator if there’s something not quite right in step 1-2.

EDIT: I’m looking at this now. So EBYear is an indicator of whether the tree was big, and thus treated every year, so if it’s 1 it was treated this year (and also big) and 0 otherwise?

EDIT 2: Along with exact date of administering the pesticides, do you have the trees location geocoded? That would be very helpful.

EDIT 3: I realize you’ve answered my question about years. But also - I’m assuming treatment was administered at the time you went out into the field to take measurements? How and when did you administer the pesticide?

Date is in Aug each year.

Thanks a lot for the critique. Yes, my intention was to answer 1-4. The research question is to take completely new placement of trees and model tree health over time assuming treatment with strong insecticide every other year and weak insecticide every year and assuming only a fraction of trees are treated.

Yes, idea for this formulation came from AR. I have modified code a bit but would would like to have a model that makes sense. YT and Y are vectors while K is a kernel (matrix). Sorry, I didn’t follow the convention.

If you could I would appreciate with leading through a-b-c-d. Mixing visually is good but I would like to generate all other key items.

Once more, thanks for your help. I am curious if survival models would be a good choice - I am not an expert. BTW, the normal error may not be realistic - the error is low when thin is close to 0 or 100 and is highest when thin is 50. I have tried beta but data has 0s and 100s. Zero one inflated eta maybe?

You’re not asking the right questions. Should be “what is the objective and what do we want to learn?” And we build models around that.

I’m muting this post and I’ll tag you when I have the first part done, which is a model to predict thin.

First, think about what would like to know and draft a very concrete statement around that, and we work with that. I’ve been trying to pull information regarding ecology, but you keep getting distracted with modeling. You’re not thinking about what these models represent. For now, please describe what information we’d like to obtain.

If you draft a research statement, feel free to tag me. No longer than one page. 1-2 sentences per concrete goal.

Thanks for working with me, and giving me such an interesting problem.

Also, guys, never said statistical computation was easy, relax.

1 Like

Glad the problem is interesting.

There are many objectives. One of them is to predict what treatment is necessary to keep the canopy loss of untreated trees under control, i.e. that canopy loss after x years of treatment is under some threshold with a high probability (95%). While data is gathered for sites A/B/C we want to make predictions for some virtual city. My approach is to build the model of the canopy loss dynamics and then run Monte Carlo simulations for different tree placements in the city. I am open to ideas. We assume that we treat high trees with strong insecticide every second year and small trees with weak insecticide every year.

  1. What % of trees have to be treated in the virtual city to keep canopy loss of untreated trees under control?
  2. What is critical distance between treated and untreated trees for treatment have an effect on untreated trees?

Thanks for this response and your time invested in me, and thank you for the learning opportunity. I’ll spend my free time working on answering 1 and 2, and I’ll get back with you when I have something useful.

Yes, trees above 15" received strong insecticide every other year.

Yes, lat/long is recorded. Let me know how you will use it and I will try to provide this info to you.

No, the treatment was applied in June and observation was performed in August. The root cause for dying trees is larvae that eat the tree from inside so tree doesn’t get food. Insects are just producing the larvae. June is chosen for treatment because it is time insects are mating so by reducing insect population we effectively reduce larvae population. August is chosen for observation because it is good to estimate damage from larvae by that time. If you go further it is indirect measurement of larvae population. I think there is a way to think about it as controlling the spread of infectious disease.

In short, in a GP we can include lat/long as two dimensions in the process. Right now, I have density of treated trees in a radius of tree 21, but the radii could overlap, and this is potentially useful.

I still haven’t decided yet how I’ll use this information.