Hierarchical spatial GP with uneven sampling coordinates

Hi all, for starters I’m an ecologist who happens to use Stan, so what might be obvious to a statistician might not be for me. Also, this is my first real question here…

I’m working with some forest data and we’re interested in estimating the spatial autocorrelation of samples within the plots. Some of the samples will be collected at the same coordinates in each plot, but other samples will be surveyed in the whole plot and have unique coordinates.

I simulated count data and assumed that it would be collected at fixed coordinates. I used Hierarchical Gaussian Processes in Stan (mc-stan.org) as a guide and it worked beautifully. See below for the model.

  int<lower=1> N; //total number of observations
  int<lower=1> L; // number of sampling locations in each plot
  int<lower=1> P; //number of plots
  int<lower=1, upper=P> plotID[N];
  int<lower=1, upper=L> locID[N];
  int<lower=0> y[N]; //response variable 
  vector[2] coords[L]; // coordinates
  real a0;
  //hyperparameters for ap
  real<lower=0> sigma_ap;
  vector[P] z_ap;
  //GP parameters
  matrix[L,P] zGP; 
  real<lower=0> eta;
  real<lower=0> rho;
transformed parameters{ 
  vector[N] logmu; 
  vector[P] ap;
  matrix[L,L] KL;
  matrix[L,P] GP;
  // Define the GP matrix
    matrix[L,L] K;
    K = cov_exp_quad(coords, eta, rho);
    K =  add_diag(K, 1e-9);
    KL = cholesky_decompose(K);
  GP = KL * zGP;

  //plot-level varying intercept, non-centered
  ap = sigma_ap*z_ap;

  //main model
  for(i in 1:N){
    logmu[i] = a0 + ap[plotID[i]] + GP[locID[i], plotID[i]]; 
  a0 ~ normal(1.5, 1);
  sigma_ap~ normal(0, 1);
  z_ap~ normal(0, 1);
  to_vector(zGP) ~ normal(0,1);
  eta ~ normal(0, 1);
  rho ~ normal(0, 2);
  y ~ poisson_log(logmu);

My question is, how would I write the model so that I could fit data that aren’t sampled in the same location every time? Also, for the data that are sampled at fixed coordinates, how can I accommodate values that are NA?

Thanks in advance!

1 Like

I think the answer to the second question also addresses the first, and the model is already set up to accommodate both. That is, you can pass a single coords data variable with all unique locations, in which case GP gets computed as a latent quantity that contains the model predictions at every possible location for each level of plotID, and then only actually-observed combinations of plotID and locID are drawn into the likelihood by way of the indexing into GP in this bit:

  //main model
  for(i in 1:N){
    logmu[i] = a0 + ap[plotID[i]] + GP[locID[i], plotID[i]]; 

You’re right in that the NAs don’t pose any problems, which was a big relief.

But for data that aren’t sampled in the same location every time, I’d run into memory issues if I used this model structure to record all unique locations. My solution is to use a NNGP model.

1 Like