Hilbert space GPs with distance matrices

Hi everyone,

I’m working on a model with species occurrence records and I thought it could be a good idea to account for spatial autocorrelation of the records using GPs. Our dataset has close to a million observations across 200 species so I think I will have to settle for the Hilbert space GPs. However, there’s one thing I’m uncertain of.

The Hilbert space GPs in Riutort-Mahol et al. 2023 describe uni- and multidimensional cases. My input has to be a distance matrix computed from the longitude and latitude of species occurrence records. Is there a way to fit the HSGPs using distance matrices? Is the multidimensional case applicable here?

It was suggested to me to tag @avehtari for this.

Thanks a lot!

Matt

Not really (except by mapping back to some coordinates).

That’s a lot, and this might be challenging for Stan with any spatial model. You could check if INLA software would be applicable (usually can handle much bigger spatial models/data than Stan)

You might also check out the sfJSDM() function from package spOccupancy Single-Species, Multi-Species, and Integrated Spatial Occupancy Models • spOccupancy

Disclaimer: I am not a user of this package, but I understand it to be useful in scenarios such as this.

1 Like

Thank you! I ended up speaking to Gabriel Riutort Mayol and he said the following:

To use the geographic Euclidean distance between points as an input to the HSGP model (and regular GP as well), you just need to implement a two-dimensional HSGP with latitude and longitude as the two input variables. Note that the most common way to include an input variable in a covariance function of the Matern class is by using the Euclidean distance between points. If you include two input variables in a kernel, you are implementing the Euclidean distance between points in the 2D input space (lat x lon) which is what you want: the geographic Euclidean distance between points. Then notice that you don’t need to compute and use a distance matrix.

So I’ve think I’ve since managed to implement this. Now it’s a matter of writing a more efficient model so that hopefully it’ll run… I’ll look into INLA!

HI!

Well, it all depends on the model that you want to fit. For example, if you want to fit a model with a Gaussian random field as spatial random effect (aka GP in 2D) and some covariates using Stan, then the fitting should be slow for a large number of observations (especially if your likelihood is not Gaussian).

I have never used a HSGPs model, but if you use some correlation function (e.g. the Matern function) to calculate the distance between spatial locations, then it would be interesting to obtain the posterior distribution for the parameters that governs the correlation structure in the spatial domain. In this case, the Range is an important parameter of the Matern function. However, if you are not interested in the posterior distribution for the parameters of the correlation function, then you could introduce the covariance matrix as an input in the Stan code (as Gabriel Riutort Mayol says) .

Thus, If you want to fit a Bayesian model where the correlation is based on the distance between points and you want to get the posterior distribution for all the parameters in the model, then use INLA. Otherwise, if you want to use an MCMC method to get the posteriors then you could try with TMB/tmbstan. Unfortunately, this option will be slow compared with INLA anyway.

Cheers,

Hey there, sorry I don’t think I’m really following you. I’ve now written what I believe is a multivariate form of the HSGP with two dimensions, namely lon and lat in the matrix x, using the following functions:

functions {
  // eigenvalues (2D, vectorised)
  vector get_lambda(vector L, array[] int j) {
    vector[2] lambda = square((to_vector(j) * pi()) ./ (2 * L));
    return lambda;
  }
	
  // eigenvectors (2D)
  vector get_phi(vector L, array[] int j, matrix x) {
    int n_obs = rows(x);
    int D = 2;
    matrix[n_obs, D] phi;
    for (d in 1:D) {
      phi[, d] = 1 / sqrt(L[d]) * sin(j[d] * pi() * (x[, d] * L[d]) / (2 * L[d]));
    } // d
    return phi[, 1] .* phi[, 2];
  }
	
  // spectral density (2D, vectorised)
  vector get_spd(real alpha, real l, matrix omega) {
    int M = rows(omega);
    vector[M] s = square(alpha) * 2 * pi() * square(l) * exp(-0.5 * square(l) * rows_dot_self(omega));
    return s;
  }
}

Then in my model the code is this (ignore the weird indexing due to this happening inside a partial sum function:

model {
  for (i in 1:n_spp) {
  // spectral densities
  spd[i] = get_spd(exp(log_alpha[ii]), exp(log_l[ii]), sqrt(lambda));
  // for each hypervolume
  for (j in 1:n_hyp[ii]) {
  // HSGP realisations
    f[i, j][:, 1:n_obs[ii, j]] = diag_pre_multiply(exp(log_w[ii]), 
                                                   diag_post_multiply(beta[ii, j], sqrt(spd[i]))) 
                                 * phi[ii, j][:, 1:n_obs[ii, j]];
    }
  }
}

Here, the beta, spd, and phi (the latter being computed in transformed data not shown here) components follow the Riutart Mayol paper—other than beta now being a matrix instead of a vector of standard normals to account the multivariate part—and I think/hope the log_w part is the “inverse scales” as per the Stan Functions Reference for multivariate Gaussian processes. Unfortunately there’s no references there in the manual so I’m not exactly sure if I’ve done it correctly here, but I assume w is another vector of scale parameters for each of the multivariate dimensions… would love to hear @avehtari’s thoughts on whether this looks correct!

Cheers,

Matt

Have you compared to Gabriel’s code https://github.com/gabriuma/basis_functions_approach_to_GP/tree/master/multi_dimensional/diabetes_dataset ?

Yes, that’s where I got it all from, but there’s no multivariate output example there, only multidimensional input.

Ah, I missed that. If there is no hierarchical prior, then you can just do them separately for each output. Hierarchical priors can easily be added for alpha, l, or beta

1 Like