Formulating a hierarchical multivariate Gaussian process using the Hilbert space approximation

Hi all,

I’m currently working on what I think - based on perusing these forums, the Stan docs, the brms and cmdstanr docs, etc. - is a new-ish type of Gaussian process model. The model combines the Hilbert space approximation and a hierarchical structure via hyperpriors. What I’m hoping to get is a basic sanity check on my approach, as well as perhaps some guidance on how to better optimize my Stan code as I’m relatively new to the language. As of now, the model compiles correctly and the results mostly make sense, so I’m not dealing with any explicit errors or warnings.

The project I’m working on concerns spatial transcriptomics data i.e., an n_s \times n_g spot-by-gene matrix of normalized & scaled (thus approximately Gaussian-distributed) gene expression. This is then converted to a long-format data.frame with columns gene, spot, and expression having n_s \times n_g rows. Each spot also has an associated (x, y) coordinate specifying its location in Euclidean space.

As an example, consider the brain dataset from 10X Genomics made available in the SeuratData package. First I load the data, then filter out low-quality genes i.e., those expressed in <10 spots or those with a mean expression of <0.1.

# load packages 
library(dplyr)
library(Seurat)
library(cmdstanr)
# SeuratData::InstallData("stxBrain") -- run once if not run previously 
seu_brain <- SeuratData::LoadData("stxBrain", type = "anterior1")
gene_set_1 <- Matrix::rowSums(seu_brain@assays$Spatial$counts > 0) >= 10L
gene_set_2 <- Matrix::rowMeans(seu_brain@assays$Spatial$counts) >= 0.1
genes_keep <- rownames(seu_brain)[gene_set_1 & gene_set_2]
seu_brain <- subset(seu_brain, features = genes_keep)

Next, the raw, integer-valued expression data are normalized using the SCTransform tool:

seu_brain <- SCTransform(seu_brain,
                         assay = "Spatial",
                         variable.features.n = 3000L,
                         vst.flavor = "v2",
                         return.only.var.genes = FALSE,
                         seed.use = 312,
                         verbose = FALSE)

I then estimate a global length-scale \ell and a matrix of basis functions \boldsymbol{\phi} to approximate the GP as a Hilbert space like so, assuming an exponentiated quadratic covariance kernel:

# extract spatial coordinates
spatial_df <- GetTissueCoordinates(seu_brain) %>%
              relocate(cell) %>%
              rename(spot = cell)
spatial_mtx <- scale(as.matrix(select(spatial_df, -spot)))
# convert gene expression matrix to long-format 
expr_mtx <- GetAssayData(seu_brain,
                         assay = "SCT",
                         layer = "scale.data")
expr_df <- as.data.frame(expr_mtx) %>%
           mutate(gene = rownames(.), .before = 1) %>%
           tidyr::pivot_longer(cols = !gene,
                               names_to = "spot",
                               values_to = "expression") %>%
           mutate(gene = factor(gene, levels = unique(gene)),
                  spot = factor(spot, levels = unique(spot)))
# estimate global length-scale and matrix of basis functions 
M <- nrow(spatial_mtx)
k <- 20
kmeans_centers <- kmeans(spatial_mtx, centers = k, iter.max = 20L)$centers
dists_centers <- as.matrix(dist(kmeans_centers))
lscale <- median(dists_centers[upper.tri(dists_centers)])
phi <- matrix(0, nrow = M, ncol = k)
for (i in seq(k)) {
  d2 <- rowSums((spatial_mtx - matrix(kmeans_centers[i, ], nrow = M, ncol = 2, byrow = TRUE))^2)
  phi[, i] <- exp(-d2 / (2 * lscale^2))
}

The Stan model is stored in the file gp-test.stan and contains the following code:

data {
  int<lower=1> M;  // number of spots
  int<lower=1> N;  // number of gene-spot pairs in long dataframe 
  int<lower=1> G;  // number of genes 
  int<lower=1> k;  // number of basis functions used to approximate GP 
  array[N] int<lower=1, upper=M> spot_id;  // unique ID for each spot 
  array[N] int<lower=1, upper=G> gene_id;  // unique ID for each gene 
  matrix[M, k] phi;  // matrix of basis functions used to approximate GP 
  vector[N] y;  // vector of normalized gene expression used as response variable 
}

parameters {
  vector[G] beta0;  // vector of gene-specific intercepts 
  matrix[G, k] alpha;  // matrix of gene-specific coefficients for each basis function 
  real<lower=0> sigma_y;  // observational noise of response variable (normalized gene expression)
  vector<lower=0>[G] amplitude;  // vector of gene-specific marginal standard deviations of the approximate GP 
  real mu_amplitude;  // mean for the marginal standard deviation 
  real<lower=0> sigma_amplitude;  // variance for the marginal standard deviation
  real mu_beta0;  // mean for the gene-specific intercepts 
  real<lower=0> sigma_beta0;  // variance for the gene-specific intercepts 
  vector[k] mu_alpha;  // vector of means for the basis function coefficients 
  vector<lower=0>[k] sigma_alpha;  // variances for the bsis function coefficients 
}

model {
  mu_beta0 ~ normal(0, 5);
  sigma_beta0 ~ normal(0, 2);
  mu_alpha ~ normal(0, 2);
  sigma_alpha ~ normal(0, 2);
  amplitude ~ lognormal(0, 1);
  sigma_y ~ normal(0, 2);
  mu_amplitude ~ normal(0, 1);
  sigma_amplitude ~ normal(0, 1);
  for (i in 1:G) {
    beta0[i] ~ normal(mu_beta0, sigma_beta0);
    for (j in 1:k) {
      alpha[i, j] ~ normal(mu_alpha[j], sigma_alpha[j]);
    }
    amplitude[i] ~ lognormal(mu_amplitude, sigma_amplitude);
  }
  for (i in 1:N) {
    int g = gene_id[i];
    int p = spot_id[i];
    real w_i = dot_product(phi[p], alpha[g]);
    y[i] ~ normal(beta0[g] + amplitude[g]^2 * w_i, sigma_y);
  }
}

Lastly, I pass the data to Stan and fit the model using cmdstanr with the meanfield variation inference (VI) algorithm as shown below. As a note, I’m using the cmdstanr package v0.8.0 and CmdStan v2.36.0.

data_list <- list(M = M,
                  N = nrow(expr_df),
                  G = length(unique(expr_df$gene)),
                  k = k,
                  spot_id = expr_df$spot,
                  gene_id = expr_df$gene,
                  phi = phi,
                  y = expr_df$expression)
mod <- cmdstan_model("../gp-test.stan",
                     stanc_options = list("O1"),
                     threads = 2L)
fit_vi <- mod$variational(data_list,
                          seed = 312,
                          algorithm = "meanfield",
                          iter =  3000L,
                          draws = 1000L)
draws_vi <- fit_vi$draws()
draws_vi_summary <- fit_vi$summary()

Does the above approach make sense as a means of estimating a hierarchical approximate GP ? If not, in what way have I misspecified the model / assumptions / data ? Are the basic priors I’ve chosen serviceable, or should they be modified ? Does using the VI approach instead of MCMC sampling via the NUTS algorithm (for speed’s sake, mostly) make sense in this case, or is VI known to behave poorly for this type of application ? Any and all guidance is greatly appreciated !

Thanks much,
Jack

Sorry we didn’t get to this sooner. Also I’m hardly a GP expert, so I don’t know how novel this is. The people to ask are @avehtari and @rtrangucci.

I just wanted to point out that you probably are going to want non-centered priors for the hierarchical parameters. That’s a bit tricky on the lognormal scale, so let me walk you through it.

parameters {
  vector<lower=0>[G] amplitude;
...
model {
   amplitude[i] ~ lognormal(mu_amplitude, sigma_amplitude);

is probably going to be more efficient done this way:

parameters {
  vector[G] std_log_amplitude;
transformed parameters {
  vector[G] amplitude = exp(mu_amplitude + sigma * std_log_amplitude);
model {
  std_log_amplitude ~ normal(0, 1);

This gives you a non-centered lognormal model. I should put that in the User’s Guide for future reference as it comes up a lot. Then you need to do the same for the one in the loop, but that’s just an ordinary normal, so you don’t need the exp. Same for beta0.

You can also vectorize, e.g.,

beta0 ~ normal(mu_beta0, sigma_beta0);

And you really want to get that normal into vectored form with. matrix products.

matrix[M, k] phi; 
matrix[G, k] alpha;
vector<lower=0>[G] amplitude; 
...
for (i in 1:N) {
    int g = gene_id[i];
    int p = spot_id[i];
    real w_i = dot_product(phi[p], alpha[g]);
    y[i] ~ normal(beta0[g] + amplitude[g]^2 * w_i, sigma_y);
  }

can be

matrix[M, k] phi;
matrix[k, G] alpha_t;  // transposed from original

vector[G] amplitude_sq = square(amplitude);
matrix[M, G] phi_alpha = phi * alpha_t;
vector[N] w;
for (n in 1:N) {
  real w[n] = phi_alpha[spot_id[n], gene_id[n]];
}
y ~ normal(beta0[gene_id] + amplitude_sq[gene_id] .* w, sigma_y);

It may seem like more work with the allocations but loops without arithmetic or special functions re very fast in Stan and this allows the normal distribution to be vectored and the phi_alpha variable to be computed efficiently. It also limits the amount of squaring to the size ofG. Now if there is really sparse data in that N is a lot smaller than G * M, then the reduction to a matrix operation might not be a savings.

Our current implementation of ADVI is not very reliable. You might have better luck with Pathfinder. Either way, these VI methods are approximations and unless your posterior is roughly multivariate normal, they won’t be very good approximations.

If you can, try using MCMC and VI for a problem and see if they agree in areas you care about.

You can also skip the MCMC and just run posterior predictive checks directly on the VI output (we just wouldn’t expect them to pass because it’s an approximation).

1 Like

It looks like you have the magnitude of each GP twice as you have sigma_alpha and amplitude.

See also discussion in Hilbert space Gaussian process for multiple time series

Unfortunately, right now I don’t have time to examine your code in more detail.

That’s ok r.e. the time constraint, I appreciate your input ! Per my current understanding (feel free to correct me if I’m incorrect), the sigma_alpha parameter is the SD of the prior distribution for the gene-specific coefficient vector \boldsymbol{\alpha}_g corresponding to the basis function matrix \Phi, while the amplitude parameter is what scales the spatial effect of the approximate GP i.,e.:

\boldsymbol{\alpha}_g \sim \text{Gaussian}(\mu_{\alpha}, \sigma_{\alpha})

The notation I’m using for the full approximate hierarchical GP is as follows, where e_i is the scaled, normalized gene expression vector at spot i and \tau_{g(i)} is the gene-specific amplitude parameter:

e_i \sim \text{Gaussian} \left(\beta_{0_{g(i)}} + \tau_{g(i)}^2 \boldsymbol{\phi}(s(i))^\intercal \boldsymbol{\alpha}_{g(i)}, \sigma^2_e \right)

I might be messing up the notation / my interpretation of the model as I’m rather new to GPs, so I’m absolutely open to criticism of both.

Thanks again !

I really appreciate the time put into your response – it’s very helpful. I’m still a bit confused as to how to put all your great suggestions together, though they make sense individually. Does this Stan code seem reasonable ?

data {
  int<lower=1> M;  // number of spots
  int<lower=1> N;  // number of gene-spot pairs in long dataframe
  int<lower=1> G;  // number of genes
  int<lower=1> k;  // number of basis functions used to approximate GP
  array[N] int<lower=1, upper=M> spot_id;  // unique ID for each spot
  array[N] int<lower=1, upper=G> gene_id;  // unique ID for each gene
  matrix[M, k] phi;  // matrix of basis functions used to approximate GP
  vector[N] w;  // vector of 
  vector[N] y;  // vector of normalized gene expression used as response variable
}

parameters {
  vector[G] beta0;  // vector of gene-specific intercepts
  matrix[k, G] alpha_t;  // transposed matrix of gene-specific coefficients for each basis function
  real<lower=0> sigma_y;  // observation noise of response variable (normalized, scaled gene expression)
  vector[G] std_log_amplitude;  // vector of logs of gene-specific amplitudes of the approximate GP
  real mu_amplitude;  // mean for the marginal SD
  real<lower=0> sigma_amplitude;  // vector of SDs for the amplitude
  real mu_beta0;  // mean for the gene-specific intercepts
  real<lower=0> sigma_beta0;  // vector of SDs for the gene-specific intercepts
  vector[k] mu_alpha;  // vector of means for the basis function coefficients
  vector<lower=0>[k] sigma_alpha;  // vector of SDs for the basis function coefficients
}

transformed parameters {
  vector[G] amplitude = exp(mu_amplitude + sigma_amplitude * std_log_amplitude);
  vector[G] amplitude_sq = square(amplitude);
  matrix[M, G] phi_alpha = phi * alpha_t;
}

model {
  mu_beta0 ~ normal(0, 3);
  sigma_beta0 ~ normal(0, 2);
  mu_alpha ~ normal(0, 2);
  sigma_alpha ~ normal(0, 2);
  sigma_y ~ normal(0, 2);
  mu_amplitude ~ normal(0, 2);
  sigma_amplitude ~ normal(0, 1);
  beta0 ~ normal(mu_beta0, sigma_beta0);
  std_log_amplitude ~ normal(mu_amplitude, sigma_amplitude);
  for (i in 1:G) {
    for (j in 1:k) {
      alpha[i, j] ~ normal(mu_alpha[j], sigma_alpha[j]);
    }
  }
  for (i in 1:N) {
    real w[i] = phi_alpha[spot_id[i], gene_id[i]];
  }
  y ~ normal(beta0[gene_id] + amplitude_sq[gene_id] .* w, sigma_y);
}

As for the other bits, I had an inkling based on what I’ve read elsewhere that the meanfield VI algorithm might not be the best choice. Currently (on the example dataset I’m using composed of around 7.3m observations) it takes around 2500 iterations for the median ELBO to reach convergence. This takes around an hour on my M2 Mac Mini. I did try MCMC sampling as well, but it failed to finish running after being left overnight (4 chains, 4 cores, 2 parallel threads per chain). I’ll definitely try the Pathfinder algorithm as well, as I haven’t done so yet. Lastly, do you have any guidance on what sort of PPCs should be performed on the VI output (if indeed I do end up using VI over MCMC) ?

OK I realized that I made some crucial errors in the revised code I added above. Specifically, w should be in the transformed parameters block instead of the data block, and the indexing for alpha_t in the double loop was incorrect. The corrected code is below:

data {
  int<lower=1> M;  // number of spots
  int<lower=1> N;  // number of gene-spot pairs in long dataframe
  int<lower=1> G;  // number of genes
  int<lower=1> k;  // number of basis functions used to approximate GP
  array[N] int<lower=1, upper=M> spot_id;  // unique ID for each spot
  array[N] int<lower=1, upper=G> gene_id;  // unique ID for each gene
  matrix[M, k] phi;  // matrix of basis functions used to approximate GP
  vector[N] y;  // vector of normalized gene expression used as response variable
}

parameters {
  vector[G] beta0;  // vector of gene-specific intercepts
  matrix[k, G] alpha_t;  // transposed matrix of gene-specific coefficients for each basis function
  real<lower=0> sigma_y;  // observation noise of response variable (normalized, scaled gene expression)
  vector[G] std_log_amplitude;  // vector of logs of gene-specific amplitudes of the approximate GP
  real mu_amplitude;  // mean for the marginal SD
  real<lower=0> sigma_amplitude;  // vector of SDs for the amplitude
  real mu_beta0;  // mean for the gene-specific intercepts
  real<lower=0> sigma_beta0;  // vector of SDs for the gene-specific intercepts
  vector[k] mu_alpha;  // vector of means for the basis function coefficients
  vector<lower=0>[k] sigma_alpha;  // vector of SDs for the basis function coefficients
}

transformed parameters {
  vector[G] amplitude = exp(mu_amplitude + sigma_amplitude * std_log_amplitude);
  vector[G] amplitude_sq = square(amplitude);
  matrix[M, G] phi_alpha = phi * alpha_t;
  vector[N] w;
  for (i in 1:N) {
    w[i] = phi_alpha[spot_id[i], gene_id[i]];
  }
}

model {
  mu_beta0 ~ normal(0, 1);
  sigma_beta0 ~ normal(0, 1);
  mu_alpha ~ normal(0, 1);
  sigma_alpha ~ normal(0, 0.3);
  sigma_y ~ normal(0, 1);
  mu_amplitude ~ normal(0, 1);
  sigma_amplitude ~ normal(0, 0.3);
  beta0 ~ normal(mu_beta0, sigma_beta0);
  std_log_amplitude ~ normal(mu_amplitude, sigma_amplitude);
  for (i in 1:G) {
    for (j in 1:k) {
      alpha_t[j, i] ~ normal(mu_alpha[j], sigma_alpha[j]);
    }
  }
  y ~ normal(beta0[gene_id] + amplitude_sq[gene_id] .* w, sigma_y);
}

However, when I attempt to actually fit the model using the meanfield VI algorithm, the ELBO diverges immediately after the adaptation stage. When I try to use MCMC sampling instead, I get a string of errors saying Exception: normal_lpdf: Scale parameter is 0, but must be positive!. I believe based on this thread this means that, on the log-scale, one or more of the scale parameters named sigma_* is being sampled as a very large negative number, thus ending up very close to zero after it is exponentiated back to the correct scale. This is odd seeing as how the priors are set to not exhibit much variability.

So:

  1. Would utilizing different priors or reparamaterizing help ?
  2. I tried scaling the basis function matrix \Phi prior to passing it into Stan, but this didn’t solve anything.
  3. Did i just mess something up in the Stan code ? It compiles without error, but I’m aware that doesn’t mean there are no problems with it.
  4. Is it possible that the exponentiation and then squaring of the gene-specific amplitude parameter leading to extremely large values ?

Thanks again for all the input, it’s been a huge help !

-Jack

@rtrangucci do you have any thoughts on this ? i’m still struggling a bit with finalizing the Stan code.

Thanks much !
jack