Spatial hierachical negative-binomial model formulation

Hi all,

I’ll start with a brief biological motivation. I’m working with spatial transcriptomics data, which is composed of “spots”, each with an x and y coordinate location and a measured, nonnegative value for gene expression. Gene expression is known to follow a negative-binomial distribution (see the supplemental material of this paper for a discussion of such), and i’m interested in modeling the gene-specific mean and variance of expression using brms (or just Stan if utilizing brms ends up being impossible).

The issue is, i’m unsure of how to account for the spatial component in my model formulation, as I’m not very experienced in spatial analysis. The data are in long format with five columns: one each for gene name (denoted gene), spot ID (spot), x location (x), y location (y), and integer expression (count). The response variable of interest is gene expression, and I’d like to fit a distributional regression model that models the parameters of the negative-binomial distribution \mu and \theta as functions of spatial location with a random effect for each gene. The goal is thus for each gene to sample from the posterior distributions of (and then summarize) the mean \left(\mu_g\right) and variance \left( \mu_g + \frac{\mu_g^2}{\theta_g}\right) of expression while pooling information across genes and accounting for spatial relationships between spots.

I read this bit of brms documentation relating to the spatial conditional autoregressive (CAR) structure, but it appears that the dimension of the adjacency matrix must match that of the data to be modeled, which is not the case with my data as there are many observations (spots) for each gene. Next, I tried using a Gaussian process on x and y via the following formula (ignoring modeling \theta for simplicity), but the model failed to converge and any attempts to tinker with it resulted in errors:

model_formula <- brms::bf(count ~ 1 + g(x, y) + (1 | gene))
brms_fit <- brms::brm(model_formula, 
                      data = count_data,  # assume count_data contains the relevant dataset
                      family = brms::negbinomial(link = "log", link_shape = "log"),
                      chains = 3L,
                      iter = 1000L,
                      warmup = 250L,
                      thin = 5L,
                      cores = 3L,
                      threads = 2L,
                      normalize = FALSE, 
                      silent = 2,
                      backend = "cmdstanr",
                      algorithm = "meanfield",
                      stan_model_args = list(stanc_options = list("O1")), 
                      seed = 312)

Lastly, I considered using the following formula (once again ignoring \theta for simplicity), but I’m unsure if using random slopes for the location coordinates accurately accounts for the spatial relationships between spots:

model_formula <- brms::bf(count ~ 1 + x + y + (1 + x + y | gene))
brms_fit <- brms::brm(model_formula, 
                      data = count_data,  # assume count_data contains the relevant dataset
                      family = brms::negbinomial(link = "log", link_shape = "log"),
                      chains = 3L,
                      iter = 1000L,
                      warmup = 250L,
                      thin = 5L,
                      cores = 3L,
                      threads = 2L,
                      normalize = FALSE, 
                      silent = 2,
                      backend = "cmdstanr",
                      algorithm = "meanfield",
                      stan_model_args = list(stanc_options = list("O1")), 
                      seed = 312)

I imagine this type of model might be too complex to fit using just brms, but I’m very open to using plain Stan as well. In addition, if anything I’ve posited is incorrect please correct me ! I’m also happy to provide R code to load and pre-process the data if it’s helpful, but I figured I’d omit that for now to keep things simpler as this is already a rather complex question.

Thanks much,
jack

1 Like

I would start with a Hilbert space approximate GP with an exponential kernel, just as a simple model to get started. The latest GitHub version of {brms} can handle a larger variety of GP kernels, so something like gp(x, y, scale = TRUE, k = 20, cov = 'exponential') should work. It may also make sense to not assume isotropic smoothing, so you may want to set iso = FALSE. Given the log link, it will often also make sense to put a more reasonable prior on the marginal SD parameter of the GP. If this model samples reasonably well, you can then start to expand up to more complexity ( higher k or Matern kernels dor example).

Hi Jack, thanks for the clear explanation of your case. I would strongly recommend to first fit a spatial model for only mu and to assume that the shape parameter that governs the overdispersion to be the same everywhere and for every gene. NB models can be very badly behaved when combined with random or spatially correlated effects as the overdispersion parameter has so much power to explain residual variation. Maybe from there you try to see if you really need to let the shape parameter vary by gene and/or spatial location?

Thank you so much for the thoughtful reply - I installed the latest version of brms from GitHub, and also updated CmdStan to 2.36.0. I then tried the following model:

model_formula <- brms::bf(count ~ 1 + gp(x, y, scale = TRUE, k = 20, cov = "exponential") +  (1 | gene), 
                          shape ~ 1)
brms_fit <- brms::brm(model_formula, 
                      data = counts_data,
                      family = brms::negbinomial(link = "log", link_shape = "log"),
                      chains = 3L,
                      iter = 1000L,
                      warmup = 250L,
                      thin = 5L,
                      cores = 3L,
                      threads = 2L,
                      normalize = FALSE, 
                      silent = 2,
                      backend = "cmdstanr",
                      algorithm = "meanfield",
                      stan_model_args = list(stanc_options = list("O1")), 
                      seed = 312)

This model converged well and was actually rather quick to fit, which was a pleasant surprise. I’m open to adding more complexity as you suggested, but I’m not quite sure how to evaluate whether doing so is necessary. Any pointers would be super helpful, and thank you again !

Thank you ! I tried to be as clear as possible, especially given that this forum is not biologically-focused. I think you’re right about the shape parameter unfortunately; as shown in my reply to the other commenter I’m going to try modeling it as just a global intercept and go from there.

Good to hear. Are you sticking with vatiational approximation? I tend to find it performs poorly for latent smooth/GP terms. But if look at posterior predictive checks and consider bumping up k, and also trying Matern kernels

Ah that’s good to know r.e. the potential drawbacks of the variational approach. I’m a bit biased towards keeping the VI method as opposed to sampling as it’s quite a bit faster, and my data are really large (tens of thousands of genes, thus tens of millions of rows after conversion to long format). I’ll play around with it though and try out various implementations of each method.

One last question, if you don’t mind helping – how would I go about computing the per-gene mean \mu_g after sampling from the posterior ? Here’s how I currently do the sampling:

posterior_samples <- as.data.frame(posterior::as_draws_df(brms_fit))

This results in 1000 samples from each of 507 parameters. I know what most of the parameters are e.g., b_Intercept is the global intercept for \mu, r_gene[geneA, Intercept] is the random intercept for geneA, etc. But I am unsure of how to incorporate the parameters for the GP e.g., sdgp_gxy and lscale_gpxy into the computation of \mu_g. Any guidance on this would be greatly appreciated !

Edit: there are other GP parameters that I failed to notice since they were the last columns in posterior_samples. The parameters are named as zgp_gpxy[i] for i \in 1, \dots, 400. I assume these should be incorporated somehow when computing \mu_g, but there’s nothing in the brms docs on how to do so and looking on the brms GitHub didn’t help much either.