Hi all,
I’m currently working with some spatial data (spatial transcriptomics to be exact), which is a rather new area for me. My basic idea is to use a Gaussian process (GP) to estimate the mean and variance of gene expression in order to identify genes that vary significantly across Euclidean space. The data take the following form (after transforming the counts matrix to a long data.frame
):
spot | gene | x | y | count |
---|---|---|---|---|
Spot 1 | Gene 1 | 1.3 | 0.7 | 10 |
Spot 2 | Gene 2 | 2.4 | -1.2 | 0 |
Spot 3 | Gene 3 | -3.1 | 4.3 | 2 |
… | … | … | … | … |
Spot n_s | Gene n_g | 0.5 | 1.9 | 15 |
The model I’m using for the mean \mu and overdispersion (shape) \theta is as follows, where i = 1, \dots, n_s is the index of spots (locations), g = 1, \dots, n_g is the index of genes, the subscript 0 indicates a global intercept, \gamma is a random intercept, and \ell is the length-scale of the GP:
I’d like to use the formula for the Negative-binomial variance (shown below) to obtain per-gene estimates of expression variance across space given estimates of \mu_g and \theta from the model.
I posted once a while ago about how to set up the GP model - which I believe I’ve now done successfully - but I’m still unsure of how to summarize the draws from the approximate posterior in a way that will provide me with \hat{\mu}_g, since I don’t know what the estimated parameters of the GP mean.
The code I’m using to fit the model is as shown below, assuming we have a data.frame
of the form shown in the table above called expr_df
. I’m using brms
(the most recent development version from GitHub) with the cmdstanr
(also the most recent version) backend to fit the variational inference model:
model_formula <- bf(count ~ 1 + gp(x, y, scale = TRUE, k = 20, cov = "exp_quad") + (1 | gene),
shape ~ 1)
brms_fit <- brm(model_formula,
data = expr_df,
family = negbinomial(link = "log", link_shape = "log"),
chains = 1L,
iter = 1000L,
warmup = 250L,
thin = 5L,
cores = 1L,
threads = 4L,
normalize = FALSE,
silent = 2,
backend = "cmdstanr",
algorithm = "meanfield",
stan_model_args = list(stanc_options = list("O1")),
seed = 312)
I then generate draws from the approximate posterior like so:
posterior_samples <- as.data.frame(posterior::as_draws_df(brms_fit))
I know how to extract and summarize the random intercepts for each gene along with the global intercept, but what’s confusing me is the parameters for the GP. They’re named as zgp_gpxy[m]
for m = 1, \dots, 400. I assume this is because I set the number of basis functions approximating the GP as k = 20 and there are two spatial dimensions, ergo 20^2 = 400. How should I include these parameters in the computation of \mu_{g_j} = \exp(\eta_{g_j}) for each posterior draw j = 1, \dots, 1000 ?
Any help is greatly appreciated !
Thanks,
Jack