Hi all,
I’m currently wrapping up an implementation of an approximate hierarchical Gaussian process (GP) model in Stan. The model itself works quite well: results make sense both numerically and visually, convergence is reached reliably and quickly on a variety of datasets, and so on. However, I’d like to provide end users with estimates of the model’s log-likelihood and an accompanying Bayesian information criterion (BIC) so that they can compare runs of the model with different parameters. From what I’ve read online (e.g., this loo vignette and this forum post), the correct way to do this is to declare a vector called log_lik in the generated quantities block and fill it in using the appropriate _lpmf or _lpdf function. Accordingly, I wrote the following Stan code to fit the model and compute the pointwise log-likelihood:
data {
  int<lower=1> M; // number of spots
  int<lower=1> N; // number of gene-spot pairs in long dataframe
  int<lower=1> G; // number of genes
  int<lower=1> k; // number of basis functions used to approximate GP
  array[N] int<lower=1, upper=M> spot_id; // unique ID for each spot
  array[N] int<lower=1, upper=G> gene_id; // unique ID for each gene
  matrix[M, k] phi; // matrix of basis functions used to approximate GP
  vector[G] gene_depths; // vector of logged gene-level sequencing depths to adjust for in the model
  vector[N] y; // vector of normalized, scaled gene expression used as response variable
}
parameters {
  real beta0; // global intercept
  real beta1; // coefficient for gene library size
  matrix[k, G] alpha_t; // transposed matrix of gene-specific coefficients for each basis function
  real<lower=0> sigma_y; // observation noise of response variable
  vector<lower=0>[G] amplitude; // vector of gene-specific amplitudes of the approximate GP
  real mu_amplitude; // mean for the amplitude
  real<lower=0> sigma_amplitude; // SD for the amplitude
  vector[k] mu_alpha; // vector of means for the basis function coefficients
  vector<lower=0>[k] sigma_alpha; // vector of SDs for the basis function coefficients
}
transformed parameters {
  matrix[M, G] phi_alpha = phi * alpha_t;
  vector[N] w;
  for (i in 1:N) {
    w[i] = phi_alpha[spot_id[i], gene_id[i]];
  }
  vector[G] amplitude_sq = square(amplitude);
}
model {
  beta0 ~ normal(0, 2);
  beta1 ~ normal(0, 2);
  mu_alpha ~ normal(0, 2);
  sigma_alpha ~ std_normal();
  mu_amplitude ~ normal(0, 2);
  sigma_amplitude ~ std_normal();
  sigma_y ~ normal(0, 2);
  for (i in 1:G) {
    alpha_t[, i] ~ normal(mu_alpha, sigma_alpha);
  }
  amplitude ~ lognormal(mu_amplitude, sigma_amplitude);
  y ~ normal(beta0 + beta1 * gene_depths[gene_id] + amplitude_sq[gene_id] .* w, sigma_y);
}
generated quantities {
  array[N] real log_lik;
  for (i in 1:N) {
    real mu_i = beta0 + beta1 * gene_depths[gene_id[i]] + amplitude_sq[gene_id[i]] * phi_alpha[spot_id[i], gene_id[i]];
    log_lik[i] = normal_lpdf(y[i] | mu_i, sigma_y);
  }
}
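For context, this is roughly how I had planned to use log_lik downstream once the draws can actually be read back in (a sketch only; fit_vi is the cmdstanr fit object created below, and the BIC parameter count is my own rough tally of the parameters block above):
# rough sketch of the downstream log-likelihood / BIC / LOO computation
log_lik_mat <- fit_vi$draws(variables = "log_lik", format = "draws_matrix")  # draws x N matrix
total_ll <- rowSums(log_lik_mat)                   # total log-likelihood for each posterior draw
ll_hat <- max(total_ll)                            # crude plug-in for the maximized log-likelihood
p_model <- with(data_list, 5 + k * G + G + 2 * k)  # beta0, beta1, sigma_y, mu/sigma_amplitude, alpha_t, amplitude, mu_alpha, sigma_alpha
bic <- -2 * ll_hat + p_model * log(data_list$N)
loo_result <- loo::loo(as.matrix(log_lik_mat))     # PSIS-LOO from the same pointwise matrix, as in the loo vignette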
I fit the model with variational inference (Stan’s ADVI algorithm) through the cmdstanr interface, as shown below (the variable data_list contains all the data needed to fit the model):
fit_vi <- mod$variational(data_list,
                          seed = 312,
                          init = 0,
                          algorithm = "meanfield",
                          iter = 4000L,
                          draws = 1000L,
                          elbo_samples = 150L)
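For reference, a quick way to check how large the CSVs written by variational() actually are (just a sketch; output_files() returns the paths to the CmdStan output files for this fit):
csv_paths <- fit_vi$output_files()  # paths to the CSVs CmdStan wrote for this fit
file.size(csv_paths) / 1024^2       # file sizes in MiB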
The issue occurs when I try to generate draws from the approximate posterior like so:
amplitude_summary <- fit_vi$summary(variables = "amplitude")
loglik_draws <- posterior::as_draws_df(fit_vi$draws(variables = "log_lik"))
Neither the summary() nor the draws() method is able to run without erroring out, failing with either 1) an out-of-memory error or 2) a “no space left on device” error. I believe this is because of the vast number of quantities the model estimates and writes out, which makes the output CSV files enormous due to the huge number of columns (a rough column count is sketched below).
Thus, my question is: is there any way to refactor either my model block or my generated quantities block to be more memory-efficient? In reality I only need draws for the amplitude parameter and (ideally) the log-likelihood; perhaps there’s a way I’m unaware of to force CmdStan to only write output for that subset of parameters? Any help is greatly appreciated!
-jack