Estimating log-likelihood in an Gaussian process model

Hi all,

I’m currently wrapping up an implementation of an approximate hierarchical Gaussian process (GP) model in Stan. The model itself works quite well – results make sense both numerically and visually, convergence is reliably & quickly reached on varying datasets, etc. However, I’d like to be able to provide end users with estimates of the model’s log-likelihood and accompanying Bayesian information criterion (BIC) so that they can compare runs of the model with varying parameters. From what I’ve read online e.g., this loo vignette and this forum post, the correct way to do this is to declare a vector called log_lik in the generated quantities block and fill it in using the correct _lpmf or _lpdf function. As such, I wrote the following Stan code to fit the model and estimate pointwise log-likelihood:

data {
  int<lower=1> M;  // number of spots
  int<lower=1> N;  // number of gene-spot pairs in long dataframe
  int<lower=1> G;  // number of genes
  int<lower=1> k;  // number of basis functions used to approximate GP
  array[N] int<lower=1, upper=M> spot_id;  // unique ID for each spot
  array[N] int<lower=1, upper=G> gene_id;  // unique ID for each gene
  matrix[M, k] phi;  // matrix of basis functions used to approximate GP
  vector[G] gene_depths;  // vector of logged gene-level sequencing depths to adjust for in the model
  vector[N] y;  // vector of normalized, scaled gene expression used as response variable
}

parameters {
  real beta0;  // global intercept
  real beta1;  // coefficient for gene library size
  matrix[k, G] alpha_t;  // transposed matrix of gene-specific coefficients for each basis function
  real<lower=0> sigma_y;  // observation noise of response variable
  vector<lower=0>[G] amplitude;  // vector of gene-specific amplitudes of the approximate GP
  real mu_amplitude;  // mean for the amplitude
  real<lower=0> sigma_amplitude;  // SD for the amplitude
  vector[k] mu_alpha;  // vector of means for the basis function coefficients
  vector<lower=0>[k] sigma_alpha;  // vector of SDs for the basis function coefficients
}

transformed parameters {
  matrix[M, G] phi_alpha;
  phi_alpha = phi * alpha_t;
  vector[N] w;
  for (i in 1:N) {
    w[i] = phi_alpha[spot_id[i], gene_id[i]];
  }
  vector[G] amplitude_sq = square(amplitude);
}

model {
  beta0 ~ normal(0, 2);
  beta1 ~ normal(0, 2);
  mu_alpha ~ normal(0, 2);
  sigma_alpha ~ std_normal();
  mu_amplitude ~ normal(0, 2);
  sigma_amplitude ~ std_normal();
  sigma_y ~ normal(0, 2);
  for (i in 1:G) {
    alpha_t[, i] ~ normal(mu_alpha, sigma_alpha);
  }
  amplitude ~ lognormal(mu_amplitude, sigma_amplitude);
  y ~ normal(beta0 + beta1 * gene_depths[gene_id] + amplitude_sq[gene_id] .* w, sigma_y);
}

generated quantities {
  array[N] real log_lik;
  for (i in 1:N) {
    real mu_i;
    mu_i = beta0 + beta1 * gene_depths[gene_id[i]] + amplitude_sq[gene_id[i]] * phi_alpha[spot_id[i], gene_id[i]];
    log_lik[i] = normal_lpdf(y[i] | mu_i, sigma_y);
  }
}

I use variational inference via Stan’s ADVI algorithm using the cmdstanr interface to fit the model as shown below (the variable data_list contains all the data necessary to fit the model):

fit_vi <- mod$variational(data_list,
                          seed = 312,
                          init = 0,
                          algorithm = "meanfield",
                          iter = 4000L,
                          draws = 1000L,
                          elbo_samples = 150L)

The issue occurs when I try to generate draws from the approximate posterior like so:

amplitude_summary <- fit_vi$summary(variables = "amplitude")
loglik_draws <- posterior::as_draws_df(fit_vi$draws(variables = "log_lik"))

Neither the summary() nor the draws() methods are able to run without erroring out due to either 1) an out-of-memory error or 2) a no space left on device error. I believe this is due to the vast number of parameters being estimated by the model, which makes the output CSV files insanely large due to the huge number of columns.

Thus, my question is – is there any way to refactor either my model block or my generated quantities block in order to be more efficient ? In reality I only need draws for the amplitude parameter and (ideally) the log-likelihood – maybe there’s a way I’m unaware of to force CmdStan to only generate output files for that subset of parameters ? Any help is greatly appreciated !

-jack

How big is k * G? That will determine how much is required for the output.

I don’t know if cmdstanr streams output—you may be able to set it to do that without saving the results in memory. I think you can do that with cmdstanpy and with rstan.

If this is really a matter of too much output, the best thing to do is move all the parameters declared as transformed parameters into the model block. Especially if M * G or N is large. You will need 8 * ((M + 1)* G + N) bytes of memory to store a single draw just for the transformed parameters.

Thanks for the input – k \times G \approx 60,000 usually, with M \approx 2,500 and thus N \approx 7,000,000. I ended up moving everything I could to the model block instead as suggested, but the resulting CSVs still eat up my disk space (and I assume would do so for end users too), of which I have about 60GB free.

In case anyone else is interested, the following code using rstan ended up working (though it used a ton of RAM):

stan_file <- "inst/approxGP2.stan"
mod <- rstan::stan_model(stan_file, allow_optimizations = TRUE)
fit_vi <- rstan::vb(mod,
                    data = data_list, 
                    pars = list("beta0", 
                                "beta1", 
                                "alpha_t", 
                                "sigma_y", 
                                "mu_amplitude", 
                                "sigma_amplitude",
                                "mu_alpha", 
                                "sigma_alpha"), 
                    include = FALSE, 
                    seed = 312, 
                    init = 0, 
                    algorithm = "meanfield", 
                    iter = 4000L, 
                    elbo_samples = 150L)

It would be really great if CmdStanR could support a similar interface à la the pars and include arguments of rstan::vb(), since to my knowledge CmdStan is much faster and stays current with the latest Stan releases. I’m aware of this longstanding GitHub issue you created pertaining to the same, but it appears little progress has been made on it. Having already written my entire R package around the usage of CmdStanR I’m loathe to rewrite everything using rstan, but if I want to implement the log-likelihood / BIC calculation it looks like that might be my only option.