Dear Stan community,
I am a recent member of the Stan community and I am trying to do my best to apply what I learn from the User Guide and other Internet material. I really apologize if some of the mistakes I am making in my models are obvious for you but I am still quite unexperienced. I am struggling to implement a simple hierarchical negative binomial model.
This is a differential gene expression model in which the logarithm of the average sample expression (integer counts) are modelled as a negative binomial (through a log link) with a common phi value for all genes G. The expected value from this NB model is modelled for each gene as a sum of the average gene expression for each gene through all conditions(alpha) and the change in expression for each condition modeled as a logFC. The logFC is divided by 2 is summed to the samples where the condition is defined in the design matrix as 1 or substracted in the samples where the condition is defined as 0 in the design matrix (in order to be sure that the alphas are modelling the average gene expression and that the prior for each sample is the same, without choosing a reference sample). The average gene expression of each gene, alpha, is modelled as coming from a normal distribution with an average value of mu_alpha and a standard deviation of sigma_alpha. Although the model I show below is using a non-centered parametrization, I have also tried the centered parametrization in order to solve the fitting problems without success. The change in gene expression per condition, logFC, is modelled as the sum of two variables: TReffect and beta. TReffect is a variable which is the result of the sum of other hierarchical variables that represent the activity of other transcriptional variables which influence gene expression in a way defined by a regulatory matrix (reg_matrix, a boolean matrix that indicates which transcriptional variables are affecting the expression of which genes), a regression of the change on the TRact variables. Beta is a variable that represents the changes in expression in that condition that cannot be explained by the hierarchical variables TRact.
This would be the toy_example.stan model I have described above:
functions {
/**
* Alternative to neg_binomial_2_log_rng() that avoids potential
* numerical problems during warmup
*
* @param eta Mean parameter for NB distribution
* @param phi Dispersion parameter such that Variance=mean+(1/phi)*mean^2
* @return int Value sampled from NB(eta, phi)
*/
int neg_binomial_2_safe_rng(real eta, real phi) {
real gamma_rate;
if (phi / eta == 0)
return -9;
gamma_rate = gamma_rng(phi, phi / eta);
if (gamma_rate >= exp(20.79))
return -10;
return poisson_rng(gamma_rate);
}
}
data {
int<lower=1> G; // Number of genes
int<lower=1> S; // Number of samples
int<lower=1> C; // Number of conditions/covariables/interactions/batch effects
int<lower=1> R; // Number of regulators
int<lower=0> expression[G,S]; // Expression counts per gene G in sample S
matrix<lower=0,upper=1>[C, S] design; // Design matrix (Conditions, interactions, batch effects)
matrix<lower=0, upper=1>[G, R] reg_matrix; // Regulation matrix
}
transformed data{
matrix<lower=-1,upper=1>[C, S] design2 = 2 * design - rep_matrix(1, C, S);
// Modification of the design matrix in order to index change
// so that change in average condition is added in condition 1
// and subtracted in condition 2 to the average sample condition
}
parameters {
real mu_alpha; // Average value of the distribution of the logarithm of average gene expression of each gene
real<lower=0> sigma_alpha; // Standard deviation of the distribution of the logarithm of the average gene expression of each gene
vector[G] alpha_raw; // Non-centered parametrization for the logarithm of the average gene expression value for each gene
matrix[G, C] beta_raw; // Non-centered way of generating changes not explained by TRs
vector<lower=0>[C] sigma_beta; // Standard deviation for changes not explained by TRs
real<lower=0> sigma_TRact; // Standard deviation for TR activity
matrix[R, C] TRact_raw; // Non-centered parametrization for TR acctivity
real<lower=0> inverse_phi; // Parameter that will be transformed into the dispersion of the negative binomial
// row_vector[S] log_norm_factors;
}
transformed parameters {
vector[G] alpha; // Logarithm of the average gene expression for each gene accross all conditions
matrix[G, C] beta; // Change not explained by TRs in the logarithm of expected average gene expression by condition
matrix[R, C] TRact; // TR activity on each condition
matrix[G, C] TReffect; // Effect of TRs by condition on the logarithm of expected average gene expression
matrix[G, C] logFC; // Change on expected average gene expression by condition
matrix[G, S] X; // Logarithm of the expected average gene expression
real<lower=1> phi; // Dispersion parameter of the negative binomial
alpha = 8 + mu_alpha + 0.5 * sigma_alpha * alpha_raw;
TRact = 0.5 * TRact_raw * sigma_TRact;
beta = 0.5 * diag_post_multiply(beta_raw, sigma_beta);
TReffect = reg_matrix * TRact;
logFC = beta + TReffect;
phi = 1 + 1/(inverse_phi)^2;
X = rep_matrix(alpha, S) + (logFC/2) * design2;
// X = rep_matrix(alpha, S) + beta * design2 + rep_matrix(log_norm_factors, G);
}
model {
mu_alpha ~ std_normal() ;
sigma_alpha ~ std_normal() ;
to_row_vector(TRact_raw) ~ std_normal();
sigma_TRact ~ exponential(1);
to_row_vector(beta_raw) ~ std_normal();
sigma_beta ~ exponential(1);
inverse_phi ~ exponential(1);
// log_norm_factors ~ normal(0, 0.05);
for (i in 1:G) {
expression[i, ] ~ neg_binomial_2_log(X[i, ], phi);
}
}
generated quantities {
// pointwise log-likelihood for LOO
matrix[G, S] log_lik;
// replications for posterior predictive checks
int<lower=0> expression_pred[G, S];
for (i in 1:G) {
for (j in 1:S) {
log_lik[i, j] = neg_binomial_2_log_lpmf(expression[i, j] | X[i, j], phi);
expression_pred[i, j] = neg_binomial_2_log_rng(X[i, j], phi);
}
}
}
When I try to sample from this simple model, I always find a lot of problems (using the r code from below):
Warning messages:
1: There were 7998 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 15. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
2: Examine the pairs() plot to diagnose sampling problems
3: The largest R-hat is 2.77, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
In order to solve the problems and try to check the computational faithfullness of the model, I am generating data using Stan following exactly the same assumptions used as priors in the model (sample some expression values from the priors). I sample some data values from this sampling model based on prior specification of my main model (using rstan) and try to infer the known values that have generate such sample.
This would be the r code used to generate the data. (I also have tried initializing the parameters at its real values (variable initialization) but the sampling is not improved…
library(rstan)
# Defining the data for sampling from the model
G <- 92
R <- 10
design <- rbind(c(0, 0, 0, 0, 1, 1, 1, 1), c(0, 0, 1, 1, 0, 0, 1, 1))
S <- dim(design)[2]
C <- dim(design)[1]
# Creation of a random regulatory matrix of the required characteristics
# (only one regulator per gene)
reg_matrix <- matrix(0, ncol = R, nrow = G)
set.seed(321)
index <- sample(c(1:10), 92, replace = TRUE, prob = rep(1/10, 10))
for (i in 1:92) {reg_matrix[i, index[i]] <- 1}
reg_matrix <- as.data.frame(reg_matrix)
# Data list for sample generation
simu_data <- list("G" = G, "S" = S, "C" = C, "design" = design,
"R" = R, "reg_matrix" = reg_matrix)
# Sampling with Stan
fit_ensemble <- stan(file=toy_example_ensemble.stan,
data=simu_data, iter=10000, warmup=0, chains=1,
refresh=500, seed=4838282, algorithm="Fixed_param")
# Extracting samples to infer with Stan
simu_expression <- extract(fit_ensemble)$expression
expression_real <- as.data.frame(simu_expression[1,,])
# Extracting true parameter values for initialization
simu1 <- extract(fit_ensemble, permuted = FALSE)
init_points <- as.list(simu1[1, 1, ])
initialization <- list(init_points, init_points, init_points, init_points)
# Fitting the model
input_data <- list("G" = G, "S" = S, "C" = C,
"R" = R, "reg_matrix" = reg_matrix,
"expression" = expression_real,
"design" = design)
options(mc.cores = 4)
mc.cores = parallel::detectCores()
fit <- stan(file=toy_example.stan,
data=input_data, seed=493848, iter = 4000,
refresh = 400, chains = 4, init = initialization,
control = list(max_treedepth = 15))
}
The model for sampling is following every assumption from the model above (even generating data using non-centered parametrization, which I guess makes no difference at all but I just tried anything I could come up with)
toy_example_ensemble.stan
functions {
/**
* Alternative to neg_binomial_2_log_rng() that avoids potential
* numerical problems during warmup
*
* @param eta Mean parameter for NB distribution
* @param phi Dispersion parameter such that Variance=mean+(1/phi)*mean^2
* @return int Value sampled from NB(eta, phi)
*/
int neg_binomial_2_safe_rng(real eta, real phi) {
real gamma_rate;
if (phi / eta == 0)
return -9;
gamma_rate = gamma_rng(phi, phi / eta);
if (gamma_rate >= exp(20.79))
return -10;
if (gamma_rate == 0)
return -11;
return poisson_rng(gamma_rate);
}
}
data {
int<lower=1> G; // Number of genes
int<lower=1> S; // Number of samples
int<lower=1> C; // Number of conditions/covariables/interactions/batch effects
int<lower=1> R; // Number of regulators
matrix<lower=0,upper=1>[C, S] design; // Design matrix (Conditions, interactions, batch effects)
matrix<lower=0, upper=1>[G, R] reg_matrix; // Regulation matrix
}
transformed data{
matrix<lower=-1,upper=1>[C, S] design2 = 2 * design - rep_matrix(1, C, S);
// Modification of the design matrix in order to index change
// so that change in average condition is added in condition 1
// and subtracted in condition 2 to the average sample condition
}
generated quantities {
real mu_alpha; // Average value of the distribution of the logarithm of average gene expression of each gene
real<lower=0> sigma_alpha; // Standard deviation of the distribution of the logarithm of the average gene expression of each gene
vector[G] alpha; // Logarithm of the average gene expression for each gene accross all conditions
vector[G] alpha_raw; // Non-centered parametrization for the logarithm of the average gene expression value for each gene
matrix[G, C] beta_raw; // Non-centered way of generating changes not explained by TRs
vector<lower=0>[C] sigma_beta; // Standard deviation for changes not explained by TRs
matrix[G, C] beta; // Change not explained by TRs in the logarithm of expected average gene expression by condition
real<lower=0> sigma_TRact; // Standard deviation for TR activity
matrix[R, C] TRact_raw; // Non-centered parametrization for generating TR acctivities
matrix[R, C] TRact; // TR activity on each condition
matrix[G, C] TReffect; // Effect of TRs by condition on the logarithm of expected average gene expression
matrix[G, C] logFC; // Change on expected average gene expression by condition
matrix[G, S] X; // Logarithm of the expected average gene expression
matrix[G, S] expression; // Gene expression
real<lower=0> inverse_phi = exponential_rng(1); // Generation of a parameter that will be transformed into the dispersion of the negative binomial
real<lower=1> phi = 1 + 1/(inverse_phi)^2; // Dispersion parameter of the negative binomial
sigma_alpha = fabs(normal_rng(0, 1));
mu_alpha = normal_rng(0, 1);
for (i in 1:G) {
alpha_raw[i] = normal_rng(0, 1);
}
for (i in 1:G) {
alpha[i] = 8 + 0.5 * sigma_alpha * alpha_raw[i];
}
for (i in 1:G) {
for (j in 1:C) {
beta_raw[i, j] = normal_rng(0, 1);
}
}
for (i in 1:C) {
sigma_beta[i] = fabs(normal_rng(0, 1));
}
beta = 0.5 * diag_post_multiply(beta_raw, sigma_beta);
sigma_TRact = fabs(normal_rng(0, 1));
for (i in 1:R) {
for (j in 1:C) {
TRact_raw[i, j] = normal_rng(0, 1);
}
}
TRact = 0.5 * sigma_TRact * TRact_raw;
TReffect = reg_matrix * TRact;
logFC = beta + TReffect;
X = rep_matrix(alpha, S) + (logFC/2) * design2;
for (i in 1:G) {
for (j in 1:S) {
// expression[i, j] = neg_binomial_2_log_rng(X[i, j], phi);
expression[i, j] = neg_binomial_2_safe_rng(exp(X[i, j]), phi);
}
}
}
I started with a max_treedepth of 10 but I have increased it until 15 with no success… I thought about initialization but that didn’t work either. I used different parametrization for alpha without better results. I think that the real problem may be due to my inadequate understanding of the sampling process and how the geometry of my problem affects it. Any help with this would indeed be much appreciated. I ran out of possible solutions.
Thank you for your time and sorry for the length of the message.