Problems fitting a Multilevel Negative Binomial Model

Dear Stan community,

I am a recent member of the Stan community and I am trying to do my best to apply what I learn from the User Guide and other Internet material. I really apologize if some of the mistakes I am making in my models are obvious for you but I am still quite unexperienced. I am struggling to implement a simple hierarchical negative binomial model.
This is a differential gene expression model in which the logarithm of the average sample expression (integer counts) are modelled as a negative binomial (through a log link) with a common phi value for all genes G. The expected value from this NB model is modelled for each gene as a sum of the average gene expression for each gene through all conditions(alpha) and the change in expression for each condition modeled as a logFC. The logFC is divided by 2 is summed to the samples where the condition is defined in the design matrix as 1 or substracted in the samples where the condition is defined as 0 in the design matrix (in order to be sure that the alphas are modelling the average gene expression and that the prior for each sample is the same, without choosing a reference sample). The average gene expression of each gene, alpha, is modelled as coming from a normal distribution with an average value of mu_alpha and a standard deviation of sigma_alpha. Although the model I show below is using a non-centered parametrization, I have also tried the centered parametrization in order to solve the fitting problems without success. The change in gene expression per condition, logFC, is modelled as the sum of two variables: TReffect and beta. TReffect is a variable which is the result of the sum of other hierarchical variables that represent the activity of other transcriptional variables which influence gene expression in a way defined by a regulatory matrix (reg_matrix, a boolean matrix that indicates which transcriptional variables are affecting the expression of which genes), a regression of the change on the TRact variables. Beta is a variable that represents the changes in expression in that condition that cannot be explained by the hierarchical variables TRact.


This would be the toy_example.stan model I have described above:

functions {
  /**
   * Alternative to neg_binomial_2_log_rng() that avoids potential 
   * numerical problems during warmup
   *
   * @param eta Mean parameter for NB distribution
   * @param phi Dispersion parameter such that Variance=mean+(1/phi)*mean^2
   * @return int Value sampled from NB(eta, phi)
   */
  int neg_binomial_2_safe_rng(real eta, real phi) {
    real gamma_rate;
    if (phi / eta == 0)
      return -9;   
    gamma_rate = gamma_rng(phi, phi / eta);
    if (gamma_rate >= exp(20.79))
      return -10;      
    return poisson_rng(gamma_rate);
  }
}

data {
  int<lower=1> G;                                                 // Number of genes
  int<lower=1> S;                                                 // Number of samples
  int<lower=1> C;                                                 // Number of conditions/covariables/interactions/batch effects
  int<lower=1> R;                                                 // Number of regulators
  int<lower=0> expression[G,S];                                   // Expression counts per gene G in sample S
  matrix<lower=0,upper=1>[C, S] design;                           // Design matrix (Conditions, interactions, batch effects)
  matrix<lower=0, upper=1>[G, R] reg_matrix;                      // Regulation matrix
}

transformed data{
  matrix<lower=-1,upper=1>[C, S] design2 = 2 * design - rep_matrix(1, C, S);     
  // Modification of the design matrix in order to index change 
  // so that change in average condition is added in condition 1 
  // and subtracted in condition 2 to the average sample condition

}

parameters {
  real mu_alpha;                       // Average value of the distribution of the logarithm of average gene expression of each gene
  real<lower=0>  sigma_alpha;          // Standard deviation of the distribution of the logarithm of the average gene expression of each gene
  vector[G] alpha_raw;                 // Non-centered parametrization for the logarithm of the average gene expression value for each gene
  matrix[G, C] beta_raw;               // Non-centered way of generating changes not explained by TRs
  vector<lower=0>[C] sigma_beta;       // Standard deviation for changes not explained by TRs
  real<lower=0> sigma_TRact;           // Standard deviation for TR activity
  matrix[R, C] TRact_raw;              // Non-centered parametrization for TR acctivity
  real<lower=0> inverse_phi;           // Parameter that will be transformed into the dispersion of the negative binomial
  // row_vector[S] log_norm_factors;
}

transformed parameters {
  vector[G] alpha;                     // Logarithm of the average gene expression for each gene accross all conditions
  matrix[G, C] beta;                   // Change not explained by TRs in the logarithm of expected average gene expression by condition 
  matrix[R, C] TRact;                  // TR activity on each condition
  matrix[G, C] TReffect;               // Effect of TRs by condition on the logarithm of expected average gene expression                                         
  matrix[G, C] logFC;                  // Change on expected average gene expression by condition                                                    
  matrix[G, S] X;                      // Logarithm of the expected average gene expression
  real<lower=1> phi;                   // Dispersion parameter of the negative binomial
  alpha = 8 + mu_alpha + 0.5 * sigma_alpha * alpha_raw;
  TRact = 0.5 * TRact_raw * sigma_TRact;
  beta = 0.5 * diag_post_multiply(beta_raw, sigma_beta);
  TReffect = reg_matrix * TRact;
  logFC = beta + TReffect;
  phi = 1 + 1/(inverse_phi)^2;
  X = rep_matrix(alpha, S) + (logFC/2) * design2;
  // X = rep_matrix(alpha, S) + beta * design2 + rep_matrix(log_norm_factors, G);
}

model {
  mu_alpha ~ std_normal() ;
  sigma_alpha ~ std_normal() ;
  to_row_vector(TRact_raw) ~ std_normal();
  sigma_TRact ~ exponential(1);
  to_row_vector(beta_raw) ~ std_normal();
  sigma_beta ~ exponential(1);                            
  inverse_phi ~ exponential(1);
  // log_norm_factors ~ normal(0, 0.05);
  for (i in 1:G) {
    expression[i, ] ~ neg_binomial_2_log(X[i, ], phi);
  }
}

generated quantities {
  // pointwise log-likelihood for LOO
  matrix[G, S] log_lik;
  // replications for posterior predictive checks
  int<lower=0> expression_pred[G, S];
  for (i in 1:G) {
    for (j in 1:S) {
      log_lik[i, j] = neg_binomial_2_log_lpmf(expression[i, j] |  X[i, j], phi);
      expression_pred[i, j] = neg_binomial_2_log_rng(X[i, j], phi);
    }
  }
}

When I try to sample from this simple model, I always find a lot of problems (using the r code from below):

Warning messages:
1: There were 7998 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 15. See
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded 
2: Examine the pairs() plot to diagnose sampling problems
 
3: The largest R-hat is 2.77, indicating chains have not mixed.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#r-hat 
4: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess 
5: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess

In order to solve the problems and try to check the computational faithfullness of the model, I am generating data using Stan following exactly the same assumptions used as priors in the model (sample some expression values from the priors). I sample some data values from this sampling model based on prior specification of my main model (using rstan) and try to infer the known values that have generate such sample.


This would be the r code used to generate the data. (I also have tried initializing the parameters at its real values (variable initialization) but the sampling is not improved…

library(rstan)

# Defining the data for sampling from the model
G <- 92
R <- 10
design <- rbind(c(0, 0, 0, 0, 1, 1, 1, 1), c(0, 0, 1, 1, 0, 0, 1, 1))
S <- dim(design)[2]
C <- dim(design)[1]
# Creation of a random regulatory matrix of the required characteristics
# (only one regulator per gene)
reg_matrix <- matrix(0, ncol = R, nrow = G)
set.seed(321)
index <- sample(c(1:10), 92, replace = TRUE, prob = rep(1/10, 10))
for (i in 1:92) {reg_matrix[i, index[i]] <- 1}
reg_matrix <- as.data.frame(reg_matrix)
# Data list for sample generation
simu_data <- list("G" = G, "S" = S, "C" = C, "design" = design, 
                  "R" = R, "reg_matrix" = reg_matrix)
# Sampling with Stan
fit_ensemble <- stan(file=toy_example_ensemble.stan,
                     data=simu_data, iter=10000, warmup=0, chains=1, 
                     refresh=500, seed=4838282, algorithm="Fixed_param")
# Extracting samples to infer with Stan
simu_expression <- extract(fit_ensemble)$expression
expression_real <- as.data.frame(simu_expression[1,,])
# Extracting true parameter values for initialization
simu1 <- extract(fit_ensemble, permuted = FALSE)
init_points <- as.list(simu1[1, 1, ])
initialization <- list(init_points, init_points, init_points, init_points)
# Fitting the model
input_data <- list("G" = G, "S" = S, "C" = C,
                   "R" = R, "reg_matrix" = reg_matrix,
                   "expression" = expression_real,
                   "design" = design)
options(mc.cores = 4)
mc.cores = parallel::detectCores()
fit <- stan(file=toy_example.stan,
            data=input_data, seed=493848, iter = 4000, 
            refresh = 400, chains = 4, init = initialization,
            control = list(max_treedepth = 15))
} 

The model for sampling is following every assumption from the model above (even generating data using non-centered parametrization, which I guess makes no difference at all but I just tried anything I could come up with)


toy_example_ensemble.stan

functions {
  /**
   * Alternative to neg_binomial_2_log_rng() that avoids potential
   * numerical problems during warmup
   *
   * @param eta Mean parameter for NB distribution
   * @param phi Dispersion parameter such that Variance=mean+(1/phi)*mean^2
   * @return int Value sampled from NB(eta, phi)
   */
  int neg_binomial_2_safe_rng(real eta, real phi) {
    real gamma_rate;
    if (phi / eta == 0)
      return -9;
    gamma_rate = gamma_rng(phi, phi / eta);
    if (gamma_rate >= exp(20.79))
      return -10;
    if (gamma_rate == 0)
      return -11;
    return poisson_rng(gamma_rate);
  }
}

data {
  int<lower=1> G;                                                 // Number of genes
  int<lower=1> S;                                                 // Number of samples
  int<lower=1> C;                                                 // Number of conditions/covariables/interactions/batch effects
  int<lower=1> R;                                                 // Number of regulators
  matrix<lower=0,upper=1>[C, S] design;                           // Design matrix (Conditions, interactions, batch effects)
  matrix<lower=0, upper=1>[G, R] reg_matrix;                      // Regulation matrix

}

transformed data{
  matrix<lower=-1,upper=1>[C, S] design2 = 2 * design - rep_matrix(1, C, S);
  // Modification of the design matrix in order to index change 
  // so that change in average condition is added in condition 1 
  // and subtracted in condition 2 to the average sample condition
}

generated quantities {
  real  mu_alpha;                 // Average value of the distribution of the logarithm of average gene expression of each gene
  real<lower=0>  sigma_alpha;     // Standard deviation of the distribution of the logarithm of the average gene expression of each gene
  vector[G] alpha;                // Logarithm of the average gene expression for each gene accross all conditions
  vector[G] alpha_raw;            // Non-centered parametrization for the logarithm of the average gene expression value for each gene
  matrix[G, C] beta_raw;          // Non-centered way of generating changes not explained by TRs
  vector<lower=0>[C] sigma_beta;  // Standard deviation for changes not explained by TRs
  matrix[G, C] beta;              // Change not explained by TRs in the logarithm of expected average gene expression by condition 
  real<lower=0> sigma_TRact;          // Standard deviation for TR activity
  matrix[R, C] TRact_raw;         // Non-centered parametrization for generating TR acctivities
  matrix[R, C] TRact;             // TR activity on each condition
  matrix[G, C] TReffect;          // Effect of TRs by condition on the logarithm of expected average gene expression 
  matrix[G, C] logFC;             // Change on expected average gene expression by condition
  matrix[G, S] X;                 // Logarithm of the expected average gene expression
  matrix[G, S] expression;        // Gene expression
  real<lower=0> inverse_phi = exponential_rng(1);    // Generation of a parameter that will be transformed into the dispersion of the negative binomial
  real<lower=1> phi = 1 + 1/(inverse_phi)^2;         // Dispersion parameter of the negative binomial
  sigma_alpha = fabs(normal_rng(0, 1)); 
  mu_alpha = normal_rng(0, 1);
  for (i in 1:G) {
    alpha_raw[i] = normal_rng(0, 1);
  }
  for (i in 1:G) {
    alpha[i] = 8 + 0.5 * sigma_alpha * alpha_raw[i];
  }
  for (i in 1:G) {
    for (j in 1:C) {
      beta_raw[i, j] = normal_rng(0, 1);
    }
  }
  for (i in 1:C) {
    sigma_beta[i] = fabs(normal_rng(0, 1));
  }
  beta = 0.5 * diag_post_multiply(beta_raw, sigma_beta);
  sigma_TRact = fabs(normal_rng(0, 1));
  for (i in 1:R) {
    for (j in 1:C) {
      TRact_raw[i, j] = normal_rng(0, 1);
    }
  }
  TRact = 0.5 * sigma_TRact * TRact_raw;
  TReffect = reg_matrix * TRact;
  logFC = beta + TReffect;
  X = rep_matrix(alpha, S) + (logFC/2) * design2;
  for (i in 1:G) {
    for (j in 1:S) {
      // expression[i, j] = neg_binomial_2_log_rng(X[i, j], phi);
      expression[i, j] = neg_binomial_2_safe_rng(exp(X[i, j]), phi);
    }
  }
}

I started with a max_treedepth of 10 but I have increased it until 15 with no success… I thought about initialization but that didn’t work either. I used different parametrization for alpha without better results. I think that the real problem may be due to my inadequate understanding of the sampling process and how the geometry of my problem affects it. Any help with this would indeed be much appreciated. I ran out of possible solutions.
Thank you for your time and sorry for the length of the message.

This is a good test. If it doesn’t work on simulated data, then something in the model needs to change.

That means identifying what is going wrong and doing something to fix it (in some order).

Since you’re already doing simulated data, things you can try:

  1. Look at the posterior correlation matrix. If you’re hitting max_treedepth, that can be nonidentifiabilities, and you can find some of those by looking at correlations

  2. Check traceplots of lp__. Are your chains sampling similar lps or are some of them wildly different? If they’re different, maybe your posterior is multimodal.

  3. Hold some parameters constant. Scale parameters (standard deviations and such) are usually harder to estimate than means. If you can hold some parameters constant and do your inference, then it’s possible to you can tighten your priors and get somewhere. Do some exploring here – sometimes a one parameter model is easier than a two parameter model and sometimes the opposite.