Initialization between (-2, 2) failed after 100 attempts in Negative Binomial Distribution

Hi everyone,

We are running into some issues with initializing our negative binomial distribution (log alternative parameterization). We are trying to model the counts of transcripts (in RNASeq, etc.) using a negative binomial distribution, where the LFC follows a normal distribution and the variance follows a Cauchy distribution.

However, we get the “Initialization between (-2, 2) failed after 100 attempts” error message:
“Initialization between (-2, 2) failed after 100 attempts.
Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.”

We tried various things, including limiting our Cauchy distribution:
reciprocal_phi ~ cauchy(0., 5.); -->
reciprocal_phi ~ cauchy(0., 1.);

and also specifying an init_r of 0.1, but we are still getting the error message. There’s probably something pretty fundamental we are missing, would appreciate any suggestions.

Our transcripts count data is a matrix of 5 genomic regions X 10 samples (as an initial trial run). None of the rows or columns sum to zero. Another strange thing is that I am also running the full counts matrix of 20k genomic regions X 10 samples and it seemed to get over the initialization…?

Thank you! Posting the stan code and the python code below:


Stan:

data {
  int<lower=0> N;    // number of samples
  int<lower=0> D;    // number of dimensions
  int<lower=0> p;    // number of covariates
  real depth[N];     // sequencing depths of microbes
  matrix[N, p] x;    // covariate matrix
  int y[N, D];       // observed microbe abundances
}

parameters {
  // parameters required for linear regression on the species means
  matrix[p, D-1] beta;
  real reciprocal_phi;
}

transformed parameters {
  matrix[N, D-1] lam;
  matrix[N, D] lam_clr;
  matrix[N, D] prob;
  vector[N] z;
  real phi;

  phi = 1. / reciprocal_phi;

  z = to_vector(rep_array(0, N));
  lam = x * beta;
  lam_clr = append_col(z, lam);

}

model {
  // setting priors ...
  reciprocal_phi ~ cauchy(0., 1.);
  for (i in 1:D-1){
    for (j in 1:p){
      beta[j, i] ~ normal(0., 1.); // uninformed prior
    }
  }
  // generating counts
  for (n in 1:N){
    for (i in 1:D){
      target += neg_binomial_2_log_lpmf(y[n, i] | depth[n] + lam_clr[n, i], phi);
    }
  }
}

Python code:

# Load the data
reads_array = pd.read_csv('/mnt/ceph/users/dli/atacseq/pca_clustering/counts_universe_gut/subset_counts.csv', sep='\t', index_col=0)
metadata = pd.read_csv('/mnt/ceph/users/dli/atacseq/pca_clustering/counts_universe_gut/metadata.csv', sep=' ', index_col=0)

formula = 'C(HistoStandard) + C(Batch) + C(Diagnosis)'
X = dmatrix(formula, metadata, return_type='dataframe')
Y = reads_array.T

data = {
    'N' : X.shape[0],
    'D' : reads_array.shape[0],
    'p' : X.shape[1],
    'depth' : np.log(reads_array.sum(axis=0).values),
    'x' : X.values,
    'y' : Y.values.astype(np.int64)
}

sm = pystan.StanModel(model_code=stan_model)
fit = sm.sampling(data=data, iter=1000, chains=4, init_r=.1, warmup=500)
res = fit.extract(permuted=True)

The parameters for the negative binomial distribution need to be positive. In the case of phi, you can tell Stan that phi must be positive by writing real<lower=0> reciprocal_phi instead of what you currently have.

Thank you! I tried now:

transformed parameters {
  matrix[N, D-1] lam;
  matrix[N, D] lam_clr;
  matrix[N, D] prob;
  vector[N] z;
  real<lower=0> phi;

...

but now it also complains about the phi in addition to still not initializing:

“Rejecting initial value:
Error evaluating the log probability at the initial value.
Exception: validate transformed params: phi is -13.8702, but must be greater than or equal to 0 (in ‘unknown file name’ at line 22)”

I think the phi value must be a consequence of how it initializes rather than a deciding factor?

You should define phi in the parameters block directly:

parameters {
real<lower=0> phi;

Or, if you want to work with reciprocal_phi, put the constraint on it in the parameters block:

parameters {
real<lower=0> reciprocal_phi;

It won’t work to put the bound in transformed parameters.

Hello!

Thanks for the suggestions!

Tried a bunch of things:

parameters {
  //stuff
  real<lower=0> reciprocal_phi;
  real<lower=0> phi;
}

transformed parameters {
  //stuff  
  phi = 1. / reciprocal_phi;
  //stuff
}

^ Complains about Cannot assign to variable outside of declaration block; left-hand-side variable origin=parameter, Illegal statement beginning with non-void expression parsed as phi

parameters {
  //stuff
  real<lower=0> reciprocal_phi;
  real<lower=0> phi;

  phi = 1. / reciprocal_phi;
}

^ Complains about:

"SYNTAX ERROR, MESSAGE(S) FROM PARSER:
 error in 'unknown file name' at line 17, column 2
  -------------------------------------------------
    15:   real<lower=0> phi;
    16:
    17:   phi = 1. / reciprocal_phi;
         ^
    18: }
  -------------------------------------------------

PARSER EXPECTED: <one of the following:
  a variable declaration, beginning with type,
      (int, real, vector, row_vector, matrix, unit_vector,
       simplex, ordered, positive_ordered,
       corr_matrix, cov_matrix,
       cholesky_corr, cholesky_cov
  or '}' to close variable declarations>"

And if I just do

parameters {
  //stuff
  real<lower=0> reciprocal_phi;
}

transformed parameters {
  //stuff  
  real<lower=0> phi;

  phi = 1. / reciprocal_phi;
  //stuff
}

it runs but still runs into the same initialization errors (even with trying different init_r:

Initialization between (-1, 1) failed after 100 attempts.
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.

I recommend just forgetting about reciprocal_phi. Just define phi in the parameters block. Also give phi an informative prior that makes sense in the context of your problem, and all should work, I’d think.

Thanks for all the suggestions! We figured it out, it was because the depths of some of our samples was -Inf, ie, some of our samples did not have any counts in any of the selected 5 genomics regions and we calculate the depth as log(sum))

We selected other genomic regions s.t. there was at least one count of any of the genomic region in every sample and ran the script again and it sampled successfully.