Fit convolution of negative binomial distributions/ Can't define integer in parameter block

Hi,

what I am basically trying to do is to model a process in which I observe Y which is a convolution of c negative binomials:
Y=Y_1+...+Y_c ,
where each random variable Y_i \sim NB(\alpha_i, p_i) .

As described in this Paper (https://www.sciencedirect.com/science/article/abs/pii/S0167715206002185), the distribution of Y \sim NB(\alpha + K, p_{max} ) is "mixture negative binomial with mixture parameter \alpha + K and p_{max}", where
\alpha = \alpha_1 + ... + \alpha_c
p_{max} = max_i(p_i)

K is an integer random variable with pmf:
pr_k = R\delta_k, k=0,1…
where:
q_i = 1-p_i
R = \displaystyle\prod_{j=1}^c \left( \frac{p_{max} q_j}{q_{max}p_j} \right)
\delta_{k+1} = \frac{1}{k+1} \displaystyle\sum^{k+1}_{i=1} i \xi_i \delta_{k+1-i} , with \delta_0 = 0
and \xi_i = \displaystyle\sum_{j=1}^n \frac{\alpha_j(1-q_{max}p_j / q_jp_{max})^i}{i}


The stan model I managed to do so far is the following. The problem I encounter thereby is that I have the unkown k and y_i, i=1,...,c (for each observation n) that need to be integer values, but in the parameter or transformed parameter block (the way it is written now in the model) this is not possible.
I know that this has already been asked before, but the answers aren’t applicable to my model.

Is there a specific reason this is not allowed?

And might there be a work around for my model?

functions{
  // functions for the pmf of the random variable K  
   
  real R(real[] alpha_c, real[] prob_c, real prob_m){
  	real r = 0;
  	real beta_m = prob_m / (1-prob_m);
  	for(j in 1:size(prob_c)){
  	      r = r * ( (((1-prob_c[j]) / (prob_c[j])) * beta_m)^(-alpha_c[j])); 
  	}

	return r;
  }

  real epsilon(int i, real[] alpha_c, real[] prob_c, real prob_m){
    	real e;
    	real prob_m_s;
    	real beta_cj;
    
    	e = 0;
    	prob_m_s = (1-prob_m) / prob_m;
    
    	for(j in 1:size(prob_c)){
    	   beta_cj = prob_c[j] / (1-prob_c[j]);
    	   e += (alpha_c[j] * (1 - ( prob_m_s *  beta_cj ))^i ) / i;
    	}
    	return e;
  }
  
  real delta(int k, real[] alpha_c, real[] prob_c, real prob_m);
     
  real delta(int k, real[] alpha_c, real[] prob_c, real prob_m){
       if( k == 0) {
  	    return 1;
       }else{
      	   real d = 0;
      	   for(i in 1:k){ 
      	      d += i * epsilon(i, alpha_c,  prob_c,  prob_m ) * delta( k - i,  alpha_c,  prob_c,  prob_m); 
      	   }
      	d = d / k;
      	return d;
       }
  }

  // log pmf for K
  real param_k_log(int k, real[] alpha_c, real[] prob_c, real prob_m){
    	real p;
    	p = R(alpha_c, prob_c, prob_m) * delta(k ,alpha_c, prob_c, prob_m);
    	return log(p);
  }

}

data {
  int N; // # of trials
  int y[N]; // # of failures in trial n
  int <lower=1> C; // # random variables (cells) in each N
}


parameters{
  positive_ordered[C] beta_c;
  real<lower=0, upper=1000> alpha_c[C];  
  simplex [C] y_unscaled[N];
}

transformed parameters{
  real<lower=0, upper=0.999> prob_c[C]; // probability for each NB
  int<lower=0>[C,N] y_c; // random variables y_1,...,y_c that sum up to y
  real<lower=0, upper=0.999> prob_m;
  real<lower=0> beta_m;
  real<lower=0> alpha_m;
  int<lower=0> k; 
  
  for(c in 1:C){
     prob_c[c] = beta_c[c] / (beta_c[c] + 1);
  	
  	for(n in 1:N){
	     y_c[c,n] = y_unscaled[n,c] * y[n];
  	}
  }

  prob_m = max(prob_c);
  beta_m = prob_m / (1-prob_m);
  alpha_m = sum(alpha_c);
}


model{
   for( c in 1:C){
      y_c[c] ~ neg_binomial(alpha_c[c], beta_c[c]);
   }

    k ~ param_k(alpha_c, prob_c,prob_m );
    y ~ neg_binomial(alpha_m + k, beta_m);
}


Data can be generated in R with the following code:

y <-  integer(N)
C <- 2
N <- 1000
alpha <- c(20,100)
prob <- c(0.9,0.8)
  for(c in 1:C){
    y <- y +  rnbinom(n=N, size=alpha[c], prob = prob[c] )
  }

I would very much appreciate any help, I have been stuck with this for quite a while.

Thanks!

1 Like

I just so happen to have been working on this problem recently with @stemangiola :-) I have a writeup of the results on my blog: https://www.martinmodrak.cz/2019/06/20/approximate-densities-for-sums-of-variables-negative-binomials-and-saddlepoint/

The TLDR is that a simple approximation with a single NB density found by matching moments (the mean and variance) of the sum works quite well for the inferential targets I could conceive. A more involved approximation can do somewhat better sometimes.

With the neg_binomial_2 paremetrization in Stan we have

Y_i \sim NB_2(\mu_i, \phi_i), E(Y_i) = \mu_i, Var(Y_i) = \mu_i + \frac{\mu_i^2}{\phi_i}

and the moments approximation is then

Y \simeq \bar Y \\ \bar Y \sim NB_2(\bar\mu,\bar\phi) \\ \bar \mu = \sum \mu_i\\ \bar \phi = \frac{(\sum \mu_i)^2}{\sum\frac{\mu_i^2}{\phi_i}}

Where \bar \mu and \bar \phi are determined by solving

E(\bar Y) = E(Y) = \sum E(Y_i)\\ Var(\bar Y) = Var(Y) = \sum Var(Y_i)

So this should be easy to extend to your favorite parametrization of the NB. See the blog for more details.

The reason is that unfortunately nobody in the world knows how to reliably combine Hamiltonian Monte Carlo with discrete random variables. Not for lack of trying, it is just that the problem turns out to be really difficult and there probably isn’t a general solution for all problems.

Alternatively, if you can put some fixed bounds K_{min}, K_{max} on K so that values of K outside [K_{min},K_{max}] contribute little to the final density, you could marginalize K i.e. have a simplex variable of size K_{max} - K_{min} that would carry the probability that K attains any of the values and then do a weighted average of the densities for individual K within the interval. I however don’t think this has much chance of working because even computing \delta_{k+1} is a nasty recursion infeasible for all but the smallest k. Also you would still be in the realm of approximations so might as well use one of the two I show in my blog which are cheaper to compute

Finally, I would be glad to know what your area of application is. I was considering doing a more formal writeup of my experiments with the sum of NBs and decided against it as it seemed there are not many actual use cases, so I’d like to learn about yours, it might push me to investigate a bit further :-)

Hope that helps!

If your model works when k is specified you could also run your model for varying values of k and then use model comparison to decide between them. This will depend on how viable model comparison is for your use case.

If k is a nuisance parameter for you, then you could also just pick a big value for k and use that since a bigger value will never make your model worse.

Otherwise, HMC, NUTS like differentiable models and discrete parameters don’t play well there. The Gibbs based samplers don’t have the same aversion (e.g. BUGS) but they are also much slower and more impractical.

Thank you very much for your fast response!

The problem is that I need to fit the parameters \alpha_i and p_i for each one of the underlying NBs rather than for the observed Y (or an approximation) as it is done in your model. (Sorry if I didn’t explain it clearly before).
Do you happen to have any experience on that?

What we are working on is gene expression data. Our random variable Y is the observed measurement and each observation is the sum of expression counts of a small number of cells. From these observations we would like to fit the NBs for each population the cells stemmed from.
For now I am assuming that each observation is the sum of exactly c cells and each of these cells stems from a different population.

As far as I can tell what you’re asking for is impossible. Negative binomial is an overdispersed Poisson and sum of Poisson variables is just a Poisson with rate equal to the sum of the individual rates. Only the sum is identifiable, no information about the individuals remains. Now, overdispersion complicates the story so maybe there is something that can be learned beyond the collective parameters but it’s not much.
You’re mushing together multiple cells and then measure the average gene expression level. Why do you think you’ll be able to disentangle them with a statistical model?

EDIT:
If we add nuisance parameters to represent the negative binomial as gamma-poisson mixture it’s possible to marginalize out the discrete parameters.

data {
  int N; // # of trials
  int y[N]; // # of failures in trial n
  int <lower=1> C; // # random variables (cells) in each N
}
parameters {
  positive_ordered[C] beta_c;
  real<lower=0> alpha_c[C];  
  real<lower=0> lambda[N,C]; // additional latent parameters
}
model {
  for (n in 1:N) {
    lambda[n] ~ gamma(alpha_c, beta_c);
    y[n] ~ poisson(sum(lambda[n]));
  }
}
1 Like

First - this sounds like almost exactly what @stemangiola is working on, although he was happy to work with the sum :-) Anyway, there might be possibility to share some insights.

I think @nhuurre is spot on. As you can see from the densities on the blog post I linked to, the density of the sum is veeeery close to single NB for many parameter combinations (and if the fraction mean/variance is constant, then this is exact), so most information is erased and the parameters become only very weakly identified. The model by @nhuurre is correct in principle but in practice your marginal posteriors for all alpha_c and beta_c would likely be almost identical and there would be some huge collinearites or worse, hindering sampling unless the dimension is very small.

I have been working for couple of years on transcriptional signal deconvolution https://github.com/stemangiola/ARMET. I am now in the process of completing the updated model that will go onto publication.

To approach your problem a prior information of the transcriptional signature would simplify the problem incredibly. If you don’t have prior information you need a lot of mixtures to try to identify both cell-specific signatures and proportionality.