Estimating parameters from linear combination of 2 indepenedent poisson random variables

Hi,

I’m trying to model a generative process where I have 2 independent random variables, such that:

X_1 \sim Poisson(\lambda_1)
X_2 \sim Poisson(\lambda_2)
Y = a_1X_1 + a_2X_2

I would like to create a model to estimate \lambda_1 and \lambda_2 from observed outcome Y and observed vectors of data a_1 and a_2.

R code to simulate the generative process would look like this:

library(rstan)
set.seed(12345)

N <- 2000 #the number of simulated samples

a1 <- rnorm(N) #vectors of real numbers
a2 <- rnorm(N)

lambda1 <- 3 #The parameters we will try to recover
lambda2 <- 10

y <- (a1 * rpois(N, lambda1)) + (a2 * rpois(N, lambda2)) #create simulated y, a linear combination of random draws from these two independent Poisson distributions

In short, I’m a little confused where to start. I don’t think this is a mixture model (log_mix()…), as a mixture would have either one of the two distribution for each observation—not a combination of both.

I think solving this problem would involve deriving a new PDF for a new “linear combination of poissons” distribution and specifying that as a new “user defined” PDF in Stan?

Does this sound correct, or is there something I’m missing here? With sufficient data I think \lambda_1 and \lambda_2 should be estimable, but I’m not sure there exists a straightforward way of implementing the correct “linear combination” model…

(Still new to Stan)

2 Likes

So here is my 5 min first stab at it. You could consider your X_1,X_2 Poisson realisations a latent variables which are Poisson distributed with parameters \lambda_1,\lambda_2 respectively. We could then model y as defined by your formula + some gaussian noise added. It’d look something like this.

data {
  int N;
  real a1[N];
  real a2[N];
  real y[N];
}
parameters {
  real<lower=0> lam1;
  real<lower=0> lam2;
  real<lower=0> sig;
  real  p1[N];
  real  p2[N];
}
model {
  p1~poisson(lam1);
  p2~poisson(lam2);
  y ~ normal(a1*p1+a2*p2,sig);
}
2 Likes

It is great you provided code to simulate data, that really helps. I however fear your problem is not really solvable, but maybe that just means that the specification ignores something about the data generating process…

First, your simulation uses independent a1 and a2 for each observation, so you have only one observation to determine two numbers. That just can’t work.

Assuming a1 and a2 are the same for all observations, things get weird, because you multiply the integer X1, X2 with real a1 and a2. But for most a1 and a2 there is only one set of integers that would satisfy Y_i = a_1X_{1,i} + a_2X_{2,i} for all i

But that is quite a weird model maybe you meant X_1 \sim Poisson(a_1\lambda_1) (which is what people mostly assume with Poisson models), but here you cannot distinguish the individual components, as sum of Poisson variables is still a Poisson sou you can only determine the combination \bar{\lambda} = a_1\lambda_1 + a_2\lambda_2 but not the individual components.

I would guess that some of the assumptions in the model are wrong, finding them would then let you make a better model.

Hope that is making some sense.

Also, unfortunately @emiruz’s approach will not work because we can’t have discrete latent variables (and in this case, it is impossible to marginalize within the Stan program) and substituting real values for discrete would make it a different (and possibly quite weird) distribution.

2 Likes

Ah indeed, @martinmodrak is right. p1 and p2 in my cartoon model should actually be of type int, which doesn’t work in Stan. However, when \lambda>1000 ish it should be possible to approximate poisson(\lambda) with normal(\mu=\lambda,\sigma^2=\lambda). Maybe that’d rescue it but maybe not for all the reasons Martin mentioned :/

PS: apparently if \lambda>10 a similar approximation can work with a continuity correction.

1 Like

Hi Martin and Emiruz,

Martin I see based on your response, that in attempting to simplify this problem, I probably left out critical information.

In reality, a_1 and a_2 are proportions (0 < a_1 <1 and 0 < a_2 <1) and a_2 = 1 - a_1.

I think the generative process is actually more accurately described by this code (where a_1 and a_2 are bounded by 1 and 0):

library(rstan)
set.seed(12345)

# the number of smaples
N <- 2000

# create a simulated vector of proportions between 0 and 1
a1 <- rnorm(N)
a1 <- (a1 - min(a1)) / (max(a1 - min(a1)))
a2 <- 1-a1 #a2 is actually the reciprocal of a1

# The parameters we will try to recover.
lambda1 <- 3
lambda2 <- 10


# create simulated data, mixing values drawn from these two normal distributions
y <- (a1 * rpois(N, lambda1)) + (a2 * rpois(N, lambda2))

Emiruz I initially thought your solution looked great, but I also ran into the problem of not being able to use int.

I’m not sure if this new information changes things or points to a solution?

I must apologize I was trying to simplify things but it seems I went too far…

I think you still need to say more because the model is still extremely flexible - you have only one observation for each a1 and you have the two lambda’s on top - so still more parameters than data. So unless you can somehow tie the elements of a1 together (e.g. that they tend to be similar or that neigbouring values are similar) you won’t be able to learn much from your data. (also note that you can use runif or rbeta to sample numbers between 0 and 1.

EDIT: maybe it would help if you described the actual scientific question you are aiming to solve.

So this piqued an interest and a fell into the rabbit hole … Here goes some rambling.

Your model is equivalent to \frac{Y}{a_1}=X_1+\frac{a_2}{a_1}X_2. So I don’t need to keep writing that down I’ll say Z= X_1+cX_2. If c=\frac{a_2}{a_1}>0 and c is an integer are acceptable conditions then we might be in business :-)

Now, X_1 \sim poisson(\lambda_1) but it turns out there’s a closed form distribution for a scalar multiple of a Poisson variable given by :

P(X=x\mid\lambda,\alpha)=\frac{\lambda^{x/\alpha}}{(z/\alpha)!}e^{-\lambda}

according to this guy.

It’s the case that P(X_1,X_2\mid c)=P(X_1=x_1)P(X_2=x_2\mid c) and so the PDF we need is P(Z=z\mid \lambda_1,\lambda_2,c)=\sum_{x_1=0}^z\frac{\lambda_1^{x_1}e^{-\lambda_1}}{x_1!}\frac{\lambda_2^{x_2/c}}{(x_2/c)!}e^{-\lambda} where x_2=(z-x_1)/c and x_2 is a positive integer.

If you were to implement the log version of this as special_lpdf (and the formula is correct) then your model could be:

data {
  int N;
  int z[N];
  int<lower=0> c[N];
}
parameters {
  real<lower=0> lam1;
  real<lower=0> lam2;
}
model {
  for (n in 1:N)
    target+=special_lpdf(y[n]|lam1,lam2,c[n])
}

caveat emptor: these are just thoughts. not my direct experience so take it for what it’s worth.

2 Likes

Hi Martin,

The data is expression data for the expression level of a gene, for about 1000 samples. The expression value (y) are read counts, i.e. the number of reads that mapped to the gene from a sequencer. The expression data is generated from a mixture of 2 underlying cell types, cancer and normal, and I know the proportions (a1 and a2) or these 2 cell types in each of the 1000 samples. I’m trying to estimate the underling expression distribution of the 2 cell types (e.g. the mean expression and variance of the gene in cancer and normal)…

Interestingly, this works perfectly well if I could assume the data is normally distributed.

This is a similar generative process, but swaps normal distributions for poissons.

# simulated proportions.
N <- 1000
a1 <- runif(N)
a2 = 1 - a1

# Simulated Parameters we want to recover
mu1 <- 5
sigma1 <- 1
mu2 <- 10
sigma2 <- 2

# simulated data (creating a linear combination of normals instead of poissons)
y <- (a1 * rnorm(N, mu1, sigma1)) + (a2 * rnorm(N, mu2, sigma2))

# model fit
stan.fit <- stan(file="estimate_normals.stan", 
                 data = list(y=y, N = length(y), a1=a1, a2=a2
                 ), 
                 chains=3, 
                 iter=2000, 
                 init=0);

Because of the convenient properties of normal distributions, the parameters can be estimated very accurately with this model:

data{
  int N;
  real y[N];
  real a1[N];
  real a2[N];
}
parameters {
  real mu;
  real<lower = 0> sigma;
  real mu1;
  real<lower = 0> sigma1;
}
model {

  for(i in 1:N){
    target += normal_lpdf(y[i] | ((a1[i]*mu) + (a2[i]*mu1)), sqrt((a1[i]^sigma)^2 + (a2[i]*sigma1)^2));
  }
}

The estimated parameters are quite accurate and the intervals capture the true values:

           mean          se_mean
mu         5.0598589 0.001387755
sigma      0.7293268 0.005423250 
mu1        9.9866638 0.002223313 
sigma1     1.9287824 0.002192146

I guess it seems there may not be such an easy solution if the data are poisson distributed, rather than normally distributed…

Hmm very interesting, so it seems a user defined pdf may actually be the answer…

If you google “linear combination of poisson variables” folks have done all sorts of derivations for this under different constraints . My one above is pretty Mickey Mouse in comparison .

Sorry to put my 2 cents here. You can scale the rate \lambda of a poisson.
So use x = a * \lambda_1 + (1-a) * \lambda_2, with Y ~ poisson(x).
Since the variances comes into play, I’d recommend using a negative binomial distribution.

Its called “Thinning or splitting a Poisson process” see here: http://www.randomservices.org/random/poisson/Splitting.html

Added: I assume both cell types can occur in one sampling, so it’s not exclusive.

4 Likes

Hi Andre,

If I’ve understood correctly, the results of your proposed approach looks to recover the correct parameter estimates:

R code for the generative process:

# the number of smaples
N <- 1000

# create a simulated vector of proportions between 0 and 1
a1 <- runif(N)
a2 <- 1-a1

# The parameters we will try to recover.
lambda1 <- 3
lambda2 <- 10

# create simulated data, mixing values drawn from these two normal distributions
y <- (a1 * rpois(N, lambda1)) + (a2 * rpois(N, lambda2))
y1 <- round(y)

nmfOut <- stan(file="stanTest_poissons.stan", data = list(y=y1, a1=a1, a2=a2, N=1000), chains=3, iter=2000, init=0)
summary(nmfOut)$summary

Stan code for the poisson or negative binomial model (that I think I’m correctly interpreting from your post):

data {
  int N;
  vector[N] a1;
  vector[N] a2;
  int y[N];
}
parameters {
  real<lower=0> lam1;
  real<lower=0> lam2;
  real<lower=0> phi;
}
model{
  
  //y ~ poisson(a1*lam1 + a2*lam2);
  y ~ neg_binomial_2(a1*lam1 + a2*lam2, phi);
}

The outputted parameter estimates:

             mean      se_mean           
lam1     3.024426 6.313266e-03 
lam2    10.223094 9.154772e-03 
phi  51588.353110 1.858228e+03 

This looks good, although I’m still somewhat not clear on why this works… But I’ve some reading to do…

Thanks a lot for the pointer!!!

1 Like

If you have access to Mathematica:

1 Like

@andre.pfeuffer solution doesn’t correspond to your generating function. I.e. Z+Y where Z\sim poisson(a_1\lambda_1) and Y\sim poisson(a_2\lambda_2) is not equivalent to a_1X_1 + a_2X_2 where X_1\sim poisson(\lambda_1), X_2\sim poisson(\lambda_2).

The reason that they are not equivalent is that \alpha X is not Poisson distributed where X\sim poisson(\lambda) because E[\alpha X]\ne Var[\alpha X] as required by the Poisson distribution.

However, although Var[Z+Y]\ne Var[a_1 X + a_2 X] what’s interesting is that:

E[Z+Y] = a_1\lambda_1 + a_2\lambda_2 = E[a_1X_1+a_2X_2]

and hence I think you can use one form to recover the mean from the other: an impressive insight from @andre.pfeuffer !

Note also (IMHO), that the above is usually true for mixtures of Poisson as described by @andre.pfeuffer when the mixture ratios a_1,a_2 are constant for every data point, but in your case a_1,a_2 are available for every data point and \lambda_1,\lambda_2 in a_1\lambda_1 + a_2\lambda_2 end up well identified because they stay the same for every data point. This given, had your model ran with just 1 row of data it wouldn’t be identifiable but for two or more it should be!

caveat emptor: IMHO and according to my calculations from the above!

1 Like

Thanks @emiruz and @andre.pfeuffer ,

Yes my thoughts were similar as regards why I thought this solution “shouldn’t” work, so hadn’t tried it…

Is it possible that Mathematica could estimate the actual PDF for the “true” generative process using that function? (i.e. moving “a” outside the square brackets)…

I guess this variance issue probably poses some limitations, but I guess its not exactly clear to me what they are at this point…

If you’re sure your RVs are Poisson then because the expectation is the same for the two formulas you can use one to find the expectation of the other . And since for Poisson distributions the expectation is also the only parameter of interest
you should be good to go. A happy coincidence.

Caveat emptor: IMHO only.

1 Like

OK, so this makes much more sense. I actually do work a lot with models for RNA-seq data :-) Is that RNA-seq? Or (as I would guess from the number of samples) is it Sanger or other platform? In the following I assume RNA-seq (as I am more familiar with it), but most of the suggestions would IMHO apply elsewhere as well.

Anyway, the distribution of Z_1 = a_1 X_1 where X_1 \sim Poisson(\lambda_1) is a weird one and I am almost 100% sure you don’t want to use it (also - nobody in the whole field AFAIK uses it). Let’s say a_1 = \frac{1}{2} then Z_1 has support on integers and half-integers. What would that mean? Also the sequencer gives you integer counts so you have no way of producing half-integers. (and things get weirder for situations like a_1 = \sqrt{2}).

As @andre.pfeuffer mentioned most people would use neg. binomial for untransformed RNA-seq data. There is some debate on whether you shouldn’t instead transform the data and then use continuous distributions, but neg. binomial is definitely one of top contentenders and I have good experience with it.

As @emiruz noted, if a_1 and a_2 is known it is possible to consider the mean of the neg. binomial as a combination of two parameters and the model would be identifiable in most cases.

If you haven’t already done that, I think reading the DESeq2 paper is reasonably instructive on some basic RNA-seq modelling (although they run it in an frequentist setting). Unless you totally know what you are doing, you definitely want to start with something similar to their model (because it works really well). They also handle normalization and batch effects - which you almost certainly also have to face. I have a (slightly messy) implementation of the DESeq2 model in Stan based on some of @stemangiola’s code which you might want to steal from (though credit would be nice if you end up using it).

The task you want to do is usually called “deconvolution” in the RNA-seq literature and there is a bunch of software for it, so you might want to see if there isn’t something already tested and available, but I have no personal experience with any of those. I also know that @stemangiola is working on ARMET - an RNA-seq deconvolution model in Stan (the problem he solves is however different).

I am all for using Stan to analyze expression data! But be prepared that efficiency can be a major roadblock.

Best of luck with your model!

3 Likes

Hi Martin,

Yes this is RNA-seq data alright (Illumina platfrom). Thanks again for the other pointers, I will follow up on these links!

1 Like

I did a model time ago with this goal: deconvolving transcript abundance from a mix knowing cell type proportion (right hand side), with some success.

image

I think two solutions are

  • approximate to a log_normal (maybe with 0 inflation), and use a mix of normal distributions
  • infer with NB natively inferring the means from the mix and keeping the overdispersion as single parameter with a linear relationship with the mean

A good way to link overdisperion and mean, given that overdisperion = 1/exp(sigma) and mean = exp(log_mean), is

sigma = negative_slope * log_mean + c + error

P.S. @martinmodrak suggested a good way to approximate sums of NB, but I don’t think we cracked weighted sum of NB, like we have in case on convoluted mixtures. So keeping the overdispersion as one parameter will simplify the matter.

1 Like

Reading again my response, I think it was a bit of a put-down which I didn’t intend, so to provide some positive concrete steps, I would:

  1. use DESeq2’s estimateSizeFactors to get normalization constants s[i]
  2. fit using X[i] ~ neg_binomial_2(s[i] * (a[i] * lambda1 + (1 - a[i]) * lambda2), phi) where phi could be treated as a parameter or you could use DESeq2’s estimateDispersions to estimate it before fit.

Also I find it quite weird that you have 1000s of samples but apparently only model one gene from each sample. But maybe that was part of the simplification you made for the forums?

I really want to see you succeed with that, so feel free to get back (or start a new topic) if you get stuck.

2 Likes