Count data modeling with discrete parameters

I am a new user of Stan, and am fitting a count data, which I do not know what the distribution is. But:

Y = 2^(z+1) - alpha,
where
alpha ~ Poisson(lambda[z+1]) (corrected 07/10, thanks Bob)
and
z ~ poisson(theta)

Y is the count data (chromosome), z is a parameter represents the stage of this sample. Most of z should be 0,1,2,3… so most of Y should be 2,4,8,16… but occasionally we have 3,5,6,7,…15… and some other numbers. (when alpha !=0 so there is missing of chromosome)

I am defining a customized function Poipoi for distribution of alpha (to marginalize integer z). But since I do not know the distribution of Y ( I didn’t figure out how to do solve this analytically ), I am still having the issue of this integer parameter z.

Any suggestion how to model this kind of data? Thank you so much in advance!

Here is my code. It’s kind of messed up but you will know what I am trying to do. Thank you!!

functions {
      real Poipoi_lpmf(int alpha, int Z, vector[] lambda, real theta){
        vector[Z] lp;
        for(z in 1:Z)
          lp[z] = poisson_lpmf(z | theta) + poisson_lpmf(alpha | lambda[z]);
        return log_sum_exp(lp);
      }
    }

    data {
      int N;// total number of samples
      int Z; // total number of categories
      vector[N] Y; //number of  count
    }

    parameters {
      real<lower=0> theta;  //categorical distribution parameters (with length)
      vector<lower=0>[Z] lambda;
      int z[N];  // I know this is wrong
    }

    transformed parameters{
      vector<lower=0>[N] alpha;
      for (n in 1:N)
        alpha[n] = 2^(z[n]+1)-Y[n]
    }

    model {
      for(n in 1:N)
        alpha[n] ~ Poipoi(Z, lambda[z[n]+1], theta);
    }

I’m afraid I don’t quite know what you’re trying to do. You probably want to rethink in terms of where the error’s coming in from sampling or measurement. I have no idea what alpha is—presumably some latent count? And then theta comes out of the blue in your math and has no explanation.

You can’t have integer parameters—they need to be marginalized out. If you marginalize it out, you don’t need to include it as a parameter. The manual has a chapter on latent discrete parameters with three fully worked examples.

It helps if you can write out the density with latent discrete parameters, then you have an explicit form to marginalize. For instance, if z was really an integer indicating which mixture component something is drawn from, then we might have something like this:

for (n in 1:N) {
  z[n] ~ categorical(theta);
  alpha[n] ~ poisson(lambda[z[n]]]);
}

Thank you Bob! And yes, I figured that my problem description was probably not clear…
The model itself is exactly as what you wrote on the code (despite that z is poisson in my model, this is because most of the case, z is zero, since Y (observed) is usually 2) – and that’s why I named it (distribution of alpha) “Poipoi” distribution.

The problem is that I do not know the distribution of Y. I was thinking to transfer Y to alpha, since alpha~Poipoi. But alpha = 2^(z+1) -Y, and z is the discrete parameter, and I am not sure how I should marginalize it (I mean this part of the expression)

Again, any advise would be highly appreciated !
Also, please let me know if I made the question clear, thank you!

If z is truly Poisson, you can’t fit the model exactly in Stan and need to use something like the approximation you used (put a finite upper bound on z so that you can marginalize).

If you can write the density out, we can help you marginalize. So far, I still don’t understand what you’ve written down as a density at the front of your original post. There’s a Y that’s not used and I’m thinking maybe lambda(z) should be lambda[z]. What I’m looking for is:

p(Y, z, alpha, theta, lambda, ...) = ...

where you indicate the types of the variables and whether they’re data or parameters.

Thank you Bob!

Data description:

  • Y is the observed data (chromosome counts, so usually it is 2)
  • z (~Poisson) is the parameter, indicating the stats of that sample. if z = 0 (and alpha =0), then Y = 2 (Y = 2^0). But z can also be 1,2,or 3. If it is 3, it means there are a lot of duplicate of the chromosome in that sample. Its parameter is theta.
  • alpha (~Poisson) is the parameter, which indicate the amount of chromosome lost. Usually it is zero. Sometime you lose one chromosome, but it could be more than that.
  • lambda is the parameter for alpha, and since sample in each stages has their common lambda (lambda_1, lambda_2, lambda_3, lambda_4 for z = 0,1,2,3)

Further question

If z is truly Poisson, you can’t fit the model exactly in Stan and need to use something like the approximation

I do not understand why?? Is that because it only allows discrete? Do you suggest approximating with continuous valuables??

If you can write the density out, we can help you marginalize.

Thank you! I did not figure out the solution for the density of Y yet… In theory, is all density marginalizable??

And yes, that should be lambda[z+1]. (corrected the original post). Thank you for pointing out !

I just meant write the distribution down with the latent parameters—that’s why I included z.

Yes, in theory, any density is marginalizable as

p(a) = SUM_b p(a, b)

or in the continuous case as

p(a) = INTEGRAL p(a, b) d.b

They’re just not all marginalizable in practice. When the sum is infinite, the loop will never terminate.

Thank you again Bob! So in this case, do you think it is not marginalizable?
Also I would appreciate if anyone has any suggestion on alternative approach to specify the model ( I am quite new in bayesian modeling and will be grateful for any comments.) Thank you!

The usual thing to use a continuous z[n]. For instance, the beta-binomial distribution uses a continuous latent parameter with a gamma distribution and then plugs that into the Poisson.