Discrete probability distribution - continuous data?

Hi,

I have derived a discrete probability distribution for the variable z \in \{0, 1, ..., 2S-1, 2S\}, and I have data which is an estimate of y_i=\frac{z_i}{2S}. However, due to the noise introduced when the data are gathered, y_i is a continuous variable between 0 and 1. I am having trouble formulating this model correctly in Stan. My first thought was to take the logit transform of y_i and add normal noise to logit(\frac{z}{2S}); however, I think this fails when z = 0 or z = 2S, since the logit is infinite there.
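The boundary failure is easy to see numerically; a quick Python check (the `logit` helper here is my own):

```python
import numpy as np

def logit(p):
    # log-odds, log(p / (1 - p)); diverges at the boundaries p = 0 and p = 1
    p = np.asarray(p, dtype=float)
    return np.log(p) - np.log1p(-p)

with np.errstate(divide="ignore"):
    print(logit([0.0, 0.5, 1.0]))  # -inf, 0, +inf
```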

The probability distribution I have derived is:

p(Z=z | \mu , \gamma, \lambda, S) = \sum_{x=0}^{z} \binom{S}{x} \binom{S}{z-x} \frac{ \prod_{i=1}^{x}[(i-1)\lambda + (S-1)\mu] \prod_{j=1}^{S-x} [(j-1)\lambda + (S-1)\gamma] \prod_{k=1}^{z-x} [(k-1)\lambda + (S-1)\mu] \prod_{l=1}^{S-(z-x)}[(l-1)\lambda + (S-1)\gamma]}{\left( \prod_{m=1}^{S}[(m-1)\lambda + (S-1)(\mu + \gamma)] \right)^2}

I would like to estimate the parameters \mu , \gamma, \lambda and S.
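Before coding this in Stan, I sanity-checked the pmf numerically. A minimal Python sketch (function name and parameter values are my own; the final numerator product uses \gamma, matching the Stan function below):

```python
import math

def p_z(z, lam, mu, gam, S):
    # p(Z = z | mu, gamma, lambda, S), evaluated term by term from the formula above
    den = 1.0
    for m in range(1, S + 1):
        den *= (m - 1) * lam + (S - 1) * (mu + gam)
    total = 0.0
    # terms with x < z - S vanish because binom(S, z - x) = 0
    for x in range(max(0, z - S), min(z, S) + 1):
        term = math.comb(S, x) * math.comb(S, z - x)
        for i in range(1, x + 1):
            term *= (i - 1) * lam + (S - 1) * mu
        for j in range(1, S - x + 1):
            term *= (j - 1) * lam + (S - 1) * gam
        for k in range(1, z - x + 1):
            term *= (k - 1) * lam + (S - 1) * mu
        for l in range(1, S - (z - x) + 1):
            term *= (l - 1) * lam + (S - 1) * gam
        total += term
    return total / den**2

# arbitrary test values; the pmf over z = 0, ..., 2S should sum to 1
lam, mu, gam, S = 1.0, 0.2, 0.3, 5
total = sum(p_z(z, lam, mu, gam, S) for z in range(2 * S + 1))  # should be 1 up to rounding
```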

My Stan code:

functions{

    // log pmf of the single-set steady-state distribution p(X = x) for
    // x = 0, ..., S, then the convolution p(Z = z) for z = 0, ..., 2S
    vector methylation_log_prob(real lambda, real mu, real gamma, int S){
        vector[S+1] lprob_x;
        vector[2*S+1] lprob_z;

        for (x in 0:S){
            lprob_x[x+1] = lchoose(S, x);

            for (j in 1:S){
                if (j <= x){
                    lprob_x[x+1] += log((j-1)*lambda + (S-1)*mu);
                }
                if (j <= S-x){
                    lprob_x[x+1] += log((j-1)*lambda + (S-1)*gamma);
                }
                lprob_x[x+1] += -log((j-1)*lambda + (S-1)*(mu+gamma));
            }
        }

        for (z in 0:2*S){
            if (z <= S){
                vector[z+1] lterm; 
                for (x in 0:z){
                    lterm[x+1] = lprob_x[x+1] + lprob_x[z-x+1];
                }
                lprob_z[z+1] = log_sum_exp(lterm);
            }
            else{
                vector[2*S - z + 1] lterm;    // x runs from z-S to S
                for (x in z-S:S){
                    lterm[S-x+1] = lprob_x[x+1] + lprob_x[z-x+1];
                }
                lprob_z[z+1] = log_sum_exp(lterm);
            }
        }

        return lprob_z;
    }

    real Mvalue_lpdf(vector M, real lambda, real mu, real gamma, real sigma, int S){
        vector[2*S+1] lprob_z;
        vector[num_elements(M)] LL;
        vector[2*S+1] LL_z;
        real frac;
        real min_M;
        real max_M;
        real logit_frac;

        min_M = min(M);
        max_M = max(M);

        lprob_z = methylation_log_prob(lambda, mu, gamma, S);
        for (i in 1:num_elements(M)){
            for (z in 0:2*S){

                frac = z;
                frac = frac / (2*S);
                logit_frac = logit(frac);

                LL_z[z+1] = normal_lpdf(M[i]| logit_frac, sigma) + lprob_z[z+1];
            }
            LL[i] = log_sum_exp(LL_z);
        } 

        return sum(LL);
    }

}

data {
    int<lower=0> N ;                 // Number of Sites 
    vector<lower=0,upper=1>[N] y ;     // Fraction methylated
    int<lower=2> S;         // Stem Cell Number
}

transformed data{
    vector[N] M;
    M = logit(y);
}

parameters {
    real<lower=0> lambda;            // Replacement rate
    real<lower=0> mu;                // Methylation rate
    real<lower=0> gamma;             // Demethylation rate      
    real<lower=0> sigma;             // Measurement noise (logit scale)
}


model {
    lambda ~ normal(0, 5);            // Prior
    mu ~ normal(0, 5);                // Prior
    gamma ~ normal(0, 5);              // Prior
    sigma ~ normal(0, 5);              // Prior

    M ~ Mvalue(lambda, mu, gamma, sigma, S);
}

When I attempt to run this model, I get the error message -
Rejecting initial value:
Error evaluating the log probability at the initial value.
Exception: Exception: normal_lpdf: Location parameter is -inf, but must be finite! (in ‘unknown file name’ at line 46)

For reference, here is a histogram of my y_i values.
[meth_hist: histogram of the y_i values]

Rather than try to take the logit of 0 or 1, I substitute min(M_i) for the location when z = 0 (and max(M_i) when z = 2S), as in the following Stan code.

    real Mvalue_lpdf(vector M, real lambda, real mu, real gamma, real sigma, int S){
        vector[2*S+1] lprob_z;
        vector[num_elements(M)] LL;
        vector[2*S+1] LL_z;
        real frac;
        real min_M;
        real max_M;
        real logit_frac;

        min_M = min(M);
        max_M = max(M);

        lprob_z = methylation_log_prob(lambda, mu, gamma, S);
        for (i in 1:num_elements(M)){
            for (z in 0:2*S){

                if (z==0){
                    logit_frac = min_M;
                }
                else if (z==2*S){
                    logit_frac = max_M;
                }
                else{
                    frac = z;
                    frac = frac / (2*S);
                    logit_frac = logit(frac);
                }

                LL_z[z+1] = normal_lpdf(M[i]| logit_frac, sigma) + lprob_z[z+1];
            }
            LL[i] = log_sum_exp(LL_z);
        } 

        return sum(LL);
    }

The model now runs, however a single gradient evaluation takes ~30s and my computer dies before completing a single iteration. How can I speed up my Stan code so that it completes in a reasonable time?

Could you say more about what \mu, \gamma, \lambda, and S are and where they come from? Also, since S is discrete, it might be tricky to get that.

Thanks for your reply. Here is a summary of the model setup:

Consider a population of S switches which can either be ‘on’ or ‘off’. Each switch has the possibility of replacing one of the other S-1 switches at a rate \lambda. Further, each switch can flip from ‘off’ to ‘on’ with a rate \mu, or from ‘on’ to ‘off’ at a rate \gamma. Given that all the switches are initially ‘off’, what is the probability that X switches are ‘on’ at time t? We can assume that t>>\lambda^{-1}, \mu^{-1}, \gamma^{-1} .

By considering this as a random walk on x = {0, 1, ..., S-1, S}, we can derive the steady-state probability distribution:

p(X=x|\lambda, \mu, \gamma, S)=\frac{\binom{S}{x} \prod_{i=1}^x [(i-1)\lambda + (S-1)\mu] \prod_{j=1}^{S-x}[(j-1)\lambda + (S-1)\gamma]}{\prod_{m=1}^{S}[(m-1)\lambda+(S-1)(\mu+\gamma)]}

However, in our data each site actually contains two such sets of switches, and we measure the sum of the two. If we write Z=X+Y and assume the two sets are independent, then:

p(Z=z|\lambda, \mu, \gamma, S) = \sum_{x=0}^{S}\sum_{y=0}^S \delta_{x+y, z}p(X=x|\lambda, \mu, \gamma, S)p(Y=y|\lambda, \mu, \gamma, S)

This yields the expression given in my initial question. It is likely that the expression is not quite correct, as the replacement process creates a correlation between X and Y, but the maths is significantly more difficult when these correlations are considered.

I hope this explanation has made my problem slightly clearer. I know that Stan struggles with discrete parameters, but I was hoping that I could marginalise out S, as the range of plausible S is quite small according to the literature. Following the change-point example in the manual:

transformed data{
    vector[N] M;
    real log_unif;

    M = logit(y);
    log_unif = -log(n);    // n = number of candidate S values; int n and int S[n] must be declared in data
}


transformed parameters{
    vector[n] lp;

    lp = rep_vector(log_unif, n);
    for (i in 1:n){
        lp[i] = lp[i] + Mvalue_lpdf(M| lambda, mu, gamma, sigma, S[i]);
    }
}


model {
    lambda ~ normal(0, 5);            // Prior
    mu ~ normal(0, 5);                // Prior
    gamma ~ normal(0, 5);              // Prior
    sigma ~ normal(0, 5);              // Prior

    target += log_sum_exp(lp);
}
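For intuition, the marginalisation is just a log-mixture over the candidate values of S. A Python sketch (with `loglik(S)` standing in for `Mvalue_lpdf` at a fixed S, and a uniform prior over the candidates):

```python
import math

def log_sum_exp(xs):
    # numerically stable log(sum(exp(x) for x in xs))
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def marginal_loglik(loglik, S_candidates):
    # log p(M) = log sum_S p(S) p(M | S), with p(S) uniform over the candidates
    log_unif = -math.log(len(S_candidates))
    return log_sum_exp([log_unif + loglik(S) for S in S_candidates])
```

In the Stan version, the posterior over S can then be read off in generated quantities by applying softmax to `lp`.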

Does this seem like the correct approach to marginalisation, and is there a faster way to perform the gradient evaluation?

Just to not leave you hanging, it’ll take me a bit to get through this.

If S is small (like, are we talking 5?) then it seems like it wouldn't be prohibitive to marginalize it out, as in the change point example (https://mc-stan.org/docs/2_19/stan-users-guide/change-point-section.html). So I agree there.

I’m scared that there’s still a latent discrete parameter we’d want to estimate here. Not sure. I’d have to think more.

Have you tried generating data from this model without the continuous noise on top and fitting that? That might be the easier way to verify all the pmfs you’ve typed up so far.
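Something like this for the noiseless simulation (Python sketch; `pmf` here is a placeholder for the vector of p(Z = z) values computed from your derived distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_z(pmf, n):
    # draw n noiseless observations of Z from the discrete pmf over {0, ..., 2S}
    return rng.choice(len(pmf), size=n, p=pmf)

# e.g. S = 2 with a placeholder uniform pmf over the 2S + 1 = 5 levels
draws = simulate_z(np.full(5, 0.2), 1000)
```

You could then fit a categorical likelihood directly to `draws` and check that you recover the parameters you simulated with.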

Talked to a buddy about this. I remain confused.

The KS22 plot is showing the noise, right? So the noiseless data would be a pmf?

As far as the noisy version of things, I don’t know how you’d go from the measured continuous distributions to the underlying discrete distribution without sampling discrete parameters.

Thresholding manually seems like the easy thing to try, maybe this? Given S, I guess you’d know your bin edges? The downside would be that your bins would all be smeared together a bit. That’d be like taking the MLE estimator for Z, and then using that in your model.
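A sketch of that thresholding step (Python; `y` is the vector of measured fractions, and the helper name is mine):

```python
import numpy as np

def threshold_z(y, S):
    # snap each measured fraction to the nearest discrete level z / (2S),
    # i.e. the MLE for z under symmetric noise; clip handles out-of-range noise
    z = np.rint(np.asarray(y) * 2 * S).astype(int)
    return np.clip(z, 0, 2 * S)
```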