Dirichlet_log: probabilities is not a valid simplex


#1

Hi,

I am trying to fit a 4-component mixture model where the probability of component membership of each sample (pi) is estimated in a separate stan fit. I am using dirichlet to introduce uncertainty but getting many of “Exception thrown at line 20: dirichlet_log: probabilities is not a valid simplex. sum(probabilities) = nan, but should be 1”. I am using the first component to capture samples whose n is zero. Where do you think I am getting nan? Any comment would be greatly appreciated.

data {
    int<lower=1> N;
    int<lower=0> n[N];
    vector<lower=0,upper=1>[4] pi[N];
}
parameters {
    real<lower=0,upper=1> a4;
    real<upper=max(n)> a5;
    real<lower=1> b5;
    real<upper=max(n)> a6;
    real<lower=1> b6;
    real<upper=max(n)> a7;
    real<lower=1> b7;
}
model {
    for (i in 1:N) {
        vector[4] lps;
        vector[4] tau;
        tau ~ dirichlet(pi[i]);
        lps[1] = tau[1] * exp(neg_binomial_2_lpmf(n[i]|a4,  1));
        lps[2] = tau[2] * exp(neg_binomial_2_lpmf(n[i]|a5, b5));
        lps[3] = tau[3] * exp(neg_binomial_2_lpmf(n[i]|a6, b6));
        lps[4] = tau[4] * exp(neg_binomial_2_lpmf(n[i]|a7, b7));
        target += log(sum(lps));
    }
}

#2

from the entire model block:

    for (i in 1:N) {
        vector[4] lps;
        vector[4] tau;
        tau ~ dirichlet(pi[i]);
        lps[1] = tau[1] * exp(neg_binomial_2_lpmf(n[i]|a4,  1));
        lps[2] = tau[2] * exp(neg_binomial_2_lpmf(n[i]|a5, b5));
        lps[3] = tau[3] * exp(neg_binomial_2_lpmf(n[i]|a6, b6));
        lps[4] = tau[4] * exp(neg_binomial_2_lpmf(n[i]|a7, b7));
        target += log(sum(lps));
    }

First of all, tau is undefined in this loop. You need something like

simplex[4] tau[N];

in the parameters block.

Once you fix that, you have to use numerically stable constructions like:

        lps[1] = log(tau[i,1]) + neg_binomial_2_lpmf(n[i]|a4,  1);
        lps[2] = log(tau[i,2]) + neg_binomial_2_lpmf(n[i]|a5, b5);
        lps[3] = log(tau[i,3]) + neg_binomial_2_lpmf(n[i]|a6, b6);
        lps[4] = log(tau[i,4]) + neg_binomial_2_lpmf(n[i]|a7, b7);
        target += log_sum_exp(lps);

#3

I appreciate your response. One quick related general question. Do I need to declare tau in the parameters block even if I am not trying to estimate it?


#4

Now I am getting “dirichlet_log: prior sample sizes[1] is 0, but must be > 0!” with the following fix.

data {
    int<lower=1> N;
    int<lower=0> n[N];
    vector<lower=0,upper=1>[4] pi[N];  // data shape should be (N, 4)
}
parameters {
    real<lower=0,upper=1> a4;
    real<upper=max(n)> a5;
    real<lower=1> b5;
    real<upper=max(n)> a6;
    real<lower=1> b6;
    real<upper=max(n)> a7;
    real<lower=1> b7;
    simplex[4] tau[N];
}
model {
    for (i in 1:N) {
        tau[i] ~ dirichlet(pi[i]);
    }
    for (i in 1:N) {
        vector[4] lps;        
        lps[1] = log(tau[i,1]) + neg_binomial_2_lpmf(n[i]|a4,  1);
        lps[2] = log(tau[i,2]) + neg_binomial_2_lpmf(n[i]|a5, b5);
        lps[3] = log(tau[i,3]) + neg_binomial_2_lpmf(n[i]|a6, b6);
        lps[4] = log(tau[i,4]) + neg_binomial_2_lpmf(n[i]|a7, b7);
        target += log_sum_exp(lps);
    }
}

#5

How many times? Underflowing tau to 0 is not as big a deal.


#6

Many many times. If the message implies underflowing, I think it is all right since my pi’s contains lots of zeros although they sum to one. The fitting did not finish with the following message at the end.

Initialization between (-2, 2) failed after 100 attempts. 
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.

I guess having zeros in dirichlet parameter is not good?


#7

Yes, the concentration parameters should be positive although they do not have to sum to 1.


#8

The fitting finishes successfully with the following change although it may have introduced some bias.

tau[i] ~ dirichlet(pi[i]+0.000001);

Thanks Ben!


#9

Adding small epsilons is rarely what you want to do. How were those pi set in the first place? You should just set them in the data rather than setting them zero then adding epsilon. You should also run sensitivity analyses to see how much they affect inference—these things can sometimes have a suprrisingly strong effect.

When you have Dirichlet parameters less than one, the prior is putting most of its mass on solutions with zeros (technically, very very small non-zero values that round to zero because 1e-350 is roughly the smallest number that can be represented).

Negative binomials are already hard to fit—I imagine making a mixture of them is going to be extra difficult unless the clusters are well separated. Michael Betancourt wrote a case study on mixtures (see web site >> users >> documentation >> case studies).

You can simplify this a bit:

vector[4] lps;        
lps[1] = log(tau[i,1]) + neg_binomial_2_lpmf(n[i]|a4,  1);
lps[2] = log(tau[i,2]) + neg_binomial_2_lpmf(n[i]|a5, b5);
lps[3] = log(tau[i,3]) + neg_binomial_2_lpmf(n[i]|a6, b6);
lps[4] = log(tau[i,4]) + neg_binomial_2_lpmf(n[i]|a7, b7);

to

vector[4] lps = log(tau[i]);        
lps[1] += neg_binomial_2_lpmf(n[i]|a4,  1);
lps[2] += neg_binomial_2_lpmf(n[i]|a5, b5);
lps[3] += neg_binomial_2_lpmf(n[i]|a6, b6);
lps[4] += neg_binomial_2_lpmf(n[i]|a7, b7);

I don’t understand the naming of the a and b. Why not follow the ordering and make those

vector<lower = 1> b[3];
vector<upper = max(n)> a[4];

You should also give the a and b priors—like othe precision-like parameters, they can tend to drift upwards because ever higher values correspond to ever lower variance.


#10

This is ultimately a pathology of the empirical Bayes strategy of plugging a point estimate from some previous fit into the priors. It’s just made worse by the fact that the point estimates are themselves lie on the corners of parameter space. Adding epsilon to all of the pi regularizes the point estimates and avoids the immediate pathology, but one should still be skeptical of the validity of the resulting fit.


#11

I am dealing with a set of binomial (n_i, p_i) samples where there are four different classes according to p_i’s. Let’s assume they cluster pretty well. After I estimate the expectation on 4-dim latent variable (pi), I wanted to get the shrinkage estimators for n_i’s based upon the probabilistic class membership of each sample. In order to implement this idea, I had to separate classification and shrinkage estimation. Do you think there is a better way than using dirichlet? Do you think there’s a way to combine the classification and the shrinkage in one stan fitting? I would really appreciate any comment.


#12

Not only is the answer “yes”, it’s the standard approach in Bayesian hierarchical modeling.

The Dirichlet is a very weak prior in that it only gives you a way to model location and a kind of precision, not anything like covariance. A multivariate logistic normal is more flexible. You can see that in the “Correlated topic model” section of the manual (in the latent Dirichlet allocation section of the “Clustering Models” chapter).


#13

@Bob_Carpenter Here’s my original problem. Because I am trying to “reuse” pi, I don’t feel this is a standard situation for stan. Any advice would be great.

data {
    int<lower=1> N;
    int<lower=0> n[N];
    int<lower=0> x[N];
}
parameters {
    simplex[4] pi;
    real<lower=1> a[4];
    real<lower=1> b[4];
    real<upper=max(n)> c[4];
    real<lower=1> d[4];
}
model {
    vector[4] log_pi = log(pi);
    for (i in 1:N) {
        vector[4] lps = log_pi;
        for (j in 1:4)
            lps[j] += beta_binomial_lpmf(x[i]|n[i], a[j], b[j]);
        target += log_sum_exp(lps);
    }

    //
    // Now I also want to fit n's to estimate c's and d's using the same **pi**,
    // but I have already used "target" to fit x above.
    //
    //    for (i in 1:N) {
    //        vector[4] lps2 = log_pi;
    //        for (j in 1:4)
    //            lps2[j] += neg_binomial_2_lpmf(n[i]|c[j], d[j]);
    //        target += log_sum_exp(lps2);
    //

}

#14

It’s not a problem to use target more than once. You would just be adding more components to the log likelihood. It sounds like you have two equations per observation which is fine. It can help to identify the pi’s but it can also lead to a more complex geometry. I would lookout for divergences.


#15

@stijn It takes a long time and did not converge. In fact, I don’t want the neg_binomial components to influence the estimation of pi and that is the cause of my headache. I guess it is also related to a more general question: What are my options when n is not fixed in my binomial mixture data?


#16

I am just trying to think generatively here. This may or may not be helpful.

Your data generation process is something like the following. You have 4 different groups. In each group n ~ neg_binomial(c_j, d_j) and x ~ beta_binomial(n, a_j, b_j) with a_j, b_j, d_j > 0 and c_j < max(n).

You can already use that information to simulate x’s and n’s outside of Stan. First simulate groups of 4 a_j, b_j, c_j, d_j, then simulate n, and next x. You should then check whether these simulations are reasonable for your data. If not, you can start thinking about what is wrong with how you have written the model so far. For instance, you probably need more informative priors on a, b, c, and d. You have a mixture model with discrete distribution. Those are red flags for identifiability problems.

That made me think that maybe you expect the 4 groups to have different a_j, b_j but roughly similar c_j and d_j. If that is the case you do not need the mixture model for the neg_binomial. You can just work with a singular c_0 and d_0. When the neg_binomial component does not help to identify the groups (my previous assumption) and you are not interested in the c’s and d’s, you may not even need to estimate the neg_binomial components. (I am probably wrong in my assumptions but I am trying to understand why you want the neg_binomial in there).

What do you mean by ‘not fixed’? n varies per observation or n has measurement error or n varies per group or something else?


#17

This is just how Bayesian inference works. BUGS includes a hack called “cut” that breaks Bayesian inference by preventing information flow between random variables. We intentionally avoided that in Stan.

This is the way to think about things!


#18

I was trying to test whether n’s are different between clusters by comparing their confidence intervals, along the same sampled pi. I still needed to estimate c’s and d’s for that purpose. But I also found I was not thinking generatively. I appreciate your help @stijn and @Bob_Carpenter.


#19

I think you can do that in the generated quantities block or outside of stan. You don’t need the c’s and d’s or the negative binomial in your model.

  1. Simulate group membership ~ categorial_rng(pi) for each observation in each draw
  2. Calculate the mean or median of n or 80% interval or whatever floats your boat for each simulated group
  3. For each simulation, check whether the mean of group 1 > group 2 (or any other comparison you want to make).
  4. Interpret the distribution of the variable you calculate in 3. E.g. if in 50% of the draws group 1 > group 2 and in 50% group 2 > group 1, the groups have roughly the same mean n.

#20

Stan’s generating posterior intervals rather than confidence intervals.

You can usually just do these categorical estimates in expectation directly rather than simulating. That is, you get the weight of pi in each iteration for the expectation calculation. Same expected result as simulating, but more efficient due to Rao-Blackwell.

You can’t do this easily for continuous examples because the simple sums turn into integrals—so for continuous cases the RNG approach will usually be the best way to code it.