Gaussian mixture model with interval censored data -- where do the divergences come from?


#1

I am fitting a simple mixture model to interval-censored data. To do so I created a user-defined pmf, based on a normal distribution, that maps the continuous normal distribution onto the discrete scale of my data.

functions {
  // Define log probability density function for an interval censored normal
  real intervalCensoredNormal_lpmf(int y, real mu, real sigma, vector breaks) {
    // This is a discrete log probability mass function defined for integers between 0 and rows(breaks).
    // Assigns the integral of the normal probability between breaks y and y+1 to y.
    // If y is 0 assigns integral of normal probability between -inf and break 1.
    // If y is rows(breaks) assigns integral of normal probability between the last break and +inf.

    real p;

    if (y == 0) { // from -inf to the first break
      p = normal_lcdf(breaks[1] | mu, sigma);
    } else if (y == rows(breaks)) { // from the last break to +inf
      p = normal_lccdf(breaks[rows(breaks)] | mu, sigma);
    } else { // between two breaks
      p = log_diff_exp(normal_lcdf(breaks[y+1] | mu, sigma), normal_lcdf(breaks[y] | mu, sigma));
    }
    
    return p;
  }
}

data {
 int<lower = 0> N;
 int y[N]; // an array with N elements
 int<lower = 0> K;
 vector[K] intervalBreaks; // a vector with K elements
}

parameters {
  ordered[2] mu; // this must be ordered (an ordered vector) so the chains do not constantly switch between the two distributions
  real<lower=0> sigma[2];
  real<lower=0, upper=1> theta;
}

model {
 mu[1] ~ normal(4, 4);
 mu[2] ~ normal(8, 4);
 sigma ~ normal(0, 5);
 theta ~ beta(1.5, 1.5);
 for (n in 1:N)
   target += log_mix(theta,
                     intervalCensoredNormal_lpmf(y[n] | mu[1], sigma[1], intervalBreaks),
                     intervalCensoredNormal_lpmf(y[n] | mu[2], sigma[2], intervalBreaks));
}
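For intuition, the same pmf can be sketched in Python (this is an illustration, not the Stan code; the break values and parameters below are made up): each integer category receives the normal probability mass of its interval, and because the categories partition the real line, the masses sum to one.

```python
import math

def normal_cdf(x, mu, sigma):
    # Normal CDF via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def interval_censored_normal_pmf(y, mu, sigma, breaks):
    """Probability of integer category y (0-based Python indexing,
    so Stan's breaks[y] corresponds to breaks[y - 1] here).
      y == 0           -> P(X < breaks[0])
      y == len(breaks) -> P(X >= breaks[-1])
      otherwise        -> P(breaks[y - 1] <= X < breaks[y])
    """
    K = len(breaks)
    if y == 0:
        return normal_cdf(breaks[0], mu, sigma)
    elif y == K:
        return 1.0 - normal_cdf(breaks[-1], mu, sigma)
    else:
        return normal_cdf(breaks[y], mu, sigma) - normal_cdf(breaks[y - 1], mu, sigma)
```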

A figure of my data is attached. Note that the first and last bars of the histogram actually comprise values that can lie anywhere between negative infinity and the lower break (respectively, between the upper break and positive infinity), which I accounted for in the pmf. That’s why the first bar is so high.

Now the problem:

Everything runs smoothly, but I do get over 100 divergences, even with adapt_delta set to 0.9 or 0.99. Initially I wasn’t worried, since the divergences are quite evenly distributed in space when shown with pairs from rstan (figure attached), so I didn’t expect them to introduce major bias. However, in the same figure in shinystan (attached) all the divergences are around the smaller mode, and I am afraid that this mode remains under-represented because of the divergences.

  1. Any ideas why those divergences arise and if I can get rid of them?
  2. If not, should I be worried that the fit is biased, based on the divergences shown in shinystan?

Thanks for your advice.

density.pdf (38.2 KB)
pairs.pdf (156.1 KB)
shinstan-bivariate(1).pdf (257.5 KB)


#2

Since you only have 5 parameters, could you just post the pairplots with all five of them in there?


#3

Sure, here it is.

Again, when I plot the same thing in shinystan the divergences are not shown in the same place, but are all concentrated in one of the modes (where mu[1] is around 3).

Thanks.


#4

Cool beans. Unfortunately it doesn’t seem to be saying much :P. None of the parameters are wildly big or small.

But looking back at your data, have you considered some sort of zero-inflated model? Check section 13.7, Zero-Inflated and Hurdle Models (p. 197), in the 2.17.0 manual. Your situation looks more like that than a mixture of two normals. Maybe that’d work better.

Back to “I wasn’t worried, since the divergences are quite well distributed in space” – it’s kinda hard to conclude much from the locations of divergences when they’re all spread around, unfortunately. Any divergences are still bad news bears. @martinmodrak has a neat example here: Getting the location + gradients of divergences (not of iteration starting points), but that’s further down the line.


#5

That’s weird and never happened to me - please double check that you are using the same fit for ShinyStan and pairs. Otherwise it might make sense to file a bug report.

Generally divergences are a worry. IMHO the only way to make sure they are OK to ignore is to run simulation-based calibration (SBC) for the model, which is a lot of work and time – you might not want to go that way.

I second @bbbales2 that a discretized normal seems a weird choice for these data; they look very much like a zero-inflated negative binomial.
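For concreteness, a zero-inflated negative binomial puts extra probability mass at zero on top of an ordinary count distribution. A minimal Python sketch (illustrative only; `pi` is the zero-inflation probability, `r` and `p` parameterize the NB with an integer number of successes):

```python
import math

def zinb_pmf(y, pi, r, p):
    # Zero-inflated negative binomial: with probability pi the count is a
    # structural zero; otherwise y ~ NB(r, p) with pmf C(y+r-1, y) p^r (1-p)^y
    # (r = integer number of successes, p = success probability).
    nb = math.comb(y + r - 1, y) * p**r * (1.0 - p)**y
    return pi * (1.0 if y == 0 else 0.0) + (1.0 - pi) * nb
```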


#6

Thanks, I will have a look at other possible models.

Just for reference, it seems to be known that pairs and shinystan give different information about the location of divergences:
https://groups.google.com/forum/#!msg/stan-users/nFx6k2KhLSA/KX6z4qxnAQAJ


#7

Just another stupid question: if I implemented the same model in JAGS, would it be likely to suffer from the same problems? I’m trying to work out whether this is a Stan-specific problem or a problem with my model and data.


#8

The way to check this is to simulate data from your actual model, fit it, and see if the results are sane.

The only indicator you had that something bad was happening was the divergences, which come from the HMC sampler (Stan uses a variant of HMC internally). JAGS doesn’t use HMC and so wouldn’t have those diagnostics, so it might look like it’s working, or working better, when actually it’s not.
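A minimal Python sketch of such a simulation check (parameter values are illustrative; the binning mirrors the interval censoring in the model):

```python
import random

def simulate_interval_censored_mixture(n, theta, mu, sigma, breaks, seed=0):
    # Draw from theta * N(mu[0], sigma[0]) + (1 - theta) * N(mu[1], sigma[1]),
    # then record only which interval each draw falls into (the censoring).
    rng = random.Random(seed)
    ys = []
    for _ in range(n):
        k = 0 if rng.random() < theta else 1
        x = rng.gauss(mu[k], sigma[k])
        ys.append(sum(1 for b in breaks if x >= b))  # number of breaks at or below x
    return ys
```

Fitting the model to such data with known true parameters shows whether the sampler recovers them and whether the divergences reappear.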


#9

I can reproduce very similar behaviour with simulated data, so I guess it likely has something to do with the interval censoring.


#10

I reformulated the model with a categorical distribution, which is much nicer and simpler. I also replaced normal_lcdf and normal_lccdf with Phi, because they return -inf and 0, respectively, for values of (y - mu)/sigma outside (-32.75, 8.25), which probably caused some of the divergences (Stan Manual p. 523, version 2.17.0-1). This got rid of all the divergences.
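A small Python illustration of the underlying numerical issue (this is not Stan’s implementation, just a double-precision analogue): a naive normal CDF built from erf saturates to exactly 0 surprisingly early in the lower tail, while an erfc-based form stays positive much deeper; once two CDF values both hit 0, an interval’s log-probability becomes -inf or nan.

```python
import math

def phi_naive(z):
    # 0.5 * (1 + erf(z / sqrt(2))): all accuracy is lost once erf rounds to -1.0
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_erfc(z):
    # 0.5 * erfc(-z / sqrt(2)): stays positive much deeper into the lower tail
    return 0.5 * math.erfc(-z / math.sqrt(2.0))
```

Here phi_naive(-9.0) is already exactly 0.0, while phi_erfc(-9.0) is still positive; around z ≈ -37.5 even the erfc form underflows to 0, the same kind of saturation region described in the manual.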

data {
  int<lower = 0> N;
  int y[N]; // an array with N elements
  int<lower = 0> K;
  vector[K] intervalBreaks; // a vector with K elements
}

parameters {
  ordered[2] mu; // this must be ordered (an ordered vector) so the chains do not constantly switch between the two distributions
  real<lower=0> sigma[2];
  real<lower=0, upper=1> theta;
}

transformed parameters {

  simplex[K+1] tau; // the probabilities of the K+1 categories defined by the breaks in intervalBreaks

  // Using Phi (with scaled and shifted x) here instead of normal_cdf, because normal_cdf
  // returns 1 for (y - mu)/sigma > 8.25 (which is a very low limit). It also has some
  // advantages to work with the actual probabilities instead of logs here (log_diff_exp
  // returns nan if a == b). See Stan Manual p. 523 (version 2.17.0-1).
  // If this does not work I could also use Phi_approx, which has even larger support.


  for (k in 1:K+1){
    if (k == 1)
          tau[k] =      theta * Phi((intervalBreaks[1]-mu[1])/sigma[1])
                      + (1-theta) * Phi((intervalBreaks[1]-mu[2])/sigma[2]);
    else if (k > 1 && k <= K)
          tau[k] =      theta * Phi((intervalBreaks[k]-mu[1])/sigma[1])
                      + (1-theta) * Phi((intervalBreaks[k]-mu[2])/sigma[2])
                      - theta * Phi((intervalBreaks[k-1]-mu[1])/sigma[1])
                      - (1-theta) * Phi((intervalBreaks[k-1]-mu[2])/sigma[2]);
    else if (k == (K+1))
          tau[k] =      1
                      - theta * Phi((intervalBreaks[K]-mu[1])/sigma[1])
                      - (1-theta) * Phi((intervalBreaks[K]-mu[2])/sigma[2]);
  }

  //print(tau)

}

model {
  mu[1] ~ normal(4, 3);
  mu[2] ~ normal(8, 3);
  sigma ~ normal(0, 5);
  theta ~ beta(1.5, 1.5);

  y ~ categorical(tau);
  
}
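As a sanity check, the transformed parameters block can be mirrored outside Stan to confirm that tau is a valid simplex. A Python sketch with illustrative inputs:

```python
import math

def phi(z):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mixture_tau(theta, mu, sigma, breaks):
    # Category probabilities tau[1..K+1] of the two-component normal mixture,
    # discretized at `breaks`, mirroring the transformed parameters block above.
    def mix_cdf(b):
        return (theta * phi((b - mu[0]) / sigma[0])
                + (1.0 - theta) * phi((b - mu[1]) / sigma[1]))
    K = len(breaks)
    tau = [mix_cdf(breaks[0])]
    tau += [mix_cdf(breaks[k]) - mix_cdf(breaks[k - 1]) for k in range(1, K)]
    tau.append(1.0 - mix_cdf(breaks[-1]))
    return tau
```

The entries are non-negative (the mixture CDF is increasing) and telescope to 1, so `categorical(tau)` receives a proper simplex.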

The posterior still has two modes (an identifiability problem), which is possibly due to the model-choice issues you guys highlighted.