I’m fitting a mixture model in Stan to strictly positive data. Each component distribution has a mean that can be expressed as some sum of a and b. (a and b are two “factor” parameters shared across groups.) The mixing proportions are denoted as weights, w.
I have included the density plots for a, b, and w. They seem oddly shaped to me, and I was wondering if I should be worried.
My concerns: The density plot for b exhibits several bumps. Intuitively, since b is a factor of a distribution’s mean, I would have expected it to follow a relatively bell-shaped curve, maybe with a right skew due to the restriction to positive values. The weights w[2] and w[3] have wide intervals. Is this an indicator that these parameters are not identifiable, or only weakly identifiable, from the data?
What have I done so far?
I enforced an ordering constraint on the component means to prevent label‐switching.
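For reference, the constraint looks roughly like this. This is a stripped-down, generic sketch with a normal likelihood and placeholder names, not my actual model (which builds the component means from a and b):

data {
  int<lower=1> N;
  int<lower=2> K;
  vector[N] y;
}
parameters {
  ordered[K] mu;        // increasing means rule out label permutations within a chain
  real<lower=0> sigma;
  simplex[K] w;         // mixing proportions
}
model {
  for (n in 1:N) {
    vector[K] lp;
    for (k in 1:K)
      lp[k] = log(w[k]) + normal_lpdf(y[n] | mu[k], sigma);
    target += log_sum_exp(lp);
  }
}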
I simulated data sets from the prior, and the parameters are recoverable by the model (the 95% HPD intervals and credible intervals both contain the simulated parameter values). This leads me to believe structural non-identifiability is not the problem.
Convergence metrics (all R̂ ≈ 1.00) and trace plots indicate good mixing of the chains. Autocorrelation plots don’t show any major concerns after thinning, and effective sample sizes seem sufficiently large.
Posterior predictive checks don’t show any cause for concern: the max, min, mean, median, mode, and sd of simulated datasets are reasonably close to those of the observed data. This leads me to believe that the model reasonably reflects the data-generating process.
Nevertheless, I’m uneasy: are these jagged and wide density plots a red flag or a common occurrence in mixture models? Do I have reason to be concerned or am I searching for problems where there aren’t any? Any recommendations for diagnostics or best practices would be appreciated. Thank you!
Are you using positively constrained component distributions that are reasonable for the data?
How many draws did you have, and what was the effective sample size? This just looks like what you’d get from not much data: you’ll see this kind of jaggedness whenever you estimate densities from small samples. Otherwise, bumps usually indicate a form of multimodality.
The posterior predictive check is really the gold standard for whether a model can fit the data. If those checks look good and you’re measuring the statistics you actually care about, then the model should be fine.
Thank you so much for the response & apologies for all the questions included in my follow-up.
I am using negative binomials for the component distributions (the data are discrete counts). I am using a truncated normal distribution as the prior for a + b; we have prior information from previous studies on a + b, but not on the individual values. I then used a beta distribution as the prior on the proportion of a + b accounted for by a.
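In rough outline, the prior structure looks like this. This is a simplified sketch, not my actual code: the hyperparameter values, the names total and prop, and the exact way the component means are built from a and b are all placeholders:

data {
  int<lower=1> N;
  int<lower=2> K;
  array[N] int<lower=0> y;
}
parameters {
  real<lower=0> total;           // a + b; the <lower=0> bound truncates the normal prior
  real<lower=0, upper=1> prop;   // share of a + b accounted for by a
  real<lower=0> phi;             // negative binomial dispersion
  simplex[K] w;
}
transformed parameters {
  real a = prop * total;
  real b = (1 - prop) * total;
  vector[K] mu;
  for (k in 1:K)
    mu[k] = a + (k - 1) * b;     // placeholder: some sum of a and b per component
}
model {
  total ~ normal(5, 2);          // placeholder hyperparameters
  prop ~ beta(2, 2);             // placeholder hyperparameters
  phi ~ exponential(1);          // placeholder
  for (n in 1:N) {
    vector[K] lp;
    for (k in 1:K)
      lp[k] = log(w[k]) + neg_binomial_2_lpmf(y[n] | mu[k], phi);
    target += log_sum_exp(lp);
  }
}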
I have 4 chains of 10000 iterations with thinning = 5. I used the default, so the first half of the samples are dedicated to warmup. This leaves me with 4000 posterior draws. Could that be too few draws? It seems like a reasonable number to me, but I am new to Stan so maybe I am overlooking something obvious concerning the number of draws.
The effective sample sizes are shown below:
w[1] 3880.632
w[2] 3758.774
w[3] 3693.810
w[4] 3904.761
w[5] 3850.515
w[6] 4173.410
w[7] 3705.880
w[8] 3956.897
w[9] 4016.341
w[10] 3834.332
w[11] 3975.094
b 3506.461
a 3950.396
To me, the effective sample sizes didn’t raise any red flags as they are all near the 4000 mark. Am I missing something here?
Yes, I am concerned about multimodality. It seems like multimodality would be a sign of label switching, but again, I do not think that is the problem here as I have enforced an order on the means of the component distributions. I did not think it was necessary to enforce order on both weights and means to prevent label switching. Structural non-identifiability or partial-identifiability due to the data set used would be the other culprits for multimodality, right?
I have included the pairs plot for a and b in case that is relevant.
Thanks for the reassurance regarding the posterior predictive check! All of my posterior predictive p-values sit in the 0.4–0.6 range, and histograms of predicted values display the same patterns as the observed data.
So a and b are continuous values? If so, then this all seems reasonable, assuming you want the prior to penalize large values with exponential tails and to be consistent with zero. A lognormal, by contrast, approaches zero density as the value approaches zero, whereas a half normal approaches a fixed positive value at zero. If the beta distribution’s parameters are themselves being estimated, a mean and total count parameterization usually works better and is easier to set priors on.
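For example, Stan has this parameterization built in as beta_proportion. A minimal sketch, where the hyperpriors are placeholders:

parameters {
  real<lower=0, upper=1> mu;    // mean of the proportion
  real<lower=0> kappa;          // concentration, i.e., the "total count"
  real<lower=0, upper=1> prop;  // the proportion being modeled
}
model {
  mu ~ beta(2, 2);              // placeholder hyperpriors
  kappa ~ exponential(0.1);     // placeholder
  prop ~ beta_proportion(mu, kappa);
  // equivalent to: prop ~ beta(mu * kappa, (1 - mu) * kappa);
}

The advantage is that mu and kappa are directly interpretable, so it’s much easier to put priors on them than on the raw shape parameters alpha and beta.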
Negative binomials can be hard to fit, but if that’s not a problem, they should be better than a Poisson.
Those ESS values look good; good enough that you can probably get away with fewer iterations and no thinning. But if you want to do kernel density plotting, you may need more iterations to get everything to look super smooth. I’m attaching a plot generated using default kernel density estimation in R for a standard normal:
> x <- rnorm(4000)   # same number of draws as your posterior sample
> plot(density(x))   # kernel density estimate with R's default bandwidth
You can see that its default bandwidth leaves the plot quite jagged. If you make the bandwidth smaller (e.g., density(x, bw = 0.1)), it gets even more jagged, and if you make it wider, it gets smoother.