Hi Stan devs,
I am working on a spherical Gaussian mixture model in which each cluster has observations from a reference set and a query set. Reference and query observations have similar but not identical means, so I represent each query mean as the reference mean plus a distortion, and I place strongly informative priors on the distortion parameters to shrink them toward zero. Unfortunately, the posterior estimates often exhibit essentially no shrinkage, even when the priors are extremely concentrated. I don’t understand what’s going on, but I’ve reproduced a similar issue in the attached example. I’m hoping that if I can understand this example, it will eventually help me solve my actual problem.
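For concreteness, here is a minimal sketch of the generative structure I have in mind (NumPy rather than Stan; all dimensions, scales, and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 3, 2     # clusters and dimensions (illustrative)
tau = 0.1       # strongly informative prior scale on the distortions
sigma = 1.0     # spherical observation noise sd

mu_ref = rng.normal(0.0, 5.0, size=(K, D))   # reference cluster means
delta = rng.normal(0.0, tau, size=(K, D))    # distortions, shrunk toward 0
mu_query = mu_ref + delta                    # query means = reference + distortion

# observations from the first cluster of each set
x_ref = rng.normal(mu_ref[0], sigma, size=(100, D))
x_query = rng.normal(mu_query[0], sigma, size=(100, D))
```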
In the example, I assume all cluster assignments are known, so at this point I am just fitting some spherical Gaussians with Gaussian priors on the means; there are no mixture-model complications to worry about. Because the cluster assignments are known, I was able to pre-compute sufficient statistics for each Gaussian; this greatly improves the scaling, because my loops run over clusters instead of individual observations. Past tests have confirmed that this matches slower code that loops over the individual observations, but let me know if you think it could be causing issues. I need to fit this model many times over, so I have chosen to get MAP estimates via L-BFGS instead of MCMC.
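In case it helps, the sufficient-statistic trick is the standard one for a spherical Gaussian: given the per-cluster count, mean, and sum of squared norms, the log-likelihood can be evaluated without touching individual observations, via sum_i ||x_i - mu||^2 = sumsq - 2 n (xbar . mu) + n ||mu||^2. A NumPy sketch of the identity (illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
n, D = 50, 3
sigma = 0.7
mu = rng.normal(size=D)
x = rng.normal(mu, sigma, size=(n, D))

# slow version: sum over individual observations
slow = -0.5 * n * D * np.log(2 * np.pi * sigma**2) \
       - 0.5 / sigma**2 * np.sum((x - mu) ** 2)

# fast version: pre-computed sufficient statistics
xbar = x.mean(axis=0)        # per-cluster sample mean
sumsq = np.sum(x ** 2)       # per-cluster sum of squared norms
fast = -0.5 * n * D * np.log(2 * np.pi * sigma**2) \
       - 0.5 / sigma**2 * (sumsq - 2 * n * (xbar @ mu) + n * (mu @ mu))

assert np.isclose(slow, fast)
```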
I have attached two Stan models that are mathematically equivalent but differently parameterized. While fitting either one, I get some messages saying “Error evaluating model log probability: Non-finite gradient”, but unless the prior is extremely concentrated, the optimizer continues and eventually issues a return code of 0. In the final fits, both seem to attain the same log posterior value (up to different additive constants). Yet only one respects the priors; the other shows little shrinkage.
Why do these two models give such different estimates? The log posterior is quadratic in every parameter except the variance, so there should be no local maxima to get stuck in. A back-of-the-envelope calculation also suggests the posterior variance shouldn’t be very large.
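For reference, the back-of-the-envelope calculation I mean is the standard conjugate-normal one: with n observations of noise sd sigma and a N(0, tau^2) prior on a mean component, the posterior precision is n/sigma^2 + 1/tau^2, and the MAP is the sample mean multiplied by a shrinkage factor. A quick check with made-up numbers:

```python
import numpy as np

n, sigma, tau = 100, 1.0, 0.1   # illustrative numbers
xbar = 0.5                      # sample mean of the relevant component

prec = n / sigma**2 + 1 / tau**2     # posterior precision = 200
post_var = 1 / prec                  # posterior variance = 0.005 (small)
shrink = (n / sigma**2) / prec       # shrinkage factor = 0.5
map_est = shrink * xbar              # MAP pulled halfway toward the prior mean 0
```

So with a prior this concentrated I would expect the MAP to sit well inside the prior's reach, not at the unshrunk sample mean.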
Differences in results:
different_map_estimates.pdf (316.1 KB)
Data are here: