Problems with L-BFGS in a simple Gaussian example

Hi Stan devs,

I am working on a spherical Gaussian mixture model where each cluster has observations from a reference set and a query set. Reference and query observations have similar but different means, so I represent the query means as reference + distortion, and I use strongly informative priors to shrink the distortion parameters. Unfortunately, the posterior estimates often exhibit no shrinkage at all, even when I use priors that are extremely concentrated. I don't understand what's going on, but I've reproduced a similar issue in the attached example. I'm hoping that if I can understand this example, it will eventually help me solve my actual problem.

In the example, I assume all cluster assignments are known, so at this point I am just fitting some spherical Gaussians with Gaussian priors on the means; there are no mixture-model complications to worry about. Because the cluster assignments are known, I was able to pre-compute sufficient statistics for each Gaussian; this greatly improves the scaling because my loops run over the clusters instead of the individual observations. Past tests have confirmed that I get the same results as slower code that loops over the individual observations, but let me know if you think this is causing issues. I need to fit this model many times over, so I have chosen to get MAP estimates via L-BFGS instead of MCMC.
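For reference, the sufficient-statistic trick is just the usual decomposition (a sketch; my code may group terms slightly differently). For cluster $k$ with $n_k$ observations in $d$ dimensions, sample mean $\bar{x}_k$, and $S_k = \sum_i \lVert x_i - \bar{x}_k \rVert^2$:

$$
\sum_{i=1}^{n_k} \lVert x_i - \mu_k \rVert^2 = S_k + n_k \lVert \bar{x}_k - \mu_k \rVert^2,
$$

so the spherical-Gaussian log-likelihood for cluster $k$ is

$$
-\frac{n_k d}{2}\log(2\pi\sigma^2) \;-\; \frac{S_k + n_k \lVert \bar{x}_k - \mu_k \rVert^2}{2\sigma^2},
$$

and only $n_k$, $\bar{x}_k$, and $S_k$ are needed per cluster.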

I offer two Stan models that are mathematically equivalent but parameterized differently. During fitting of either one, I get some messages that say "Error evaluating model log probability: Non-finite gradient", but unless the prior is extremely concentrated, the optimizer continues and eventually issues a return code of 0. In the final fits, both seem to attain the same log posterior value (up to different constants). But only one pays attention to the priors; the other shows little shrinkage.

Why do these two models give such different estimates? The log posterior is just a quadratic (with respect to every parameter except the variance), so it should have no local maxima to get stuck in. A back-of-the-envelope calculation shows that the posterior variance shouldn’t be very big either.
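(For the back-of-the-envelope calculation: treating the variance $\sigma^2$ as known, the conjugate update for a single mean component with prior $N(\mu_0, \tau^2)$ and $n$ observations is

$$
\mu_{\text{post}} = \frac{n\bar{x}/\sigma^2 + \mu_0/\tau^2}{n/\sigma^2 + 1/\tau^2},
\qquad
\sigma^2_{\text{post}} = \left(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\right)^{-1},
$$

so even a moderately concentrated prior should pull the estimates noticeably toward $\mu_0$.)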

Differences in results:

different_map_estimates.pdf (316.1 KB)


run.R (2.9 KB) labels_known_reparam.stan (2.7 KB) labels_known.stan (2.4 KB)

Data are here:

The code here:

for(k in 1:K) {
  gene_means[,K]       ~ normal(mean_prior_mean, mean_prior_sd);
  gene_distortions[,K] ~ normal(distortion_prior_mean, distortion_prior_sd);
}

for(k in 1:K) {
  gene_means_raw[,K]       ~ normal(0, 1);
  gene_distortions_raw[,K] ~ normal(0, 1);
}

Looks wrong in both files (I assume the index should be little k?)
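With capital `K`, the sampling statements only ever touch column `K` (applied `K` times over), so columns 1 through `K-1` get implicit flat priors, which would explain the missing shrinkage. A corrected sketch, using the variable names from the first snippet (the reparameterized file needs the same fix):

```stan
for (k in 1:K) {
  // index with lowercase k so every column gets the prior,
  // not just column K applied K times
  gene_means[, k]       ~ normal(mean_prior_mean, mean_prior_sd);
  gene_distortions[, k] ~ normal(distortion_prior_mean, distortion_prior_sd);
}
```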


That certainly explains why the prior was being ignored! I wish I had Known that two weeKs ago! ThanKs!
