Warmup is settling in incorrect neighborhood when fitting complex model from neutrino physics

Sorry this is such a long post, but it was hard to give enough context to understand my problem and still be brief :-\

[first, context on my model]
I’ve been working for a while on fitting a model from neutrino physics to simulated data for a specific experiment. Fundamentally, the model has 3 parameters of interest, called \Delta m^2, \theta, and \delta; the latter two are angles and are therefore cyclic. The first one has no physical boundaries and all values are allowed. There are also numerous nuisance parameters that arise from the various systematic uncertainties in the simulation of the experiment, which are treated as Gaussian in this model.

The experiment has two detectors: one that is sensitive to both the parameters of interest and the nuisance parameters (let’s call it the “Physics Detector” for the purpose of this discussion); and a second detector which is sensitive only to a subset of the nuisance parameters, which is intended to constrain them (the “Systematics Detector”).

Because the simulation is run by an external library, I can’t use the Stan modeling language to fit it. But by reverse-engineering the code that comes out of CmdStan I’ve managed to figure out how to hook Stan up to the C++ interface of the code that drives the simulation. (I can give more details if needed.)

The likelihood given to Stan is computed by comparing histograms of measurements using a Poisson-inspired binned likelihood function; the simulation produces alternate histograms depending on the values of the physics & nuisance parameters.

[Ok, now my issue]
I’ve been exercising the model by fitting sets of simulated data using NUTS with a diagonal metric. Stan fits this model beautifully, and always converges to the correct answers, when I only fit the predictions for the Physics Detector, regardless of how many of the parameters of interest or nuisance parameters are included. It also fits the nuisance parameters successfully if I only fit the Systematics Detector alone. But when I attempt to fit the two together, Stan is in general unable to find the correct typical set for the parameters of interest; however, the nuisance parameters are correctly inferred. The values inferred for \Delta m^2 in particular are often orders of magnitude away from the correct answer.

I think this is because the curvature of the likelihood function in the nuisance directions is very different from that in the directions of the physics parameters. (That is, because the Systematics Detector constrains the nuisance parameters powerfully, the effect on the likelihood of changing the nuisance parameters is much stronger than changing the physics parameters.) I guess the ‘phase change’ in LL when the typical set around the nuisance parameters is found is somehow confusing the adaptation from continuing to converge onto the physics parameters?

By looking at the values chosen as a function of sample number during warmup, I see things that suggest this. (For the plots of parameters, correct values, where they are within the axis ranges, are given by the dotted lines.)

First, the (negative of) the log of the log likelihood (because so many orders of magnitude are being crossed; plotting LL makes it impossible to see any features):

Then the parameters of interest:

Lastly, the 5 nuisance parameters that were sampled in this particular run (I’ve done many others with anything from 1 to the whole suite of about 75, with similar results):

You can see Stan converging onto the correct values of the nuisance parameters in stages by sample 100 or so. Interestingly, two of the physics parameters hardly move at all until this, while \Delta m^2 moves at the beginning, but in the wrong direction (the correct value is +2.5). (I ran several hundred chains with the same settings and this behavior is fairly typical. Only about 15% of the chains even came close to the right neighborhood.)

I think it’s likely that the complexity of the model is part of the problem. E.g., if I brute-force scan across \Delta m^2 (with the other parameters held fixed at their correct values), the LL has a lot of features (note right plot is a zoom of the left with the correct value indicated by the pink arrow, the only point that’s actually at LL = 0):

It’s essentially a monotonic function on either side of the maximum (increasing below it, decreasing above it) convolved with two different periods of wiggles. It seems that Stan is usually getting stuck in one of the larger period wiggles. What baffles me, though is that (as noted above) Stan has no trouble fitting this model and finding the maximum when I don’t include the second detector constraint (even if the same nuisance degrees of freedom are incorporated).

I’ve tried numerous iterations on this to see if I can get the exploration of the physics parameter space to improve (with several hundred chains in each permutation to make sure it wasn’t a function of the initialization point):

  • Increasing the number of warmup samples (though from these metrics you can see it’s probably fully warmed up by around sample 100, and that’s also typical)
  • Increasing the number of samples used in the initial “fast” adaptation period (init_buffer = 125 in CmdStan parlance)
  • Varying adapt_delta: tried 0.85, 0.9, 0.95 with only marginal improvements

I don’t quite understand the adaptation process well enough to be able to guess whether I should keep trying more values for these hyperparameters (adapt_delta=0.99?) or whether there are other things I should try instead (use a dense mass matrix?).

Does anyone have advice on what I can try to help Stan explore the model space better?

I’m more than happy to supply any other plots or further info if that would be helpful to try to figure out what’s going on. (There’s also no reason I couldn’t share the code, though it’s bolted onto a much larger framework that is unlikely to give a non-expert much insight without weeks of study, so I’m not inclined to go that route unless somebody wants to work on this longer-term with me.)

I’d love any insight anyone can offer!

1 Like

Do you have any prior information on the magnitude of \Delta m^2? If you have, then incorporate that information as informative prior and initialize the parameter in the typical set of that prior.

Thanks for responding so fast! I forgot to mention it, but I did indeed try with a prior that is perhaps a bit looser than what I would use in a final fit but should have been a pretty strong constraint — Gaussian with \mu = 2.5 and \sigma=0.5 (\times 10^{-3} eV^{2}). This didn’t meaningfully change the behavior—for instance in one run I examined, the samples were centered around \Delta m^2 \sim 22.2 with an RMS of \sim0.02 — despite the fact that the prior yields a LL of -19500 at that value, which seems like it should be more than enough to drive it back towards the center.

Did you also initialize with values compatible with that prior information? If the posterior is multimodal, chains can still get stuck.

Yep! I’ve been initializing with \Delta m^{2} = +2.525, which is quite close to the correct value and well within the prior range.

Presumably you are using realistic priors for the phase factor and the 23 mixing angle as well?

Those are a bit harder.

  • For \delta_{CP} a “realistic” prior is essentially uniform. (There are combinations with some other parameters of interest that are disfavored, but given the right combinations any value of \delta_{CP} is consistent with the available measurements.)
  • For \theta_{23} a similar situation is true—the “realistic” values of \theta_{23}, per the existing measurements, depend a lot on the value of \Delta m^{2}. So I left it as uniform as well.

Because both are angular variables, I make them bounded parameters in [0, 2\pi). This at least prevents the huge jumps in step size I saw with \Delta m^2, though it’s not clear to me if there are any other side effects.

(Because there are several variables and the likelihood as a function of them is quite complicated (and it’s calculated quasi-numerically in a big library), it doesn’t seem to me like there’s a straightforward way to reparameterize to make joint priors over them possible.)

I also want to reiterate that when I only fit the “physics detector” data in the scenario above, everything works just fine, with the priors and large numbers of different seed points I’ve tried within the general range. I have the impression that there’s something qualitatively changing when adding strong constraints to the nuisance parameters with the “systematics detector” and I can’t quite figure out what.