Convergence in a very shallow minimum



I have a model which turns out to have a very shallow minimum. The model aims to resolve two parameters. The first converges to the proper value of -7.269; the second is supposed to converge to -0.172, but instead descends to -infinity. I printed out the total log likelihood for a range of nearby values of parameter 2.

alphaArray = -7.269,1.5
totalLL = -51201.520345485376

alphaArray = -7.269,1.0
totalLL = -44408.776836654695

alphaArray = -7.269,0.0
totalLL = -43401.90530070493

alphaArray = -7.269,-0.03
totalLL = -43398.43894233183

alphaArray = -7.269,-0.172
totalLL = -43385.916293968396

alphaArray = -7.269,-0.3
totalLL = -43378.69248797088

alphaArray = -7.269,-1.0
totalLL = -43367.911351712995

alphaArray = -7.269,-2.0
totalLL = -43370.54172133411

alphaArray = -7.269,-10.0
totalLL = -43374.37777766442

alphaArray = -7.269,-1.0E11
totalLL = -43374.3792981491

As you can see, there is barely any difference to work with once the second parameter goes below -1.0. Does HMC have any ability to resolve such a shallow minimum?


Does HMC have any ability to resolve such a shallow minimum?

It’d just diagnose the problem in a different way, probably.

If everything between -10 and -1e11 is a reasonable value for the second parameter, then HMC would just spend all its time trying to explore that really big posterior, and it probably wouldn’t do it efficiently.

You probably wanna regularize this parameter somehow, for the optimizer, and for sampling.
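
To make the flat region concrete, here is a small sketch that tabulates how far each posted log likelihood sits below the best value on the grid. It uses only the numbers from the original post; the names `grid` and `gap_from_best` are made up for illustration.

```python
# Log likelihood values copied from the original post, with the first
# parameter fixed at -7.269 and the second parameter varied.
grid = {
    1.5: -51201.520345485376,
    1.0: -44408.776836654695,
    0.0: -43401.90530070493,
    -0.03: -43398.43894233183,
    -0.172: -43385.916293968396,
    -0.3: -43378.69248797088,
    -1.0: -43367.911351712995,
    -2.0: -43370.54172133411,
    -10.0: -43374.37777766442,
    -1.0e11: -43374.3792981491,
}

# Grid point with the highest log likelihood.
best_param = max(grid, key=grid.get)
best_ll = grid[best_param]


def gap_from_best(param):
    """How many log units below the best grid point this value sits."""
    return best_ll - grid[param]


for param in sorted(grid, reverse=True):
    print(f"{param:>10g}  gap = {gap_from_best(param):10.3f}")
```

On this grid the best point is -1.0. Everything from -10 down to -1e11 sits about 6.5 log units below it, and the values across that range differ from each other by only about 0.002. With no prior, that flat tail has no lower end, which is the very big region described above.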


Is it a matter of reducing the step size or restricting the search path? I know HMC is very efficient at jumping large distances in each draw. I am wondering if more traditional MCMC methods are more capable of zeroing in on such a shallow minimum?


With these things it’s the relative scales of the parameters that matter. If your alphaArray[0] value is about 7, ± 1, and then the other value is 0.5e11, ± 0.5e11, it’s gonna be hard for any numerical method to make sense of that. You could rescale your parameter space to make the scales more uniform.

HMC would do as well as anything (with or without the transform), but pretty much any method is gonna perform way better if it’s possible to put some regularization on that parameter.
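
A minimal sketch of what rescaling could look like, assuming rough centers and scales for the two parameters are available. The numbers in `CENTERS` and `SCALES` are illustrative, loosely based on the figures mentioned above, and the function names are hypothetical; none of this comes from the poster's model.

```python
# Hypothetical centers and scales for the two parameters. In practice these
# would come from whatever is known about plausible values.
CENTERS = (-7.0, -0.5e11)
SCALES = (1.0, 0.5e11)


def to_unit_scale(params):
    """Map raw parameters to roughly unit-scale working variables."""
    return tuple((p - c) / s for p, c, s in zip(params, CENTERS, SCALES))


def from_unit_scale(z):
    """Map unit-scale working variables back to raw parameters."""
    return tuple(c + s * zi for zi, c, s in zip(z, CENTERS, SCALES))


# The numerical method works on z; the model is evaluated at from_unit_scale(z).
z = to_unit_scale((-7.269, -1.0e11))
raw = from_unit_scale(z)
```

The idea is that the optimizer or sampler only ever sees the unit-scale variables, so both directions have comparable magnitudes. Rescaling alone does not remove a flat region; it only makes the two axes comparable.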


I know about L1 and L2 regularization, which apply a simple penalty for large coefficient values. Is that what you are suggesting here as well? Or some other regularization scheme which I may not be aware of?


Yeah, something like that, or a prior on the second parameter, or whatever you wanna call it. Whatever terminology makes more sense for what you’re doing (if it’s just fixing numerics, I’d call it regularization, or if it’s interpretable as a prior on some parameter, you can go that way).

As long as you don’t really need to explore that whole ridge, get rid of it!
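
As a rough illustration of what a penalty or prior does to the ridge, the sketch below adds a zero-centered normal log prior to the log likelihood values posted at the top of the thread. The prior scale `PRIOR_SD` is an arbitrary assumption, not a recommendation, and the function names are made up.

```python
# Log likelihood values copied from the original post (second parameter varied).
grid = {
    1.5: -51201.520345485376,
    1.0: -44408.776836654695,
    0.0: -43401.90530070493,
    -0.03: -43398.43894233183,
    -0.172: -43385.916293968396,
    -0.3: -43378.69248797088,
    -1.0: -43367.911351712995,
    -2.0: -43370.54172133411,
    -10.0: -43374.37777766442,
    -1.0e11: -43374.3792981491,
}

PRIOR_SD = 1.0  # hypothetical prior scale, chosen only for illustration


def log_prior(x, sd=PRIOR_SD):
    """Normal(0, sd) log density, dropping the constant term."""
    return -0.5 * (x / sd) ** 2


def penalized(x, sd=PRIOR_SD):
    """Log likelihood plus log prior at a grid point."""
    return grid[x] + log_prior(x, sd)


best_unpenalized = max(grid, key=grid.get)
best_penalized = max(grid, key=penalized)
best_tight_prior = max(grid, key=lambda x: penalized(x, sd=0.1))
```

With `PRIOR_SD = 1.0` the best grid point stays at -1.0, but the values at -10 and -1e11 are pushed far below it, so the ridge is gone. The chosen scale matters: with a scale of 0.1 the best grid point moves to -0.3.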


Still at this. I am able to use a prior to push the second parameter towards the right answer, but in a real-world situation, I really have no way of determining that prior. Also, from experimenting with different prior distributions, it appears the prior is the sole determinant of the outcome, as the model itself seems to have no influence at all.

I should point out that I am NOT solving for both of the parameters. In this test, actually, I fixed the first parameter and only look for the second.

When I check similar outputs for a parameter which the model is able to resolve, I see that when the difference in log likelihood is in the range of the 3rd decimal place (~1%), it works very well. In this example, as you can see in the original post, the difference is in the 4th decimal place. However, the minimum clearly does exist and is hovering around the correct range of values (-43367.911351712995 vs. -43378.69248797088). Back to my original way of thinking: is it possible to get Stan to resolve at that level of sensitivity by adjusting some configuration setting?


If you can’t justify a prior, and the model doesn’t identify the parameter, there’s not much you can do aside from trying to figure out what data you could collect that would identify your parameter.

If this is just a two parameter model, then you’ve got a lot more freedom to brute force things I suppose.

Nah, what the sampler goes for is the posterior. There isn’t really any fiddling beyond getting the model/parameterization in the way you like it. If you’re using the optimizer there would be some convergence criteria, if that’s what you want.
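
For the brute-force route with two parameters, a grid evaluation could look roughly like the sketch below. `toy_log_lik` is a made-up stand-in, not the model from this thread, and the grids are arbitrary; the real log likelihood would be plugged in where the stand-in is.

```python
import itertools


def toy_log_lik(a, b):
    """Stand-in log likelihood (NOT the model from this thread).

    Sharply peaked in a, only weakly curved in b.
    """
    return -((a + 7.269) ** 2) - 0.01 * (b + 1.0) ** 2


def brute_force_max(log_lik, grid_a, grid_b):
    """Evaluate log_lik on every (a, b) pair and return the best pair."""
    return max(itertools.product(grid_a, grid_b), key=lambda ab: log_lik(*ab))


grid_a = [-7.5, -7.25, -7.0]
grid_b = [-2.0, -1.0, 0.0]
best = brute_force_max(toy_log_lik, grid_a, grid_b)
print(best)
```

A grid like this only locates the best evaluated point. It does not make a weakly identified parameter any better identified.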


You brought up a good point. I found sample code for RStan, PyStan and MatlabStan, but nothing for CmdStan. Would you mind sending a reference to that?


Cmdstan is here:

edit: The model should be the same, though the data is formatted a bit differently. It’s documented in the CmdStan manual though, and there’s a way to export from R that is alright.


Didn’t realize using the optimizer is so easy. Nevertheless, the results aren’t any better, at least with the default settings.


I don’t think so. What you want to do is rescale parameters if you can so the priors are all roughly unit scale (ideally standard normal, as well).

We usually resort to hierarchical models for this.


Hi Bob. I don’t get the second point. I know about hierarchical models, if you mean things like consumer groups. There are plenty of examples of this in EP and elsewhere. How does this address the unknown prior? That’s not at all obvious to me.


It doesn’t address an unknown prior family, but it addresses parameters within a family by estimating them along with the rest of the parameters.
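
A minimal sketch of that idea, assuming a normal family: the prior's location `mu` and scale `tau` are treated as unknowns and enter the log density alongside the group-level parameters. The hyperprior scales (5.0 and 2.0) and all names here are hypothetical, and a full model would add the data log likelihood to this term.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of Normal(mean, sd) at x."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sd)
            - 0.5 * ((x - mean) / sd) ** 2)


def hierarchical_log_prior(alphas, mu, tau):
    """Log prior for group-level parameters with unknown mu and tau.

    The family (normal) is fixed by assumption; mu and tau are estimated
    together with the alphas instead of being chosen by hand.
    """
    if tau <= 0.0:
        return float("-inf")
    lp = normal_logpdf(mu, 0.0, 5.0)    # hypothetical hyperprior on mu
    lp += normal_logpdf(tau, 0.0, 2.0)  # half-normal on tau, up to a constant
    lp += sum(normal_logpdf(a, mu, tau) for a in alphas)
    return lp


alphas = [-0.2, -0.1, -0.3]
near = hierarchical_log_prior(alphas, mu=-0.2, tau=0.1)
far = hierarchical_log_prior(alphas, mu=5.0, tau=0.1)
```

Values of `mu` and `tau` that sit close to the group-level parameters score higher, which is how the groups inform the prior's parameters. This needs several related parameters (groups) to work with; it does not help pick the family itself.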