Almost all of these models wind up with long tails somewhere. Take this one: it's looking at alcohol consumption, and it seems that on the order of 99% of people consume 6 or more alcoholic beverages in a day at rates approaching 0 days per year (there's a ginormous spike at basically 0+epsilon days/yr), while the remaining 1% consume between 7 and 25 alcoholic beverages every other day (!!!).
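For concreteness, here's a toy sketch of that kind of long-tailed mixture shape. This is not the fitted model; the proportions and the component distributions are just placeholders chosen to reproduce the "huge spike near zero, small heavy tail" pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration only, not the actual fitted model: ~99% of people hit the
# 6+ drinks threshold on close to 0 days/yr (the spike at 0+epsilon), while
# ~1% do so roughly every other day or more.
n = 100_000
heavy = rng.random(n) < 0.01                     # the ~1% heavy-consumption tail
days_per_year = np.where(
    heavy,
    rng.uniform(150.0, 365.0, size=n),           # roughly every other day and up
    rng.exponential(scale=2.0, size=n),          # ginormous spike near 0 days/yr
)

print(np.quantile(days_per_year, [0.5, 0.9, 0.99, 0.999]))
```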
The “misfit” I meant was that my estimate of total per capita consumption wasn't matching what comes out of the tax records for sales. So it wasn't a “misfit” in the model-checking sense, but a mismatch between my estimation method and someone else's estimate based on different data.
The thing is, the short treedepth=9 exploration showed essentially the same distributions of beverage consumption as the treedepth=14 exploration. It gave me qualitatively identical graphs, essentially the same global estimates of consumption, and so on, except it took 20 minutes instead of 20 hours. Now, one thing to note is that, in some sense, almost all 8000 parameters are nuisance parameters (they describe individual behavior). The thing I really care about is a series of 7 kernel-density plots for the population distribution of these parameters.
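Something like the following, if the model were fit through CmdStanPy; the file names, the data file, and the parameter name `pop_mu` are placeholders standing in for the actual model and its 7 population-level parameters.

```python
import arviz as az
import matplotlib.pyplot as plt
from cmdstanpy import CmdStanModel

# Placeholder model/data names; "pop_mu" stands in for the 7 population-level
# parameters whose kernel-density plots are the actual quantity of interest.
model = CmdStanModel(stan_file="consumption_model.stan")
fit_shallow = model.sample(data="consumption_data.json", chains=4, max_treedepth=9)
fit_deep = model.sample(data="consumption_data.json", chains=4, max_treedepth=14)

# Overlay the kernel-density estimates from the two runs; if the shallow run
# is good enough for model-building, these curves should be nearly identical.
az.plot_density(
    [az.from_cmdstanpy(fit_shallow), az.from_cmdstanpy(fit_deep)],
    var_names=["pop_mu"],
    data_labels=["max_treedepth=9", "max_treedepth=14"],
)
plt.show()
```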
If I get 7950 of them to be well mixed and 50 of them are kinda stuck… it doesn't really change the population-level estimates of interest, and these things have long tails, so it would be surprising if, out of 8000, none of them got stuck…
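A rough way to check that “7950 fine, 50 stuck” picture, reusing the fit object from the sketch above; `theta` is a placeholder name for the ~8000 individual-level parameters.

```python
import arviz as az
import numpy as np

# "theta" is a placeholder for the ~8000 individual-level parameters;
# fit_shallow is the CmdStanPy fit from the previous sketch.
idata = az.from_cmdstanpy(fit_shallow)
rhat = az.rhat(idata, var_names=["theta"])["theta"].values

n_stuck = int(np.sum(rhat > 1.05))
print(f"{n_stuck} of {rhat.size} individual-level parameters have R-hat > 1.05")
# A few dozen stuck parameters out of 8000 shouldn't move the population-level
# density estimates much, but zero stuck parameters would be suspicious given
# the long tails.
```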
A similar thing occurred with the previous analysis, a survey analysis of financial expenses across 2500 microdata regions. Again, a bunch of nuisance parameters that only matter because they feed into a population estimate within each region.
Typically, what happens in these problems is that my stepsize crashes down; the first 10 or 20 iterations finally make it to the typical set after something like 160,000 gradient evals and an hour of computing; and then I max out my treedepth of 14 on every iteration after that until the run hits my requested number of iterations and stops. The mass matrix NEVER winds up different from 1 on the diagonal, so adaptation isn't even really occurring.
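All of those symptoms are visible in the sampler output. Here's how I'd check them in the CmdStanPy setup sketched earlier (the fit object name is again a placeholder):

```python
import numpy as np

# Pull the sampler diagnostics from the CmdStanPy fit used above.
sampler = fit_deep.method_variables()
treedepth = sampler["treedepth__"]        # shape: (post-warmup iterations, chains)

print("adapted step size per chain:", fit_deep.step_size)
print("fraction of post-warmup iterations saturating treedepth=14:",
      float(np.mean(treedepth >= 14)))
print("adapted diagonal mass matrix, min/max:",
      float(np.min(fit_deep.metric)), float(np.max(fit_deep.metric)))
# Step size pinned near zero, nearly 100% treedepth saturation, and a mass
# matrix that never leaves 1.0 on the diagonal are exactly the pattern
# described above.
```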
I find the whole thing confusing. But especially for model-building, where VB (variational Bayes) isn't good enough to test the model, limiting the treedepth may get me to the point where I can assess the quality of the model an order of magnitude or two faster than long-treedepth exploration. For models I don't necessarily believe yet, it makes no sense to me to do a whole day of computing only to find out that I should have realized I needed to add a parameter describing effect foo… or collapse the individual-level parameters down to population-average ones, and run the whole thing again.
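The order-of-magnitude claim is mostly just the cost-per-iteration arithmetic: an iteration that saturates the treedepth limit costs on the order of 2^treedepth gradient evaluations, so capping at 9 instead of 14 cuts the per-iteration work by roughly a factor of 32 when every iteration saturates, which accounts for most of the 20-minutes-versus-20-hours gap above.

```python
# Rough cost arithmetic: a saturated NUTS iteration costs on the order of
# 2**max_treedepth gradient evaluations, so capping at 9 instead of 14 cuts
# per-iteration work by roughly this factor (the rest of the observed gap
# comes from warmup behavior and the like).
print(2 ** 14 / 2 ** 9)   # 32.0
```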
Once I get a model I believe, I can certainly see that re-running with very long treedepth might make sense to avoid anything screwy occurring. But I still don't understand why my treedepth so often blows out and my stepsize crashes… maybe I just like to fit wacky-hard models (actually, yes, I know this is true).