Here’s how the current model that is giving me fits works:
It estimates how people in the US actually consume alcohol, using a CDC/Census dataset of 2,000 telephone interviews downsampled from a full sample of roughly 500,000 interviews. It models each person’s behavior as a 7-entry Dirichlet vector giving the frequency of days with 0, 1, 2, 3, 4, 5, or 6+ drinks, plus a per-person quantity describing how much they drink on the 6+ days.
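As a generative sketch of that per-person parameterization, here is how one behavior vector could be simulated. The concentration parameters are hypothetical, made up purely for illustration; the real prior lives in the model.

```python
import random

def rdirichlet(alpha):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

# Hypothetical concentration: heavy mass on 0-drink days,
# thinning out toward the 6+ category.
behavior = rdirichlet([10.0, 2.0, 1.0, 0.5, 0.3, 0.2, 0.1])

quantity_6plus = 9.0  # hypothetical per-person amount on 6+ days
```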
Then it has some global parameters describing measurement errors, biases, etc., so that you can predict how a person will answer the questions given their underlying behavior vector.
Each person’s Dirichlet vector gets a mixture prior over 7 basic modes of drinking behavior: “very light, light, medium, binge, heavy, very heavy, other”. The mixture weights get their own Dirichlet vector with a simple Dirichlet prior.
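A generative sketch of that hierarchy, with one caveat: drawing a discrete mode per person is only a simulation convenience here, since in Stan the mixture would typically be marginalized. All concentration vectors below are invented numbers, not the model’s hyperparameters.

```python
import random

def rdirichlet(alpha):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

# Made-up concentration vectors for the 7 modes of drinking behavior.
MODES = {
    "very light": [30, 2, 1, 0.5, 0.2, 0.1, 0.1],
    "light":      [20, 5, 3, 1, 0.5, 0.2, 0.1],
    "medium":     [10, 5, 5, 3, 1, 0.5, 0.2],
    "binge":      [15, 1, 1, 1, 2, 3, 5],
    "heavy":      [5, 3, 4, 5, 5, 4, 3],
    "very heavy": [1, 1, 2, 3, 4, 5, 10],
    "other":      [1, 1, 1, 1, 1, 1, 1],
}

# The mixture weights get their own Dirichlet with a flat prior.
weights = rdirichlet([1.0] * 7)

# One person: pick a mode by weight, then draw their behavior vector
# from that mode's Dirichlet.
mode = random.choices(list(MODES), weights=weights, k=1)[0]
person = rdirichlet(MODES[mode])
```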
I derive an average consumption per day for each person in the transformed parameters. I also have a global sales-per-person-per-year estimate from a separate survey, and the overall average across my sample of 2,000 enters a likelihood for observing that global sales number.
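The transformed-parameters calculation can be sketched like this, under the assumption that a “6+” day contributes the person’s reported quantity q:

```python
# Expected drinks per day implied by a behavior vector p (day
# frequencies for 0..5 drinks and 6+) and q, the typical amount
# consumed on a 6+ day.
def mean_drinks_per_day(p, q):
    return sum(k * p[k] for k in range(6)) + p[6] * q

# Example: 2 drinks on half of all days, never more.
p = [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]
assert mean_drinks_per_day(p, 8.0) == 1.0

# The quantity that faces the national sales number is the average
# over the whole sample, scaled to a year.
people = [(p, 8.0), ([0.9, 0.05, 0.05, 0.0, 0.0, 0.0, 0.0], 8.0)]
avg = sum(mean_drinks_per_day(pi, qi) for pi, qi in people) / len(people)
annual_per_person = 365.0 * avg
```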
The end result is that I get a behavior estimate for each person in the sample. The fact is, though, I don’t really care that much about individual behaviors; I care about the distribution of consumption across the 2,000 people, accounting for known measurement issues. A big problem with all of this is that people are known to do a very poor job of estimating their own alcohol consumption; some people misunderstand the question in obvious ways, and some report inhuman consumption (say, 70 drinks a day).
Without a measurement error model you go very far wrong by simply believing people’s responses, but even with a measurement error model it’s still hard to explain the disparity between survey answers and national sales data.
The fact is, though, that inaccuracy in estimating the behaviors of a few people, say 10 or even 50 out of 2,000, really doesn’t make much difference in the final results. The graphs all look qualitatively the same, and overall averages, percentile curves, etc. all have similar shapes. But with so many weird behaviors (1% of people really do drink basically every day, averaging around 12 drinks a day, while huge numbers of people drink 0 drinks on 20 to 30 days a month), some of these dimensions will inevitably have regions that are tricky to sample. In this model, the “pairs” plot would produce a 14,000 × 14,000 grid, roughly 2 × 10^8 panels, so it’s not particularly useful to attack things in that way.
What really matters here is the behavior of the measurement error model. Does it do a good job of imputing something within a reasonable range for each person? For the most part, just taking the posterior mean for each person can be enough; it just needs to be one value per person, corrected for measurement issues.
Anyway, I think Stan does a FANTASTIC job with models that have 5 to 10 parameters, and in those cases you absolutely need exact convergence to the posterior distribution over those 5 or 10 parameters. With micro-data inferences involving thousands of parameters on thousands of individual people, no single parameter is very important, and so in some sense point-wise convergence of the distribution is a lot to ask. I’d be happy with convergence to a posterior whose high-probability region is the same typical set as that of my actual posterior: for example, convergence to the distribution that’s uniform over the set where my real lp is within epsilon of the mean lp. The relative weights within the typical set matter far less, because the typical set is knife-edge thin anyway, and since the inferences about individuals are not all that important, errors in those inferences don’t affect the inferences about the more important higher-level quantities.
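One way to write the relaxed target described above down (my notation, nothing official):

```latex
% The epsilon-thickened typical set of the posterior p(\theta \mid y):
A_\epsilon = \left\{ \theta : \left| \log p(\theta \mid y)
  - \mathbb{E}_{\theta' \sim p(\cdot \mid y)}\left[ \log p(\theta' \mid y) \right] \right| \le \epsilon \right\}
% and the relaxed target is the uniform distribution on A_\epsilon.
```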
If I could get a large uniform-over-the-typical-set sample in very little time, subsampling it in proportion to the density would give me a very, very good sample. But VB doesn’t do that, and Stan in some sense tries too hard, requiring each and every energy-level exploration to solve Hamilton’s equations to nearly double-precision floating point accuracy, which is a lot more than is necessary to get that “uniform-over-the-typical-set” sample, I think.
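The “subsample in proportion to the density” step is just importance resampling; a minimal sketch, assuming you already have roughly-uniform draws over the typical set and a function `lp` returning the log density:

```python
import math
import random

def resample_by_density(draws, lp, n):
    """Resample draws with weights proportional to exp(lp(draw)),
    turning a roughly uniform sample over a set into an approximate
    sample from the density restricted to that set."""
    logw = [lp(d) for d in draws]
    m = max(logw)                        # subtract the max for stability
    weights = [math.exp(l - m) for l in logw]
    return random.choices(draws, weights=weights, k=n)

# Toy usage with a 1-D standard normal log density:
uniform_draws = [random.uniform(-3.0, 3.0) for _ in range(10000)]
posterior_like = resample_by_density(uniform_draws, lambda x: -0.5 * x * x, 1000)
```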
Not that Stan isn’t doing a great job of what it’s doing; it’s just that for certain classes of high-dimensional models it may be trying a lot harder than necessary, given concentration of measure. Adding some “wiggle room” to the exploration, in a way that fuzzes the sample out toward “uniform-on-the-typical-set”, would I think require a lot less computing, but I just don’t know.