Asymptotic computational complexity of HMC for a simple GMM tutorial - is it, to a first approximation, O(P^3 * N)? (P = number of parameters, N = number of samples)

Dear fellow researchers,

I plan to use the code example in this 4-D GMM “tutorial” (which uses a toy dataset of 90 points)
to fit the same GMM to a similar toy dataset of 10,000-100,000 samples.

Will the large sample size be a deal-breaker for doing that?

I had a very quick look at https://arxiv.org/pdf/1701.02434.pdf to get an answer to this question, and it seems that the computational complexity scales linearly with the number of data points - is that correct? On the other hand, it seems to scale cubically with the number of parameters (due to the need to invert a matrix whose size is the number of parameters).

Am I reading that paper correctly? I am a bit new to Stan / MCMC for Bayesian inference, but I have a theoretical solid state / statistical physics background and have done a lot of large-scale computer simulations (MC and MD) for modelling amorphous semiconductors and crystal growth, so the topic - i.e. HMC sampling - looks extremely familiar. Still, I only spent 20 minutes skimming that paper, trying to answer the question of what “equation” describes the asymptotic computational complexity of HMC Bayesian inference, and it seems that it is, to a first approximation, O(P^3 * N), where P is the number of parameters and N is the number of samples.

So if the above tutorial code example takes 10 seconds to get a fit for 100 data points, then with linear scaling in N it should take about 1,000 seconds for 10,000 data points. Is that correct? Do I get this right?
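Spelling out my back-of-the-envelope arithmetic, assuming the cost really is linear in N:

t_{10\,000} \approx 10\ \text{s} \times \frac{10\,000}{100} = 1\,000\ \text{s}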

Regards,

Jozsef

The complexity is something like the number of gradient evaluations multiplied by the complexity of evaluating the gradient for the log posterior kernel that you define. The latter can be more or less severe depending on the model. For that example, you actually don’t have to do a Cholesky decomposition because the Cholesky factors of the correlation matrices are primitives, so it is basically just lower-triangular solves, which I imagine is implemented in Eigen to depend on the square of the number of observations.

It is difficult to answer your question in general because for NUTS, the number of gradient evaluations is not known in advance and depends on the posterior (which depends on the model as well as the data). Specifically, the treedepth before it U-turns is a random variable, but if it were a constant, then the number of gradient evaluations per MCMC iteration would be 2^treedepth - 1. So, you get big differences in the number of flops depending on whether it can get around the typical set with a treedepth of 3 or 4 vs. 8 or 9.
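To make that concrete, here is a rough back-of-the-envelope sketch (plain Python, assuming a constant treedepth and a fixed per-gradient cost; none of these numbers come from Stan itself) of how the total work decomposes into iterations × gradient evaluations per iteration × cost per gradient:

```python
# Back-of-the-envelope NUTS cost model (illustrative assumptions only).

def gradient_evals_per_iteration(treedepth: int) -> int:
    # If the treedepth were constant, one NUTS iteration would cost
    # 2**treedepth - 1 gradient evaluations.
    return 2**treedepth - 1

def estimated_cost(n_iterations: int, treedepth: int, cost_per_gradient: float) -> float:
    # cost_per_gradient is the model-dependent cost of one log-posterior gradient,
    # e.g. roughly proportional to N for an IID likelihood.
    return n_iterations * gradient_evals_per_iteration(treedepth) * cost_per_gradient

# Example: 2000 iterations, gradient cost proportional to N = 10_000 data points,
# comparing a treedepth of 4 with a treedepth of 9.
for depth in (4, 9):
    print(depth, estimated_cost(2000, depth, cost_per_gradient=10_000))
```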


Many thanks, Ben!

Very interesting and insightful answer!

I think I might be missing some important point in trying to understand your answer.
You wrote that the triangular solves “depend on the square of the number of observations”.

So it goes with O(N^2)? The gradient evaluation? (N is the number of observations/data-points/patients/customers/soccer matches/proton-proton collisions at the LHC/etc.)

I might be mixing things up, but the dimension of the correlation matrix is not N but P, right? I admit I have only read the HMC paper very superficially, so I may have missed something fundamentally important.

While writing this reply, I did a bit more reading on HMC: I read pages 3 and 22 of this paper: https://arxiv.org/pdf/1206.1901.pdf and my intuition is that the correlation/covariance matrix related to the gradient evaluation in question is P-dimensional, not N-dimensional - am I wrong?

Am I missing some important point? Quite possibly, I am really not sure.

Cheers,

Jozsef

The free elements of a covariance matrix are among the parameters (what you are calling P). There are multiple covariance matrices and other parameters as well, but each triangular solve is going to be quadratic in its size.

OK, so if I understand you correctly, then - to a first approximation - with 1,000 data points, generating one sample takes half as many FLOPs as it does with 2,000 data points. Do I get this right?

So the computational cost of generating one MC sample is linear in the number of data points, right?

That is more approximate than a first approximation (for NUTS). Having twice as many data points can easily affect the 2^treedepth, although it can make it larger or smaller.

As Ben notes, the cost of HMC is proportional to the cost of the gradient calculation times the number of gradient calculations needed to generate a new proposal. The former scales in readily-analyzed ways with the amount of data, but the latter does not. See “Chains stuck when use larger dataset, but not smaller” for some discussion. Exchangeable Gaussian mixture models are particularly poorly identified, and some of the pathologies can be amplified as more data is introduced.

Thank you for your answer. Where can I find information on these “ways”?
Is it O(N), O(N^2), O(N^3)?

Cheers

Jozsef

It’s standard complexity analysis of the density calculation implied by your model. Some models, like IID likelihoods, are \mathcal{O}(N), some, like Gaussian processes, are \mathcal{O}(N^{2}), and in general they can be arbitrarily bad.

The whole point of Stan is that you specify your own model, which will determine performance, predictive power, etc.
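For intuition on what “scales with N” means here, a small NumPy sketch (my own illustration, not Stan's internals; the function names are made up) contrasting an IID normal log-likelihood, which is a single O(N) pass over the data, with a GP-style log-likelihood that has to work with an N x N covariance matrix:

```python
import numpy as np

def iid_normal_loglik(y, mu, sigma):
    # One pass over the data: O(N).
    return np.sum(-0.5 * ((y - mu) / sigma) ** 2) - len(y) * (np.log(sigma) + 0.5 * np.log(2 * np.pi))

def gp_loglik(y, K):
    # Works against an N x N covariance: the triangular solve alone is O(N^2),
    # and refactoring K (as done here) is O(N^3).
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L, y)
    return -0.5 * alpha @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

N = 500
y = np.random.default_rng(0).normal(size=N)
print(iid_normal_loglik(y, 0.0, 1.0))
print(gp_loglik(y, np.eye(N)))  # identity covariance, just so the example runs
```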

Oh I see, so basically the scaling with N is simply how the density calculation scales with N?

So basically the total complexity is something like:
density = f(X, p)
where X is the data, with N points, and p is the parameters.
If calculating f(X, p) takes a constant Z flops and in total Stan needs to call f M times, then the total number of flops is M*Z.

So if the complexity of f is linear, then for 2*N data points - assuming that Stan again calls f M times - the total number of flops is 2*M*Z?

Is that a correct understanding?

So the total flops is “flops needed to evaluate f” * “number of evaluations of f by Stan”.

Is that correct?

The gradient calculation will scale in this way, but the number of gradient calculations will in general also depend on the structure of the data, as discussed above, making the overall scaling much harder to figure out for bespoke models.

Yes, that I understand, but first I wanted to make sure I get the ‘easy part’ right.

Basically Stan is a sophisticated, high-dimensional “integrator”.

Here comes a more-or-less related follow-up question, just so I can get a basic understanding of the MCMC ‘business’ and put this ‘integration’ into the right place in my head:

Is it so that Stan “simply” calculates P(B) (in the formula below, taken from here)? I give P(B|A) and P(A), and to get P(A|B) I “only” need to know P(B) (in physics this is called Z, the partition function, if I remember correctly). Is this right? I mean, once I know P(B) I know everything, right?

Of course, I might then want to calculate some conditional expectation value (given B) of some function, which requires integrating over A; Stan can do that as well, but in the first place it calculates P(B), and after that P(A|B) can be used to calculate all sorts of conditional expectations (by integrating over A). Is this a correct understanding?

P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}

This is on the right path. Stan does not estimate Z (aka the marginal likelihood or the evidence), though. What Stan does is to sample directly from P(A | B) such that one can then compute expectations of the form I_p[f] = \int_{\mathcal{A}} P(a | B) f(a) da.

Take a look at this talk by @betanalpha for more intuition behind Stan as a “high dimensional integrator”.
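As a minimal illustration (the draws below are simulated stand-ins, not output from an actual Stan fit): once you have posterior draws from P(A | B), any expectation of that form is just an average of f over the draws, and P(B) never enters:

```python
import numpy as np

# Pretend these are S posterior draws of a scalar parameter "a";
# here they are simulated purely for illustration.
rng = np.random.default_rng(0)
posterior_draws = rng.normal(loc=1.5, scale=0.3, size=4000)

# Monte Carlo estimate of E[f(a) | B] without ever knowing P(B).
f = lambda a: a**2
print(np.mean(f(posterior_draws)))  # approximately 1.5**2 + 0.3**2 = 2.34
```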

Thanks, I need to wrap my head around this a bit. Intuitively I can understand this “integrator” if P(a|B) is constant.

What makes me wonder is how da is “handled”/represented in Stan: I guess I can think of da as a voxel in an equally spaced, high-dimensional grid.

So if Stan generates N samples and a given voxel a contains M(a) of them, then P(a|B)\,da \approx M(a)/N, right?

This makes me wonder how it is possible to calculate P(B). Considering P(B) = P(B|A)*P(A)/P(A|B), we can get an estimate of P(B) from each voxel da, and the best (?) estimate is the mean of all these estimates: sum them up and divide by the number of terms in the sum.

Of course, it will not be possible to estimate P(B) from every da voxel, because some voxels will have no samples in them, but those are simply excluded from the calculation of the mean.

This makes me suspect that Stan is probably not working directly with voxels, or is it? I can imagine it either way.

Nevertheless, I need to look into that video before making too many “educated” guesses, thanks for the tip!

Sounds like you need to understand Monte Carlo integration. Others may have better resources, but the linked blog post should be enough to get you on the right track.
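For a toy version of that idea (my own sketch, not taken from the blog post): a histogram over equally spaced “voxels” and a plain average over the draws give essentially the same estimate of an expectation, but the plain average never needs a grid at all:

```python
import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(size=100_000)        # stand-in for posterior samples
f = lambda a: np.sin(a) ** 2

# Direct Monte Carlo estimate: no grid, no voxels.
direct = np.mean(f(draws))

# "Voxel" estimate: bin the draws and weight f at each bin centre by M(a)/N.
counts, edges = np.histogram(draws, bins=200)
centres = 0.5 * (edges[:-1] + edges[1:])
voxel = np.sum(counts / len(draws) * f(centres))

print(direct, voxel)  # the two estimates agree closely
```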

Thank you! This is super good! I'll give it a quick read; that will clarify stuff!

I think this is the voxel analogy I was trying to make :)


This is the bit which was a bit … not immediately intuitive … but the voxel analogy works here pretty nicely, I think, on an intuitive level. I think I get the rest, on an intuitive level :)