Scaling up parameters and observations


#1

Are there results online somewhere that show how the consumption of CPU time and RAM increases as the number of parameters in a model increases? … as the number of observations in the data increases?

I can try it myself on a specific model of interest but would like to know about any existing results I can refer to.


#2

I do not think one can answer in general.
The glib answer is to invoke the curse of dimensionality. A brute-force search over one parameter means evaluating, say, N candidate values in one dimension; with two parameters that becomes N^2 combinations, and so on.
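Spelled out: with N candidate values per parameter and d parameters, a naive grid search needs

$$N^{d} \ \text{evaluations}, \qquad \text{e.g. } N = 100,\; d = 3 \;\Rightarrow\; 100^{3} = 10^{6}.$$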
This is not so helpful in practice. If your parameters are nearly independent, gradient-based methods will work well. If your parameters have ugly dependencies and there is nasty curvature in your probability surface, any method will constantly have to adjust its overall direction. Even the clever NUTS sampler will take a while to negotiate a nasty surface.
At least you can say that memory needs will generally grow linearly with the number of data points. Matrices over the parameters (a K x K covariance matrix, say) are pretty common, so memory can grow quadratically in the number of parameters, but matrices over the data points are much less common.


#3

For RAM it’s relatively straightforward (one of the other devs will know better if you need a more exact answer): there’s only one copy of the data and one copy of the parameters in the sampler, so the scaling there is linear. I’m not sure how much ends up getting set up for the auto-diff stack, but that’s heavily model-dependent. If you’re streaming output to disk (CmdStan) you don’t even make a second copy of the parameters; otherwise you’re allocating enough storage for n_samples x n_stored_parameters.
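As a back-of-the-envelope illustration of that linear scaling, here is a rough estimator (my own sketch with made-up argument names, assuming 8-byte doubles and ignoring the auto-diff stack and C++ container overhead, so treat it as an order-of-magnitude lower bound):

```python
def rough_ram_bytes(n_data_values, n_parameters, n_samples, n_chains=1,
                    store_draws=True):
    """Order-of-magnitude RAM estimate for MCMC, assuming 8-byte doubles.

    Ignores the auto-diff stack and any model-specific workspace, so this is
    only a lower bound on actual memory use.
    """
    bytes_per_value = 8
    # one copy of the data per chain when chains run in parallel
    data = n_chains * n_data_values * bytes_per_value
    # one working copy of the parameters per chain inside the sampler
    params = n_chains * n_parameters * bytes_per_value
    # stored draws, unless output is streamed to disk as CmdStan does
    draws = (n_chains * n_samples * n_parameters * bytes_per_value
             if store_draws else 0)
    return data + params + draws

# e.g. 1e6 data values, 1e4 parameters, 1,000 kept draws, 4 chains
print(rough_ram_bytes(1_000_000, 10_000, 1_000, n_chains=4) / 1e9, "GB")
```

Once you keep many draws, the n_samples x n_stored_parameters term usually dominates.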

For CPU it depends heavily on two things:

1) Posterior geometry: anywhere from infinite time all the way down to sampling more or less like a standard MVN. For a given parameter/data-set size this is convenient, because you can run a standard MVN model of the same size to see the best-case scenario.

2) The size of the auto-diff stack: this is where vectorization matters, since many of the C++ functions avoid creating extra vars on the stack and calculate gradients at the same time as values, so they avoid extra work.

If you can reparameterize so that your model looks more like a standard MVN to the sampler, and write otherwise efficient code (vectorized, avoiding matrix inverses and the like), you’ve gotten it about as fast as possible.
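One standard example of such a reparameterization (my example, not something named in the post above) is the non-centered parameterization of a hierarchical model. Instead of sampling the group-level effects directly,

$$\theta_i \sim \mathcal{N}(\mu, \tau),$$

you sample standardized effects and rescale them,

$$\tilde{\theta}_i \sim \mathcal{N}(0, 1), \qquad \theta_i = \mu + \tau\,\tilde{\theta}_i,$$

which presents the sampler with something much closer to a standard MVN when the data are weak.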


#4

Depends on what you’re doing, of course.

You need memory for the data for all of the chains if you’re running in parallel. That’s easy to compute as the C++ storage for it has negligible overhead.

Each iteration, you have to compute multiple leapfrog steps to simulate the Hamiltonian dynamics. The number of such steps is determined by the geometry of your problem, which can get better or worse as data size grows and with model misspecification (badly fitting models are much harder to fit; the Folk Theorem in action).
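For reference, a simulated trajectory of leapfrog steps looks roughly like this (a minimal Python sketch, not Stan’s implementation); the point is that CPU time per iteration is roughly the number of leapfrog steps times the cost of one gradient evaluation:

```python
def leapfrog(q, p, grad_log_density, eps, n_steps):
    """Simulate a Hamiltonian trajectory with the leapfrog integrator.

    q: parameter vector, p: momentum vector.  Each step costs roughly one
    gradient evaluation of the log density, since the gradient at the end of
    one step is reused at the start of the next.
    """
    g = grad_log_density(q)
    for _ in range(n_steps):
        p = p + 0.5 * eps * g       # half step for momentum
        q = q + eps * p             # full step for position
        g = grad_log_density(q)     # the expensive part: one gradient per step
        p = p + 0.5 * eps * g       # second half step for momentum
    return q, p

# toy usage: standard normal target, whose log-density gradient is -q
q, p = leapfrog(q=1.0, p=0.5, grad_log_density=lambda q: -q, eps=0.1, n_steps=20)
```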

Each leapfrog step computes the density and the gradient w.r.t. parameters. This is usually fairly constant in memory and time, but if there are nested iterative algorithms, like ODE solvers, then where you are in parameter space can matter a lot. For instance, some portions of parameter space may render the ODE stiff and kill solver performance. For storage, you need to store the entire expression graph (a node for each operation, an edge for each operand to the operation). You also need to store about log N copies of the parameters, where N is the number of leapfrog steps required, but this is usually negligible compared to other overhead.

Then in R, I think you need about twice the space required to store all of the draws in memory.


#5

D’oh, I can’t believe I forgot that.


#6

Thanks to all who responded with such interesting details. The answer is, “It depends,” but the details will help me understand performance as I work with each model type.