I use a Stan model to do inference over hundreds of thousands of small data parcels individually. I.e. for each parcel, the model runs and reports, and I run as many fits in parallel as I can for speed. This is unsurprisingly slow; however, it does seem a lot faster than coding the data into a single model with N groups that share no parameters. I guess this is also unsurprising?
My main question is: given the context above, are there any recommendations for how I might go about making the repeated application of the same model very many times faster? I can’t use optimizing in this instance (no continuous gradient) and vb is far too unreliable.
Hm, I have a hard time believing that your data source doesn’t induce any similarity in the parameters associated with these hundreds of thousands of data sets. Also, have you tried coding a version treating the data sets independently but in a single model? There might be some speedups thanks to vectorized computations that you can take advantage of there. But also try to consider a non-independent model where there is partial pooling of information for your parameters.
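To make the partial-pooling suggestion concrete, here is a minimal sketch (my notation, not the poster's model): a normal-normal shrinkage estimate that pulls each group mean toward the grand mean, weighted by group size. The within-group variance `sigma2` and between-group variance `tau2` are assumed known here purely for illustration; in Stan they would be parameters.

```python
def partial_pool(groups, sigma2=1.0, tau2=1.0):
    """Shrink per-group means toward the grand mean (toy normal-normal model).

    groups: list of lists of observations
    sigma2: assumed within-group variance
    tau2:   assumed between-group variance
    """
    all_obs = [y for g in groups for y in g]
    mu = sum(all_obs) / len(all_obs)                 # grand mean
    pooled = []
    for g in groups:
        n = len(g)
        ybar = sum(g) / n                            # raw group mean
        w = (n / sigma2) / (n / sigma2 + 1 / tau2)   # precision weight
        pooled.append(w * ybar + (1 - w) * mu)       # shrunk estimate
    return pooled
```

Groups with little data get pulled hard toward the grand mean; data-rich groups stay close to their own mean — that sharing of information is what an independent-fits pipeline gives up.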
Yup, I have. It’s an order of magnitude slower than running each model independently. Similarly, adding common parameters between groups is several times slower than fitting each group with independent parameters. IMO, this is going to be the case because common parameters increase the interaction between groups and essentially tie all the parameters together into a single complicated gradient.
The model evaluates unseen image segments to determine whether each is a specific type of landmark or not: the vast majority aren’t but some are. It does this by fitting a mixture model and then checking the fitted parameter values against a prior for the valid landmarks.
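For readers unfamiliar with the pattern, here is a hedged sketch of the general idea: a toy 1-D two-component Gaussian mixture fit by plain EM, then a decision based on whether a fitted mean falls in a prior-plausible range. The function names, thresholds, and deterministic initialisation are all mine, not the poster's actual model.

```python
import math

def em_two_gaussians(data, iters=50):
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    s = sorted(data)
    half = len(s) // 2
    # deterministic init: split the sorted data in half
    mu = [sum(s[:half]) / half, sum(s[half:]) / (len(s) - half)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: per-point responsibilities under each component
        resp = []
        for x in data:
            p = [pi[k] * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            tot = p[0] + p[1]
            resp.append([p[0] / tot, p[1] / tot])
        # M-step: reweighted parameter updates
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return pi, mu, var

def looks_like_landmark(mu, lo=4.0, hi=6.0):
    """Hypothetical decision rule: classify as a landmark if a fitted
    component mean lands in an assumed prior-plausible range."""
    return any(lo <= m <= hi for m in mu)
```

The Bayesian version fits the mixture per segment and compares posterior parameter values against the landmark prior, which is why there is one fit per parcel rather than one big model.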
Ok, well, the best I can recommend then is to take a look at the reparameterizations in the manual and see if you can eke out some performance that way, then look at MPI (if you can spend $$ for access to lots of cores) and/or GPU acceleration.
PyStan or RStan?
How long does it take to sample (Stan) vs FrontEnd timing (Python/R)?
How do you parallelize your runs?
I use cmdstanpy so I can use 2.21 under the hood. There’s some overhead due to serialisation and spawning a process for every run, but it’s < 1% of the total run time. So 99% or so of the time is spent sampling.
I parallelise runs with joblib. Basically a parallel foreach. CPU utilisation across 8 cores is ~95%.
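The pattern looks roughly like this — sketched here with the stdlib `concurrent.futures` as a stand-in for joblib (joblib's version is `Parallel(n_jobs=8)(delayed(fit_parcel)(p) for p in parcels)`). `fit_parcel` is a placeholder; the real version would call `model.sample(data=...)` in cmdstanpy. Threads suffice for the sketch because each cmdstanpy fit already runs CmdStan in its own OS process.

```python
from concurrent.futures import ThreadPoolExecutor

def fit_parcel(parcel):
    """Placeholder for one per-parcel fit; a real pipeline would build the
    data dict and call model.sample(data=parcel, ...) here."""
    return ("fitted", parcel)

def fit_all(parcels, workers=8):
    """Parallel foreach over parcels, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_parcel, parcels))
```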
I’m not using map_rect because there’s no spare CPU, although it may be more efficient than parallelising runs. I’m not using GPU because it isn’t relevant for my mixture model.
I decided to remodel the problem so that I could use a gradient-based optimizer to solve for the final parameters. It meant spending all night with Maxima working out an approximation with a closed-form Jacobian and Hessian, but we got there :-). It isn’t as good, but it went from processing 1000 parcels in 6 minutes to 1000 parcels in 5 seconds.
You really appreciate how awesome Stan’s NUTS/HMC sampler is when you try to replicate its results by other means.