Parallel computation of chains over nodes of a cluster; parallel computation of the log likelihood within a node

Dear Stan Team,
I am working on a larger dataset of different time series from experiments. With a larger number of observables, the maximum calculation time of 8 days on my cluster runs out. I would like to profile whether my assumption holds that the time spent calculating the likelihood is the bottleneck of my Stan model (since there are many data points, each requiring some matrix inversions, multiplications, etc.). It is mathematically obvious that one could parallelize those for loops over each time trace across different CPUs while calculating each chain on a different node.

Are there any hello-world examples or tutorials for these issues? I haven’t found helpful documentation on manipulating the C++ code of a Stan model. The PyStan stanc function gives me some C++ code for a class, but I don’t really find a starting point for understanding this class.

In particular, where do I find a main function in which I can trace program blocks like the model {} block and the transformed parameters {} block?

Where do I find the C++ code that distributes the chains to different cores, which I would like to change so that they are distributed over the nodes of the cluster?

Are there ways yet to calculate the trajectory in parameter space in parallel for a model with hidden state-space properties, such as a hidden Markov model or a Kalman filter?

Thanks for your comments and hints in advance.

Jan

Before you start with parallel stuff, and before you start running models for long periods of time, you’ll want to do some work with simulated data to make sure your model is doing what you think and doing it efficiently. If you’re curious what I mean by that: https://www.youtube.com/watch?v=ZRpo41l02KQ&list=PLuwyh42iHquU4hUBQs20hkBsKSMrp6H0J&index=6&t=0s and then https://www.youtube.com/watch?v=6cc4N1vT8pk&list=PLuwyh42iHquU4hUBQs20hkBsKSMrp6H0J&index=7&t=0s

If you’ve already done that and the diagnostics pass (http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html), then head to the computational stuff. But it’s really hard to make up computationally for things that can be fixed by reparameterizations or scalings or whatnot.

Anyway, for the parallel stuff, the place to start is map_rect (instead of hacking the C++ directly). I dunno if the docs are up, but this is a thread where folks have been working through an example using it: Linear, parallell regression
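If it helps to see the shape of it, here’s a minimal map_rect sketch under some simplifying assumptions: a toy normal likelihood, data already split into K equal shards, and placeholder names (shard_lp, K, N, y) that are not from the linked thread.

```stan
functions {
  // Log-density contribution of one shard (e.g. one time trace).
  // phi = parameters shared by all shards, theta = shard-specific parameters,
  // x_r / x_i = real and integer data packed for this shard.
  vector shard_lp(vector phi, vector theta, real[] x_r, int[] x_i) {
    vector[1] lp;
    lp[1] = normal_lpdf(x_r | phi[1], phi[2]);
    return lp;
  }
}
data {
  int<lower=1> K;      // number of shards (e.g. time traces)
  int<lower=1> N;      // observations per shard
  real y[K, N];        // data, already split into equal-size shards
}
transformed data {
  int x_i[K, 0];       // no integer data needed in this toy example
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  vector[2] phi;
  vector[0] theta[K];  // no shard-specific parameters in this toy example
  phi[1] = mu;
  phi[2] = sigma;
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // each shard's log likelihood can be evaluated on a different core
  target += sum(map_rect(shard_lp, phi, theta, y, x_i));
}
```

Note that map_rect only runs in parallel if the backend is built for it: threading within one machine (the STAN_THREADS compile flag plus the STAN_NUM_THREADS environment variable) or MPI across machines (STAN_MPI); exactly how you switch that on depends on the interface you use.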


Regarding map_rect and the docs, check section 22 of the Stan 2.18 User Guide.

Hey bbbales2, thanks for your immediate help. Maybe a few comments on my workflow.
So far I am just working with simulated data from a toy model in different data settings, so that I know what the right model is. With posterior predictive checks I might easily shrink down the space of one parameter, maybe with a little luck more of them. Since the models will get more complex with real data, and the algorithm scales, if I remember right, as N_states^3, I am definitely going to run into problems. They already occur now with just a slight change in the data.

1.a But in general, does a weakly informative or an informative prior decrease the time needed to sample from the posterior? (Probably yes? But I don’t know enough about HMC to answer that.)
1.b I guess the size of a parameter interval does influence the sampling time!? Or just the warmup?

2. Does it decrease computational time if I use beta distributions instead of uniform distributions as priors, so that at the edges of the interval the probability drops to zero in a continuous fashion? I mean really flat betas…

I’m going to proceed with the other videos and see whether I get further ideas from them.

Does map_rect parallelize over nodes or just over all processors of one node?

Thanks for your help, Mr. bbbales2 and ermeel.

N^3 sounds like a linear solve! What type of model is it?

Tighter priors that include the truth will probably allow the sampler to run faster. Making priors tight when they are not in the right place probably isn’t going to do any good though. This is what the prior predictives are for – trying to put the priors vaguely in the right place and on the right scale with the right amount of regularization to hopefully make the sampling fast.
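In case a concrete example helps, here is a minimal prior predictive sketch; the normal model and the names mu, sigma, and y_sim are placeholders, not the model from this thread.

```stan
// Draw parameters from the priors and simulate fake data, with no likelihood,
// then check whether y_sim lands on roughly the same scale as the real data.
data {
  int<lower=1> N;
}
model {
}
generated quantities {
  real mu = normal_rng(0, 5);            // illustrative priors
  real sigma = fabs(normal_rng(0, 2));
  real y_sim[N];
  for (n in 1:N)
    y_sim[n] = normal_rng(mu, sigma);
}
```

Since there are no parameters to sample, run it with the Fixed_param algorithm (in PyStan that should be algorithm="Fixed_param", if I remember right).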

This can be especially true with hard constraints. If you specify a parameter with a hard constraint and the data implies something that violates that constraint, it’ll run up against the edge of the constraint and have difficulty moving around.

Warmup and sampling both use the same model. They both use the NUTS sampler as well, though in warmup various NUTS parameters are being selected to make the model run efficiently, so you don’t get to treat the warmup draws like a regular MCMC chain.

If you’re trying to mimic a uniform with a beta, probably not. The sampler isn’t actually sampling on a bounded interval if you specify a bounded parameter. Check out the Transformation of Constraints section of the manual.
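For reference, this is the interval transform from the manual: a parameter declared with lower bound a and upper bound b is represented internally by an unconstrained value y, mapped back with the inverse logit, and the log Jacobian is added to the target:

$$
x = a + (b - a)\,\mathrm{logit}^{-1}(y), \qquad
\log\left|\frac{dx}{dy}\right| = \log(b - a) + \log\mathrm{logit}^{-1}(y) + \log\bigl(1 - \mathrm{logit}^{-1}(y)\bigr)
$$

So a uniform prior and a very flat beta on x look almost the same to the sampler after this transform; the beta just adds a couple of extra (cheap) terms to the log density.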

map_rect parallelizes over cores, so they could be cores of one computer, or cores of different computers. I don’t think there’s a huge difference either way.

Hi bbbales,

sorry, Mr. Bales, I wasn’t really in the office the last few weeks. It’s a Kalman filter, which is running partially in parallel now :).

Jan

Ah, if they’re Kalman filters, any chance you have a tridiagonal matrix?

You’ll want to do your solves manually if so. Stan doesn’t have sparse linear algebra stuff in place yet, but doing full matrix solves when stuff is actually sparse is rough.
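For what it’s worth, here is a sketch of the standard Thomas algorithm as a Stan function you could drop into a functions block; the name tridiag_solve and the convention that the sub- and super-diagonals are padded to length n are my choices, not anything from this thread.

```stan
functions {
  // Thomas algorithm for A x = d where A is tridiagonal.
  // a = sub-diagonal (a[1] ignored), b = diagonal,
  // c = super-diagonal (c[n] ignored), all padded to length n.
  // Assumes the usual stability conditions (e.g. diagonal dominance).
  vector tridiag_solve(vector a, vector b, vector c, vector d) {
    int n = rows(b);
    vector[n] cp;    // modified super-diagonal
    vector[n] dp;    // modified right-hand side
    vector[n] x;
    cp[1] = c[1] / b[1];
    dp[1] = d[1] / b[1];
    for (i in 2:n) {
      real m = b[i] - a[i] * cp[i - 1];
      cp[i] = c[i] / m;
      dp[i] = (d[i] - a[i] * dp[i - 1]) / m;
    }
    x[n] = dp[n];
    for (i in 1:(n - 1))
      x[n - i] = dp[n - i] - cp[n - i] * x[n - i + 1];
    return x;
  }
}
```

Forward elimination plus back substitution is O(n), versus O(n^3) for a dense solve, which is where the savings come from when the system really is tridiagonal.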

What would you do with the tridiagonal matrix? Neither my transition matrix nor my observation matrix is tridiagonal. The model is singular… the observation matrix in general is not invertible and has a couple of zeros in it. The dimensions of the observation space are much smaller than those of the state space.

It’s easy to write simple Stan code to solve tridiagonal matrices.

Have you seen the “Gaussian Dynamic Linear Models” thing in the manual? Is that anything like what you’re doing?

It’s section “61.7. Gaussian Dynamic Linear Models”, page 553 in the 2.17.0 manual.

Unfortunately not quite, if I understand y ~ gaussian_dlm_obs(F, G, V, W, m0, C0) correctly.

You assume a constant W for all the data, where W is the covariance matrix of the noise in the state updates, right? In my case W is time dependent, because it depends on the current state and on the parameters that also construct G; only the observation matrix F and the transition matrix G are constant in time.

Looks like it :(. Bummer! Your problem is hard, haha.

Back to your original question, hacking away at the C++ is possible (https://cran.r-project.org/web/packages/rstan/vignettes/external.html, or search around the forums). The danger with this is that a lot of time is spent in the autodiff backward pass, which isn’t as easy to profile as the forward pass.
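One more thought on the Kalman filter itself: since gaussian_dlm_obs assumes a constant W, an alternative to touching the generated C++ is to write the forward filtering recursion directly in the model block, which also makes it easier to see where the time goes. A minimal sketch; the particular form of W_t, the parameters sigma_obs and tau, and all names and dimensions are placeholder assumptions, not the actual model from this thread.

```stan
data {
  int<lower=1> T;           // number of time points
  int<lower=1> p;           // observation dimension
  int<lower=1> q;           // state dimension
  matrix[p, q] F;           // observation matrix, y_t = F * x_t + v_t
  matrix[q, q] G;           // transition matrix, x_t = G * x_{t-1} + w_t
  vector[p] y[T];           // observations
  vector[q] m0;             // prior mean of the initial state
  cov_matrix[q] C0;         // prior covariance of the initial state
}
parameters {
  real<lower=0> sigma_obs;  // observation noise scale (illustrative)
  real<lower=0> tau;        // drives the state noise (illustrative)
}
model {
  matrix[p, p] V = diag_matrix(rep_vector(square(sigma_obs), p));
  vector[q] m = m0;         // filtered mean
  matrix[q, q] P = C0;      // filtered covariance

  sigma_obs ~ normal(0, 1);
  tau ~ normal(0, 1);

  for (t in 1:T) {
    // predict step; W_t may depend on the current state estimate and parameters
    vector[q] m_pred = G * m;
    matrix[q, q] W_t = diag_matrix(square(tau) * (rep_vector(1, q) + m_pred .* m_pred));  // placeholder form
    matrix[q, q] P_pred = G * P * G' + W_t;

    // update step; the log likelihood comes from the one-step-ahead predictive
    vector[p] y_hat = F * m_pred;
    matrix[p, p] S = F * P_pred * F' + V;
    matrix[q, p] K = P_pred * F' / S;   // right division: P_pred * F' * inverse(S)
    y[t] ~ multi_normal(y_hat, S);
    m = m_pred + K * (y[t] - y_hat);
    P = (diag_matrix(rep_vector(1, q)) - K * F) * P_pred;
  }
}
```

In practice you would probably want to symmetrize S (e.g. 0.5 * (S + S')) before multi_normal for numerical stability, and if the different time traces are independent, this per-trace filtering loop is exactly the kind of thing the map_rect discussion above applies to.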