Is there a way of compiling STAN models (through Rstan) using openBLAS or other C++ libraries that have parallelized linear algebra operations, so that I could speed up a run by spreading a costly matrix inversion across multiple processors?

Briefly, my data consist of thousands to millions of independent observations, which I summarize using the sample covariance between samples across observations, and model that covariance as a draw from a Wishart with a parametric scale matrix. This makes everything fast even when the number of observations is very large, which is good, but slow when the number of samples (the dimension of the sample covariance) is large (>100ish). For more background, see my other post here about my model.

Stan doesnâ€™t use BLAS, only Eigen, and parallelizing the Cholesky decomposition on a GPU in Stan is not ready yet.

You can make your model somewhat faster by not calculating L * obs_cov every leapfrog step. Since those are both data, you can calculate L_obs_cov once in transformed data. But using the Wishart PDF is currently slow for large matrices.