This is mostly out of curiosity, but it's also related to grant application season.
How hard or likely would it be to implement some sort of vectorization of code that uses loops? That is, if the compiler figures out that the iterations can be done independently, could the loop be replaced by vectorized code?
So this loop:
for (i in 1:N) {
  real cov_s = base_s + eta_ps[IDp[i]] + eta_ss[IDs[i]];
  real cov_r = base_r + eta_pr[IDp[i]] + eta_sr[IDs[i]];
  real S0;
  real r;
  real pbo_eff;
  if (rows(theta_s) != 0) cov_s += X_s[i] * theta_s;
  if (rows(theta_r) != 0) cov_r += X_r[i] * theta_r;
  if (multiplicative_s == 1) cov_s = exp(cov_s);
  if (multiplicative_r == 1) cov_r = exp(cov_r);
  S0 = inv_logit(cov_s);
  r = cov_r;
  pbo_eff = beta_pbo * (k_eq / (k_eq - k_el)) * (exp(-k_el * time[i]) - exp(-k_eq * time[i]));
  muS[i] = S0 / (S0^beta + (1 - S0^beta) * exp(-beta * r * time[i]))^(1 / beta) - is_pbo[IDs[i]] * pbo_eff;
}
If the code can be optimized so that the generated C++ comes out with one loop, that would be great. But I think the example you gave could result in multiple C++ loops if the compiler cannot figure it out. To give a simpler example, I often see students write something like
vector[K] temp = log(Phi(theta)); // theta is a vector of size K
which results in an elementwise apply over theta that creates a temporary vector; that temporary is then passed to log, and a second elementwise apply takes the logarithms. So, for large K, doing
for (k in 1:K) temp[k] = log(Phi(theta[k]));
could be better.
Part of what makes Eigen magical is that, by producing expression templates, the compiler generates code that makes just one pass over the inputs when you write C++ like theta.array().log().sum();. If stanc3 (or stanc4) could generate Eigen expression templates instead of concrete Eigen vectors and matrices, that could speed things up in some cases.
I’ve been working on that a bit, but it requires some changes in the stan and math repos so that rvalue and a bunch of math functions can take in general Eigen expressions.
We could do that, but the problem is that then we can’t define custom autodiff. We should be able to fix our vectorization of scalar functions like log and Phi to use only one virtual function call via the adjoint-Jacobian-apply pattern.
What we might be able to do is have something like Phi(theta[k]) produce its own expression template. That’d be a huge change, but it’d solve both problems.
To answer Rok’s original question, I think it could do some auto-vectorization, and I have some code that does that (well, a very basic rules-based version that works in some cases) because it was really important for the TFP backend. I think it’s here: https://github.com/stan-dev/stanc3/compare/irt2pl#diff-b49246fd9939f572c3f408f101d79740. It’s not at all done or necessarily even worth using, but it might give you some ideas.
Re: the expression template issues - I think @stevebronder is right and on it! If we just did that stuff we’re always talking about, generalizing the Math library function signatures to take EigenBase (or whatever the correct base class is) and having them also return expression templates (calling .eval() as needed, possibly with compiler support), then I think automatic vectorization would be more of a pure win.
Andrew asked me yesterday to write up a short paper about what major engineering challenges I personally think we need to solve in the future of Stan. I haven’t started yet but I think the most important item on there will be Stan-Math-in-Stan, which would let stanc3 output inlined Math code with Eigen template expressions.
Yes!! I’ve thought about this before but didn’t know if it was coo-coo rambling on my part. Using the Math library as templates for operations would be great. The only terrible part of this idea is working through C++ template deduction to figure out which function should be called.
I think @seantalts is suggesting code generating the math functions in the model and getting rid of the C++ implementations. I’m curious how this’ll work with our reverse-mode specializations. I don’t think it’ll be efficient enough to just template everything at the prim level.