Do we have some expression template type of stuff in stan-math? What I am seeing as a use pattern very often in stan-math is a pattern like this when computing the lpdf or lpmf in vectorized form:

// make an objected which is indexed by i, even if n or lambda is only a scalar and not a vector
scalar_seq_view<T_n> n_vec(n);
scalar_seq_view<T_rate> lambda_vec(lambda);
size_t size = max_size(n, lambda);
// then we loop over the terms
for (size_t i = 0; i < size; i++) {
if (!(lambda_vec[i] == 0 && n_vec[i] == 0)) {
if (include_summand<propto>::value)
logp -= lgamma(n_vec[i] + 1.0);
if (include_summand<propto, T_rate>::value)
logp += multiply_log(n_vec[i], value_of(lambda_vec[i]))
- value_of(lambda_vec[i]);
}
// ...
}

Thus, we will compute, for example, for each element lgamma(n_vec[i] + 1.0) - even if n_vec is just a scalar. Obviously it is not too good for the performance to keep recomputing a function like lgamma over and over again when all we need is a single evaluation. This even gets worse in case whatever scalar is itself a var as we then increase substantially the AD tree in an unneeded way if I am not mistaken.

This is a very common pattern we use. A solution could be to introduce a scalar_seq_view_expr object which would act a bit smarter for the case when there is indeed only a scalar.

… or is this all unneeded, because the compiler is smart enough / this would not buy us performance?

Do we have such a facility already? My intuition tells me that this could buy us quite some extra speed… which I will try with a POC shortly.

Oh boy! I am getting a close to 2x speed improvement when avoiding redundant lgamma calculations as these are anyway always the same. This is for the example as in Parallelization of large vectorized expressions … So this type of optimization will buy us a lot in cases when we do redundant computations as some terms are scalars and we ignore that.

So we go from 1650ms in that example to only 920ms execution time. That’s a lot and I am sure that this type of optimization can be applied a lot. This POC is also in the stan-math branch parallel-lpdf.

One cheap way to get this optimization is to introduce seq or vector view objects which can apply a function. The function would be by default identity, but passing in a lambda function would transform the view using that lambda function. For real n>1 sequences nothing will change while for n=1 scalars the function is only ever computed once. This would be much easier to implement rather than expression templates.

That would solve it, but this type of pattern is all over in the codebase. We should find a principled way for this (for example, the sigma in normal lpdfs is often the very same for all measurements, but we take per observation the log or it’s inverse).

I think we should introduce a VectorViewExpression or something similar which is smart about these things.

We do a lot better by looping over the different sizes and creating vectors of their appropriate length. For an example, see the normal_lpdf implementation. This just looks like it’s been coded inefficiently.

for (size_t i = 0; i < length(sigma); i++) {
inv_sigma[i] = 1.0 / value_of(sigma_vec[i]);
if (include_summand<propto, T_scale>::value)
log_sigma[i] = log(value_of(sigma_vec[i]));
}

Very often sigma will be the same for all observations, but we will brute-force calculate the log and it’s inverse for all elements even if we only need to do this once. For these types of loops we should have something like a VectorViewExpression, I think (something which avoids the loop in the case of a scalar value represented through VectorView).

Yeah… is there a name for what I have in mind (in case it became clear)?

If I’m understanding your observation correctly, you are misreading the code.

That loop is over the size of sigma. If there’s 1 sigma, that happens once. If there are N, it happens N times.

The vector view on inv_sigma allows us to loop over inv_sigma and return the correct cached value.

The code that you originally found should be broken into two loops with a temporary variable to hold the result of the lgamma function. Then use that within the loop. That will allow for only calling lgamma the correct number of times.

There is another optimization that could be done: multiply the value by N if it’s one. But that’s really minor.

Ok, I filed the issue. Studying our codebase with a little more attention to the details I noticed that the VectorBuilder is really close to what I was suggesting above.

… stan-math is quite a complex beast. Having worked for a few years on it, I still discover new things.

Sounds like a great idea. This should be easily doable via the stan-language itself. It would be interesting to see how well a custom poisson_log_approx_lpdf coded in Stan performs vs the internal one. I would guess that the performance gains for large data sets are significant. I will certainly try this out on a given occasion.

Yes, the basic views do that for us. I think the lgamma was just implemented improperly. I thought you figured that out in another thread (or in an issue?).

That’s totally in line with what we’ve seen elsewhere. That’s why I said I think we need to audit a bunch of these implementations if you’re finding efficiency coding errors like these.

We haven’t gotten to that leve of approximation. Part of the problem is we don’t really have the expertise or time to try to evaluate these approximations.

If someone wants to do this and show that it works, it’d be great. I see this kind of thing in the more advanced math libs.

I thought we were going to go to the std lib versions now that we’ve adopted C++11.

Either way, I’d assume, like @sakrejda, that they’ve been well implemented with approximations for various pieces of the domain.

We might be able to get away with less than library accuracy in many cases.

Thanks for pointing out it’s still the Boost version. I still believe the plan’s to swap in the C++ standard library version of all the functions added to C++11.

Don’t worry—I’m very cautious.

We have problems in tails with Phi_approx.

And if we go much below 1e-6 precision for solving ODEs, the HMC sampler starts degrading quickly.