Partly vectorised categorical / Dirichlet model runs slower than non-vectorised model

I’m trying to fit a hierarchical model with a Categorical likelihood and some hierarchical Dirichlet priors. I’ve worked on quite a number of models and have found that vectorising the Dirichlet statements, even though it sometimes makes my code uglier, speeds up sampling considerably.

I’ve now encountered a case where the vectorised version samples around 10-20% slower than the non-vectorised one. I’ve tried to make sense of this but couldn’t; it seems unlikely to me that the overhead of building the array for the vectorised statement could be that significant.

I might be overlooking something here. I’ll upload the two model versions.

By the way, I would really love to see the Categorical distribution vectorised, but the related issue seems inactive. Right now I have to rely on a for loop, since the right-hand side is different for every call.
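To make that concrete, the loop looks roughly like this (y, group, and theta are illustrative names, not my actual code):

for (n in 1:N) {
  // each observation gets its own simplex, so there is no single
  // right-hand side a vectorised statement could share
  y[n] ~ categorical(theta[group[n]]);
}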

I also wanted to ask whether somebody sees a smarter way, based on the non-vectorised version, to reduce the two-dimensional arrays theta and phi (appending one to the other) so that I can feed them into two (ideally one) Dirichlet statements without manually recreating these lhs/rhs arrays.
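For context, the manual construction in my vectorised version looks roughly like this (only a sketch; J, K, C, alpha_theta, and alpha_phi are placeholders, not the actual names in the attached model):

// copy both parameter blocks into flat arrays so a single
// Dirichlet statement can cover all of them at once
array[J + K] vector[C] lhs;
array[J + K] vector[C] rhs;
for (j in 1:J) {
  lhs[j] = theta[j];
  rhs[j] = alpha_theta;
}
for (k in 1:K) {
  lhs[J + k] = phi[k];
  rhs[J + k] = alpha_phi;
}
lhs ~ dirichlet(rhs);

Both versions are fit with the same call: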

model$sample(
  data = stan_data,
  chains = 4,
  iter_warmup = 1000,
  iter_sampling = 2000,
  refresh = 50,
  parallel_chains = 4,
  save_warmup = TRUE,
  seed = 1458L
)

This runs on 4 cores and on my machine takes around 640 seconds for the vectorised version vs. 560 seconds for the non-vectorised one.

vectorised.stan (1.5 KB)
non-vectorised.stan (1.0 KB)
baseline_stan_data.rdata (696 Bytes)
Please note: I had to rename the .rds file to .rdata because of an upload restriction.

Are you getting the same answer with both Stan programs?

Vectorization is not always faster in Stan. The underlying C++ loops are very fast, so vectorization is only an advantage when there are shared computations, or when the reduction in the size of the autodiff tree outweighs the memory pressure of constructing the vectors.

For example, suppose we have normal_lpdf(y | mu, sigma) where sigma is a scalar but y (and perhaps mu) is a vector. Stan is smart enough to calculate log(sigma) only once. That’s a huge saving, because log(sigma) is the dominant cost of evaluating normal_lpdf(y[n] | mu[n], sigma). But with your Dirichlet example, there’s no shared work across elements.
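In Stan code, the two forms of that example look like this (a sketch; N, y, mu, and sigma stand in for whatever the model defines):

// looped form: log(sigma) is effectively recomputed for every term
for (n in 1:N) {
  y[n] ~ normal(mu[n], sigma);
}

// vectorised form: normal_lpdf computes log(sigma) once and reuses it
y ~ normal(mu, sigma);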

Sometimes vectorization still helps even when there’s no shared computation. For example, even if we have

for (n in 1:N) {
  y[n] ~ std_normal();
}

it will almost certainly be faster to call y ~ std_normal(), because no additional containers need to be allocated in memory and filled, and it cuts down on the size of the autodiff tree.
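Applied to your Dirichlet case, the two forms being compared presumably look roughly like this (a sketch, not the attached models). The per-element work is identical either way, so the only potential gain from vectorising is a smaller autodiff tree, and that can be outweighed by the cost of assembling the argument arrays:

// looped: one dirichlet_lpdf call per simplex, no work shared across calls
for (j in 1:J) {
  theta[j] ~ dirichlet(alpha[j]);
}

// vectorised: a single call, but each element still needs the same
// lgamma and log computations, so the saving is limited to autodiff overhead
theta ~ dirichlet(alpha);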