Partly vectorised categorical / Dirichlet model runs slower than non-vectorised model

I’m trying to fit a hierarchical model with a Categorical likelihood and some hierarchical Dirichlet priors. I’ve worked on quite a number of models and have found that vectorising the Dirichlet statements, even though it sometimes makes my code uglier, speeds up sampling considerably.

I’ve now encountered a case where the vectorised version samples around 10-20% slower than the non-vectorised one. I’ve tried to make sense of this but couldn’t; it seems unlikely to me that the overhead of building the array for the vectorised statement could be that significant.

I might be overlooking something here. I’ll upload the two model versions.

By the way, I would really love to see the Categorical distribution vectorised, but the related issue seems inactive. Right now I have to rely on a for loop, since the right-hand side is different for every call.
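To make that concrete, the loop looks roughly like this (y, group, and theta are illustrative names, not my actual code):

for (n in 1:N) {
  // each observation gets its own simplex, so there is no single
  // right-hand side a vectorised statement could share
  y[n] ~ categorical(theta[group[n]]);
}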

I also wanted to ask whether somebody sees a smarter way, based on the non-vectorised version, to reduce the two-dimensional arrays theta and phi (appending one to the other) so that I can feed them into two (ideally one) Dirichlet statements without manually recreating these lhs/rhs arrays.
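For context, the manual construction in my vectorised version looks roughly like this (only a sketch; J, K, C, alpha_theta, and alpha_phi are placeholders, not the actual names in the attached model):

// copy both parameter blocks into flat arrays so a single
// Dirichlet statement can cover all of them at once
array[J + K] vector[C] lhs;
array[J + K] vector[C] rhs;
for (j in 1:J) {
  lhs[j] = theta[j];
  rhs[j] = alpha_theta;
}
for (k in 1:K) {
  lhs[J + k] = phi[k];
  rhs[J + k] = alpha_phi;
}
lhs ~ dirichlet(rhs);

Both versions are fit with the same call: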

model$sample(
  data = stan_data,
  chains = 4,
  iter_warmup = 1000,
  iter_sampling = 2000,
  refresh = 50,
  parallel_chains = 4,
  save_warmup = TRUE,
  seed = 1458L
)

This runs on 4 cores and on my machine takes around 640 seconds for the vectorised version vs. 560 seconds for the non-vectorised one.

vectorised.stan (1.5 KB)
non-vectorised.stan (1.0 KB)
baseline_stan_data.rdata (696 Bytes)
Please note: I had to rename the .rds file to .rdata because of an upload restriction.

Are you getting the same answer with both Stan programs?

Vectorization is not always faster in Stan. The underlying C++ loops are very fast, so vectorization is only an advantage when there are shared computations, or when the reduction in the size of the autodiff tree outweighs the memory pressure of constructing the vectors.

For example, suppose we have normal_lpdf(y | mu, sigma) where sigma is a scalar but y (and perhaps mu) is a vector. Stan is smart enough to calculate log(sigma) only once. That’s a huge saving, because log(sigma) is the dominant cost of evaluating normal_lpdf(y[n] | mu[n], sigma). But with your Dirichlet example, there’s no shared work across elements.
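In Stan code, the two forms of that example look like this (a sketch; N, y, mu, and sigma stand in for whatever the model defines):

// looped form: log(sigma) is effectively recomputed for every term
for (n in 1:N) {
  y[n] ~ normal(mu[n], sigma);
}

// vectorised form: normal_lpdf computes log(sigma) once and reuses it
y ~ normal(mu, sigma);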

Sometimes vectorization still helps even when there’s no shared computation. For example, even if we have

for (n in 1:N) {
  y[n] ~ std_normal();
}

it will almost certainly be faster to call y ~ std_normal(), because no additional containers need to be allocated in memory and filled, and it cuts down on the size of the autodiff tree.
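Applied to your Dirichlet case, the two forms being compared presumably look roughly like this (a sketch, not the attached models). The per-element work is identical either way, so the only potential gain from vectorising is a smaller autodiff tree, and that can be outweighed by the cost of assembling the argument arrays:

// looped: one dirichlet_lpdf call per simplex, no work shared across calls
for (j in 1:J) {
  theta[j] ~ dirichlet(alpha[j]);
}

// vectorised: a single call, but each element still needs the same
// lgamma and log computations, so the saving is limited to autodiff overhead
theta ~ dirichlet(alpha);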