Awkward (inefficient?) syntax with multidimensional parameter arrays

I am trying to write models in which

(1) parameters are indexed arrays, such as theta[age, sex, region]. In my particular problem there are A=90 ages, S=2 sexes, R=402 regions.

(2) there is prior information about likely parameter differences across different dimensions, such as
(a) theta[a,,] - theta[a-1,,] ~ N(0, sigma_age) for a = 2…A
(b) theta[,1,] - theta[,2,] ~ N(0, sigma_sex)
(c) theta[,,reg1] - theta[,,reg2] ~ N(0, sigma_reg) for many (reg1, reg2) pairs
(d) etc.

Stan often frustrates my attempts at efficiency because operators like “-” and “+” are not defined for arrays. For example,

parameters {
  real theta[A,2,R];
}

transformed parameters {
  real age_diff[A-1,2,R];
  real sex_diff[A,R];

  for (a in 2:A) {
    age_diff[a-1] = theta[a] - theta[a-1];
  }

  sex_diff = theta[1:A, 1, 1:R] - theta[1:A, 2, 1:R];
}

is not allowed because “-” can’t be used for elementwise subtraction on arrays.

Similarly, it seems that distributional statements like

age_diff ~ normal(0, sigma_age);
sex_diff ~ normal(0, sigma_sex);

are not allowed because the left-hand side cannot be a multidimensional array. I end up converting with

  to_array_1d(age_diff) ~ normal(0, sigma_age);
  to_array_1d(sex_diff) ~ normal(0, sigma_sex);

but I’m concerned that the recasting via to_array_1d is slowing the program. Recasting is also necessary, as far as I can tell, for many intermediate calculations involving arrays and sub-arrays that I’ve omitted here.

Are there any standard work-arounds for these sorts of problems with multidimensional arrays? Does recasting to and from multidim arrays really slow things down? Are there especially efficient ways to write the distributional statements?

Thanks.

Store arrays of vectors or matrices and don’t worry about looping over them; it’s pretty fast. Vectorization in Stan doesn’t speed things up the way it does in R.
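
For concreteness, here is a minimal sketch of that layout (a sketch only, assuming A and R are passed in as data; the sigma parameters and the model block are omitted): the sex-by-region slice for each age is stored as a matrix, so “-” works inside the loop.

data {
  int<lower=2> A;                      // number of ages
  int<lower=1> R;                      // number of regions
}
parameters {
  matrix[2, R] theta[A];               // one sex-by-region matrix per age
}
transformed parameters {
  matrix[2, R] age_diff[A - 1];
  for (a in 2:A)
    age_diff[a - 1] = theta[a] - theta[a - 1];   // matrix subtraction is elementwise
}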

Is that generally true for (1) intermediate calculations, (2) distributional statements, or both?

I’ve experimented quite a bit with different data and parameter structures (arrays of matrices or vectors, very long 1-d arrays with careful indexing to get the right (age, sex, region) combinations, and so on), but the sampling is still extremely slow compared to other Stan models that I’ve run.

So vectorization can be really important if you use the vectorized operations that reduce the amount of autodiff that happens. For the most part, don’t worry about addition/multiplication, but distributional calculations can be very expensive, and some of them have been optimized under the hood in their vectorized versions.
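
To make the contrast concrete, here is a hypothetical model-block snippet, assuming the age_diff array of matrices from the sketch above and a positive-constrained parameter sigma_age:

model {
  // expensive: one normal_lpdf node in the autodiff graph per scalar
  // for (a in 1:(A - 1))
  //   for (s in 1:2)
  //     for (r in 1:R)
  //       age_diff[a, s, r] ~ normal(0, sigma_age);

  // cheaper: one vectorized normal_lpdf call per age slice,
  // which shares the log(sigma_age) computation across all 2 * R terms
  for (a in 1:(A - 1))
    to_vector(age_diff[a]) ~ normal(0, sigma_age);
}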


That was intentional, not an oversight. If you want to use arithmetic on containers, use vectors or matrices.
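
One way to take that advice for the sex contrast is to split the sex dimension into two A-by-R matrices; theta_m and theta_f are illustrative names, not something from this thread:

data {
  int<lower=2> A;
  int<lower=1> R;
}
parameters {
  matrix[A, R] theta_m;                            // sex 1
  matrix[A, R] theta_f;                            // sex 2
}
transformed parameters {
  matrix[A, R] sex_diff;
  matrix[A - 1, R] age_diff_m;
  sex_diff = theta_m - theta_f;                    // elementwise matrix subtraction
  age_diff_m = theta_m[2:A] - theta_m[1:(A - 1)];  // row-sliced age differences
}

The distributional statements can then be written as to_vector(sex_diff) ~ normal(0, sigma_sex); and so on, with no conversions from multidimensional arrays.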

Almost all of them have. Loops are fast in Stan, but calling anything that has a derivative (any function of a parameter) isn’t. When you use vectorized forms, we can reduce the size of the expression graph and reuse shared computations. The chapter in the manual on optimization says more.

Usually, improving mixing is much more important than improving efficiency once everything’s been reasonably vectorized. That often means choosing easier-to-sample parameterizations.

This is putting multiple priors on theta, which means they multiply (add on the log scale). Not sure if that’s your intention, but Stan lets you do it. Your model winds up having complex multivariate priors, rather than ones that can be factored using something like this:

p(theta | ...) = PROD_i PROD_j PROD_k p(theta[i, j, k] | ...), which in Stan corresponds to independent statements of the form theta[i, j, k] ~ ...;

This is going to defeat things like the non-centered parameterization, so you’re going to wind up with strong correlations between sigma_reg, sigma_age, and theta in the posterior — it’s the classic funnel problem described in the manual.
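
For reference, this is roughly what a non-centered (standardized-increment) parameterization looks like for the age random walk alone, with sex and region dropped for clarity. It is a sketch of the general technique with illustrative priors, not of this particular model, and whether it combines cleanly with the sex and region priors is exactly the issue above:

data {
  int<lower=2> A;
}
parameters {
  real theta1;                 // level at the first age
  vector[A - 1] z_age;         // standardized age increments
  real<lower=0> sigma_age;
}
transformed parameters {
  vector[A] theta;
  theta[1] = theta1;
  for (a in 2:A)
    theta[a] = theta[a - 1] + sigma_age * z_age[a - 1];
}
model {
  theta1 ~ normal(0, 10);      // illustrative prior on the starting level
  z_age ~ normal(0, 1);        // implies theta[a] - theta[a-1] ~ normal(0, sigma_age)
  sigma_age ~ normal(0, 1);    // illustrative half-normal prior on the scale
}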

It’s not technically a cast in the programming language sense, so it really is allocating more memory and copying. But as @sakrejda points out, that’s rarely a bottleneck.