Awkward (inefficient?) syntax with multidimensional parameter arrays

I am trying to write models in which

(1) parameters are indexed arrays, such as theta[age, sex, region]. In my particular problem there are A=90 ages, S=2 sexes, R=402 regions.

(2) there is prior information about likely parameter differences across different dimensions, such as
(a) theta[a,,] - theta[a-1,,] ~ N(0, sigma_age) for a = 2…A
(b) theta[,1,] - theta[,2,] ~ N(0, sigma_sex)
(c) theta[,,reg1] - theta[,,reg2] ~ N(0, sigma_reg) for many (reg1, reg2) pairs
(d) etc.

Stan often frustrates my attempts at efficiency because operators like “-” and “+” are not defined for arrays. For example,

parameters {
  real theta[A,2,R];
}

transformed parameters {
  real age_diff[A-1,2,R];
  real sex_diff[A,R];

  for (a in 2:A) {
    age_diff[a-1] = theta[a] - theta[a-1];
  }

  sex_diff = theta[1:A, 1, 1:R] - theta[1:A, 2, 1:R];
}

is not allowed because “-” can’t be used for elementwise subtraction on arrays.

Similarly, it seems that distributional statements like

age_diff ~ normal(0, sigma_age);
sex_diff ~ normal(0, sigma_sex);

are not allowed because the left-hand side cannot be a multidimensional array. I end up converting with

  to_array_1d(age_diff) ~ normal(0, sigma_age);
  to_array_1d(sex_diff) ~ normal(0, sigma_sex);

but I’m concerned that the recasting via to_array_1d is slowing the program. Recasting is also necessary, as far as I can tell, for many intermediate calculations involving arrays and sub-arrays that I’ve omitted here.

Are there any standard work-arounds for these sorts of problems with multidimensional arrays? Does recasting to and from multidim arrays really slow things down? Are there especially efficient ways to write the distributional statements?

Thanks.

Store arrays of vectors or matrices and don’t worry about looping over them; it’s pretty fast. Vectorization in Stan doesn’t speed things up the way it does in R.
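
For concreteness, here is a minimal sketch of that layout (a sketch only, assuming A and R are passed in as data; the sigma parameters and the model block are omitted): the sex-by-region slice for each age is stored as a matrix, so “-” works inside the loop.

data {
  int<lower=2> A;                      // number of ages
  int<lower=1> R;                      // number of regions
}
parameters {
  matrix[2, R] theta[A];               // one sex-by-region matrix per age
}
transformed parameters {
  matrix[2, R] age_diff[A - 1];
  for (a in 2:A)
    age_diff[a - 1] = theta[a] - theta[a - 1];   // matrix subtraction is elementwise
}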

Is that generally true for (1) intermediate calculations, (2) distributional statements, or both?

I’ve experimented quite a bit with different data and parameter structures (arrays of matrices or vectors, very long 1-d arrays with careful indexing to get the right (age, sex, region) combinations, and so on), but the sampling is still extremely slow compared to other Stan models that I’ve run.

So vectorization can be really important if you use the vectorized operations that reduce the amount of autodiff that happens. For the most part, don’t worry about addition/multiplication, but distributional calculations can be very expensive, and some of them have been optimized under the hood in their vectorized versions.
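
To make the contrast concrete, here is a hypothetical model-block snippet, assuming the age_diff array of matrices from the sketch above and a positive-constrained parameter sigma_age:

model {
  // expensive: one normal_lpdf node in the autodiff graph per scalar
  // for (a in 1:(A - 1))
  //   for (s in 1:2)
  //     for (r in 1:R)
  //       age_diff[a, s, r] ~ normal(0, sigma_age);

  // cheaper: one vectorized normal_lpdf call per age slice,
  // which shares the log(sigma_age) computation across all 2 * R terms
  for (a in 1:(A - 1))
    to_vector(age_diff[a]) ~ normal(0, sigma_age);
}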


That was intentional, not an oversight. If you want to use arithmetic on containers, use vectors or matrices.
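
One way to take that advice for the sex contrast is to split the sex dimension into two A-by-R matrices; theta_m and theta_f are illustrative names, not something from this thread:

data {
  int<lower=2> A;
  int<lower=1> R;
}
parameters {
  matrix[A, R] theta_m;                            // sex 1
  matrix[A, R] theta_f;                            // sex 2
}
transformed parameters {
  matrix[A, R] sex_diff;
  matrix[A - 1, R] age_diff_m;
  sex_diff = theta_m - theta_f;                    // elementwise matrix subtraction
  age_diff_m = theta_m[2:A] - theta_m[1:(A - 1)];  // row-sliced age differences
}

The distributional statements can then be written as to_vector(sex_diff) ~ normal(0, sigma_sex); and so on, with no conversions from multidimensional arrays.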

Almost all of them have. Loops are fast in Stan, but calling anything that has a derivative (any function of a parameter) isn’t. When you use vectorized forms, we can reduce the size of the expression graph and reuse shared computations. The chapter in the manual on optimization says more.

Usually, improving mixing is much more important than improving efficiency once everything’s been reasonably vectorized. That often means choosing easier-to-sample parameterizations.

This is putting multiple priors on theta, which means they multiply (add on the log scale). Not sure if that’s your intention, but Stan lets you do it. Your model winds up having complex multivariate priors, rather than ones that can be factored using something like this:

p(theta | ...) = PROD_i PROD_j PROD_k p(theta[i, j, k] | ...), which in Stan corresponds to independent statements of the form theta[i, j, k] ~ ...;

This is going to defeat things like the non-centered parameterization, so you’re going to wind up with strong correlations between sigma_reg, sigma_age, and theta in the posterior — it’s the classic funnel problem described in the manual.
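
For reference, this is roughly what a non-centered (standardized-increment) parameterization looks like for the age random walk alone, with sex and region dropped for clarity. It is a sketch of the general technique with illustrative priors, not of this particular model, and whether it combines cleanly with the sex and region priors is exactly the issue above:

data {
  int<lower=2> A;
}
parameters {
  real theta1;                 // level at the first age
  vector[A - 1] z_age;         // standardized age increments
  real<lower=0> sigma_age;
}
transformed parameters {
  vector[A] theta;
  theta[1] = theta1;
  for (a in 2:A)
    theta[a] = theta[a - 1] + sigma_age * z_age[a - 1];
}
model {
  theta1 ~ normal(0, 10);      // illustrative prior on the starting level
  z_age ~ normal(0, 1);        // implies theta[a] - theta[a-1] ~ normal(0, sigma_age)
  sigma_age ~ normal(0, 1);    // illustrative half-normal prior on the scale
}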

It’s not technically a cast in the programming language sense, so it really is allocating more memory and copying. But as @sakrejda points out, that’s rarely a bottleneck.