## Quick description

I have a multivariate model where the data are partitioned by three independent grouping factors (a, b, c); almost all combinations of the factor levels are present in the data in nearly equal numbers. Each response variable is a function of one or more latent variables, which are calculated from a combination of population-level regression coefficients and random intercepts for each group.

I am considering parallelizing the model with `map_rect()`, and am debating how to split the model parameters into `phi` (global) and `theta` (local). The parameters I'll be using in the mapped function are `intercepts`, `betas` (population-level regression coefficients), `r_a`, `r_b`, and `r_c` (random intercepts). The `intercepts` and `betas` will certainly go into `phi`, but I'm not as certain about the `r_*`. Each random effect level will be used in multiple shards, but each `r_sum[i] = r_a[aa[i]] + r_b[bb[i]] + r_c[cc[i]]` is only used once.
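To make the two packing options concrete, here is a minimal two-factor sketch of the mapped function. The packing offsets, index arrays (`aa`, `bb`), and `x_i` layout are illustrative assumptions, not the actual model:

```stan
functions {
  // Option A: raw random effects travel in phi (global).
  // Every shard indexes into the full r_a / r_b vectors, so each
  // random-effect element is a node shared by several shards'
  // expression graphs.
  vector lp_global(vector phi, vector theta,
                   data array[] real x_r, data array[] int x_i) {
    int n = x_i[1];
    int n_a = x_i[2];
    int n_b = x_i[3];
    real mu = phi[1];
    vector[n_a] r_a = phi[2:(1 + n_a)];
    vector[n_b] r_b = phi[(2 + n_a):(1 + n_a + n_b)];
    // remaining x_i entries hold this shard's index arrays and counts
    array[n] int aa = x_i[4:(3 + n)];
    array[n] int bb = x_i[(4 + n):(3 + 2 * n)];
    array[n] int y = x_i[(4 + 2 * n):(3 + 3 * n)];
    vector[n] eta = mu + r_a[aa] + r_b[bb];
    return [poisson_log_lupmf(y | eta)]';
  }

  // Option B: r_sum is precomputed outside map_rect (e.g. in
  // transformed parameters) and this shard's slice arrives in theta
  // (local); each element appears in exactly one shard's graph.
  vector lp_local(vector phi, vector theta,
                  data array[] real x_r, data array[] int x_i) {
    int n = x_i[1];
    array[n] int y = x_i[2:(1 + n)];
    vector[n] eta = phi[1] + theta[1:n];
    return [poisson_log_lupmf(y | eta)]';
  }
}
```

In option B the per-shard autodiff graph only sees `n` local parameter nodes, but the `r_sum` additions themselves are computed (and differentiated) outside the mapped function.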

### Main question:

Does anyone have any intuition as to what effect making `r_sum` local, versus keeping all of the `r_*` in the global parameters, would have on the expression graph?

## Extra details on the model

I'm working on what is essentially a Poisson-Multinomial model with an unknown N, where

$$\vec y_i \sim \text{Multinom}(N_i, \vec \theta_i)$$

$$N_i \sim \text{Poisson}(\lambda_i)$$

In practice, $N_i$ is marginalized over and the model is expressed as a series of Poisson likelihoods of the type $y_{k,i} \sim \text{Poisson}(\lambda_i \theta_{k,i})$.
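That marginalization identity can be checked numerically; here is a small stdlib-only sketch (the values of `y`, `theta`, and `lam` are made up for the check):

```python
import math

def poisson_logpmf(y, mu):
    # log Poisson(y | mu)
    return y * math.log(mu) - mu - math.lgamma(y + 1)

def multinomial_logpmf(y, theta):
    # log Multinom(y | N = sum(y), theta)
    n = sum(y)
    return (math.lgamma(n + 1)
            - sum(math.lgamma(yk + 1) for yk in y)
            + sum(yk * math.log(tk) for yk, tk in zip(y, theta)))

# Joint model: y ~ Multinom(N, theta) with N = sum(y), N ~ Poisson(lam).
# Marginalized form: y_k ~ Poisson(lam * theta_k), independently.
y = [3, 5, 2]
theta = [0.2, 0.5, 0.3]
lam = 11.0

joint = multinomial_logpmf(y, theta) + poisson_logpmf(sum(y), lam)
marginal = sum(poisson_logpmf(yk, lam * tk) for yk, tk in zip(y, theta))
print(abs(joint - marginal))  # agrees to floating-point precision
```

The two log densities coincide because $\sum_k y_k = N$ and $\sum_k \theta_k = 1$, which is what lets the $N_i$ drop out of the model.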

I believe the underlying process can be conceived of as individuals transitioning along a sequence of states (3 or 7, depending on the data set). Each individual has the opportunity to stay in its current state or transition to the next; after a certain amount of time, the number of individuals in each state is counted, and these counts become $\vec y_i$.

The multinomial proportions $\vec \theta$ are calculated from the transition rates $\vec p$ as

$$\theta_{k,i} = (1 - p_{k,i}) \prod_{j=1}^{k-1} p_{j,i},$$

with the first and last states being $\theta_{1,i} = 1 - p_{1,i}$ and $\theta_{K,i} = \prod_{j=1}^{K-1} p_{j,i}$, respectively.
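This stick-breaking-style construction guarantees the proportions sum to one. A quick sketch, assuming `p` holds the $K-1$ transition rates (the values below are made up):

```python
def theta_from_p(p):
    """Multinomial proportions from K-1 transition rates.

    theta_k = (1 - p_k) * prod_{j<k} p_j for k < K, and the last
    state absorbs the remaining mass: theta_K = prod_j p_j.
    """
    theta = []
    survive = 1.0  # probability of having transitioned past all earlier states
    for pk in p:
        theta.append((1.0 - pk) * survive)
        survive *= pk
    theta.append(survive)  # final state K
    return theta

p = [0.7, 0.4, 0.9]       # K - 1 = 3 transition rates -> K = 4 states
theta = theta_from_p(p)
print(theta, sum(theta))  # proportions sum to 1
```

Accumulating the running product `survive` also avoids recomputing the prefix products for each state, which is the same saving available when defining $\vec\theta$ in the Stan model.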

I am trying to estimate how a group of covariates ($X$) and the grouping factors ($a, b, c$) affect $\lambda$ and $\vec p$. I'm modeling them as

$$\text{logit}(p_{k,i}) = \mu_k + X_{[i,]} \vec \beta_k + r_{1,k}[a_i] + r_{2,k}[b_i] + r_{3,k}[c_i],$$

with a similar formula for $\lambda$. Here $r$ indicates a partially pooled parameter (i.e., a random effect) for one of the grouping factors.
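In Stan, the linear predictor for a single transition rate $k$ might look like the following sketch; the names (`mu`, `beta`, `r1`/`r2`/`r3`, the index arrays `a_idx`/`b_idx`/`c_idx`) are illustrative assumptions mirroring the formula above, not the actual model code:

```stan
transformed parameters {
  // logit-scale linear predictor for one transition rate k;
  // the full model repeats this per rate (and similarly for lambda)
  vector[N] eta_p = mu + X * beta   // population-level part
                    + r1[a_idx]     // random intercepts, factor a
                    + r2[b_idx]     // factor b
                    + r3[c_idx];    // factor c
  vector[N] p_k = inv_logit(eta_p);
}
```

The three vectorized indexing terms here are exactly the `r_sum` quantity from the `map_rect()` question above.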

## Profiling

I used Stan's profiler to benchmark this with a relatively small subset of the data; the two parts of the model that contributed the bulk of the runtime were defining the linear predictor ($\vec \theta$ as a function of $p$) and the Poisson likelihood. The likelihood was implemented with `reduce_sum()` on a 48-core machine, and took about 5 times longer to run than the linear predictor. Within each `reduce_sum()` thread, I'm using a single vectorized `poisson_log_lupmf()`; switching to `_lpmf` has minimal effect on performance.