# Crossed random effects and map_rect()

## Quick description

I have a multivariate model where the data are partitioned by three independent grouping factors (a, b, c); almost all combinations of the three factors' levels are present in the data in nearly equal numbers. Each response variable is a function of one or more latent variables, which are calculated from a combination of population-level regression coefficients and random intercepts for each group.

I am considering parallelizing the model with map_rect(), and am debating how to split the model parameters into phi (global) and theta (local). The parameters I'll be using in the mapped function are intercepts, betas (population-level regression coefficients), r_a, r_b, and r_c (random intercepts). The intercepts and betas will certainly go into phi, but I'm not as certain about the r_*. Each random effect level will be used in multiple shards, but each r_sum[i] = r_a[aa[i]] + r_b[bb[i]] + r_c[cc[i]] is only used once.
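To make the trade-off concrete, here is a back-of-the-envelope Python sketch (all sizes are made up, not from my actual data) counting how many parameter values cross the map_rect() boundary under each packing scheme:

```python
# Hypothetical illustration (not Stan code): count how many parameter
# values are passed to the shards under the two packing schemes.
# All sizes below are assumptions for the sake of the example.

n_a, n_b, n_c = 20, 15, 10  # levels per grouping factor (assumed)
S = 48                      # number of shards (assumed)
per_shard = 100             # observations per shard (assumed)

# Scheme 1: keep r_a, r_b, r_c in phi -> every shard receives a copy
# of every random-effect level, whether it uses it or not.
global_traffic = S * (n_a + n_b + n_c)

# Scheme 2: precompute r_sum[i] = r_a[aa[i]] + r_b[bb[i]] + r_c[cc[i]]
# outside map_rect() and slice it into theta -> each sum is passed to
# exactly one shard, but the additions that build r_sum now sit in the
# top-level expression graph instead of inside the mapped function.
local_traffic = S * per_shard

print(global_traffic, local_traffic)
```

With these (fabricated) sizes the local scheme actually moves more values, since the total number of observations exceeds S times the number of random-effect levels; the balance flips when the factors have many levels relative to the data per shard.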

### Main question:

Does anyone have intuition about what effect making r_sum local, as opposed to keeping all the r_* in the global parameters, would have on the expression graph?

## Extra details on the model

I'm working on what is essentially a Poisson-Multinomial model with an unknown N, where

\vec y_i \sim \text{Multinom}(N_i, \vec \theta_i)
N_i \sim \text{Poisson}(\lambda_i)

In practice, N_i is marginalized over and the model is expressed as a series of Poisson likelihoods of the type y_{k,i} \sim \text{Poisson}(\lambda_i \theta_{k,i}).
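For intuition, here is a quick Monte Carlo sanity check of that marginalization (pure Python, with made-up values of \lambda and \vec\theta): drawing N \sim \text{Poisson}(\lambda) and then \vec y \sim \text{Multinom}(N, \vec\theta) should give each count a mean of \lambda \theta_k, matching the independent-Poisson form.

```python
import math
import random

random.seed(1)
lam = 5.0
theta = [0.5, 0.3, 0.2]  # made-up proportions for illustration
n_sims = 100_000

def draw_poisson(mean):
    # Knuth's multiplication method; fine for small means.
    L, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

counts = [0, 0, 0]
for _ in range(n_sims):
    N = draw_poisson(lam)
    for _ in range(N):  # assign each individual to a category
        u = random.random()
        k = 0 if u < theta[0] else (1 if u < theta[0] + theta[1] else 2)
        counts[k] += 1

means = [c / n_sims for c in counts]
# Each mean should be close to lam * theta[k].
```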

I believe the underlying process can be conceived as individuals transitioning along a sequence of states (3 or 7, depending on the data set). Each individual has the opportunity to stay in its current state or transition to the next; after a certain amount of time, the number of individuals in each state is counted, which becomes y_i.

The multinomial proportions \vec \theta are calculated from the transition rates \vec p as
\theta_{k,i} = (1 - p_{k,i}) \prod_{j=1}^{k-1} p_{j,i}, with the first and last states being \theta_{1,i} = (1 - p_{1,i}) and \theta_{K,i} = \prod_{j=1}^{K-1} p_{j,i}, respectively.
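In code this construction is just a running product; a minimal Python sketch (function name is mine), taking the K-1 transition rates for one observation and returning the K proportions, which sum to 1 by telescoping:

```python
def transition_probs(p):
    """Map K-1 transition rates to K state-occupancy proportions.

    theta_k = (1 - p_k) * prod_{j<k} p_j; the last state collects
    the individuals that transitioned through every earlier state.
    """
    theta, carry = [], 1.0
    for pk in p:
        theta.append(carry * (1.0 - pk))  # stopped at this state
        carry *= pk                       # moved on to the next state
    theta.append(carry)                   # reached the final state
    return theta
```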

I am trying to estimate how a group of covariates (X) and grouping factors (a, b, c) affect \lambda and \vec p. I'm modeling them as:

\text{logit}(p_{k,i}) = \mu_k + X_{ [i,] } \vec \beta_k + r_{1,k}[a_i] + r_{2,k}[b_i] + r_{3,k}[c_i], with a similar formula for \lambda. r indicates a partially pooled parameter (i.e., a random effect) for one of the grouping factors.
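As a toy numeric sketch of that linear predictor for a single transition rate k (all coefficients and indices below are fabricated; 0-based indexing):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Fabricated example values, not from the real model:
mu = -0.5                        # intercept mu_k for this transition
beta = [0.8, -0.3]               # population-level coefficients beta_k
r_a = [0.1, -0.1]                # random intercepts for factor a
r_b = [0.2]                      # random intercepts for factor b
r_c = [0.0, 0.05]                # random intercepts for factor c

def p_k(x_row, a, b, c):
    """Transition rate for one observation: inverse-logit of
    mu_k + X[i,] * beta_k + r_a[a_i] + r_b[b_i] + r_c[c_i]."""
    eta = mu + sum(xj * bj for xj, bj in zip(x_row, beta))
    eta += r_a[a] + r_b[b] + r_c[c]
    return inv_logit(eta)
```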

## Profiling

I used Stan's profiler to benchmark this with a relatively small subset of the data; the two parts of the model that contributed the bulk of the runtime were constructing the linear predictor (\vec \theta as a function of p) and the Poisson likelihood. The likelihood was implemented with reduce_sum() on a 48-core machine and took about 5 times longer to run than the linear predictor. Within each reduce_sum() partial sum I'm using a single vectorized poisson_log_lupmf(); switching to _lpmf made minimal difference.

I am not an expert on parallelization, but since nobody answered, I'll give it a try.

If there are just three groups, I'd expect the difference between slicing the three associated parameters and sharing them across all slices to be minimal: the main overhead in parallelization is passing the parameters to each slice and then getting the gradients back, and passing a couple more values in each direction should have a small impact.

With a Poisson likelihood and large datasets, there might also be savings to be had from the "sufficient statistics" trick. That is, if some K rows in your dataset have identical values for all predictors, you can collapse them into one row where you multiply the linear predictor's exponential by K and treat the sum of the K outcomes as the single outcome (see "Sufficient statistic" on Wikipedia).
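A toy demonstration of the collapse in Python (made-up data): rows sharing a linear predictor \eta contribute (\sum y_i)\,\eta - K e^{\eta} to the Poisson log-likelihood, up to the \log(y_i!) constants, so merging them leaves the target unchanged.

```python
import math
from collections import defaultdict

# Fabricated example data: five rows, two distinct linear predictors.
eta = [0.2, 0.2, 0.2, 1.1, 1.1]  # log-scale linear predictors
y   = [3,   1,   2,   5,   4]    # Poisson outcomes

def pois_log_lik(y, eta):
    # log Poisson(y | exp(eta)), dropping the log(y!) constants,
    # which do not depend on the parameters.
    return sum(yi * ei - math.exp(ei) for yi, ei in zip(y, eta))

# Collapse rows that share the same linear predictor.
groups = defaultdict(lambda: [0, 0])  # eta -> [K, sum of y]
for yi, ei in zip(y, eta):
    groups[ei][0] += 1
    groups[ei][1] += yi

collapsed = sum(sy * ei - k * math.exp(ei)
                for ei, (k, sy) in groups.items())
# collapsed equals pois_log_lik(y, eta) exactly, with 2 terms
# to evaluate instead of 5.
```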

Hope that helps at least a little bit.
