Crossed random effects and map_rect()

Quick description

I have a multivariate model where the data are partitioned by three independent grouping factors (a, b, c); almost all combinations of the factors are present in the data in nearly equal numbers. Each response variable is a function of one or more latent variables, which are calculated from a combination of population-level regression coefficients and random intercepts for each group.

I am considering parallelizing the model with map_rect(), and am debating how to split the model parameters into phi (global) and theta (local). The parameters I'll be using in the mapped function are the intercepts, the betas (population-level regression coefficients), and r_a, r_b, and r_c (the random intercepts). The intercepts and betas will certainly go into phi, but I'm less certain about the r_*. Each random-effect level will be used in multiple shards, but each r_sum[i] = r_a[aa[i]] + r_b[bb[i]] + r_c[cc[i]] is only used once, so r_sum could instead be precomputed and passed as theta (a sketch of that option follows).
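To make the split concrete, here's a rough sketch of the "r_sum local" option; the packing scheme and all names (shard_ll, etc.) are just illustrative, not my actual code:

```stan
functions {
  // phi = [mu, beta]; theta = per-observation r_sum for this shard;
  // x_i = {n, y[1], ..., y[n]}; x_r = row-major covariates (n * p values).
  vector shard_ll(vector phi, vector theta,
                  data array[] real x_r, data array[] int x_i) {
    int n = x_i[1];                    // observations in this shard
    real mu = phi[1];                  // population-level intercept
    int p = num_elements(phi) - 1;     // number of covariates
    vector[p] beta = phi[2:(p + 1)];
    vector[n] eta;
    for (i in 1:n) {
      vector[p] x = to_vector(x_r[((i - 1) * p + 1):(i * p)]);
      eta[i] = mu + dot_product(x, beta) + theta[i];  // theta[i] = r_sum[i]
    }
    return [poisson_log_lpmf(x_i[2:(n + 1)] | eta)]';
  }
}
```

The alternative would put r_a, r_b, and r_c into phi and ship the index arrays aa, bb, cc through x_i instead.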

Main question:

Does anyone have any intuition as to what effect making r_sum local, versus keeping all the r_* in the global parameters, would have on the expression graph?

Extra details on the model

I’m working on what is essentially a Poisson-Multinomial model with an unknown N, where

\vec y_i \sim \text{Multinom}(N_i, \vec \theta_i)
N_i \sim \text{Poisson}(\lambda_i)

In practice, N_i is marginalized out and the model is expressed as a series of independent Poisson likelihoods of the form y_{k,i} \sim \text{Poisson}(\lambda_i \theta_{k,i}).
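This works because, with N_i = \sum_k y_{k,i} and \sum_k \theta_{k,i} = 1, the joint density factors into independent Poissons:

p(\vec y_i) = \frac{e^{-\lambda_i} \lambda_i^{N_i}}{N_i!} \cdot \frac{N_i!}{\prod_k y_{k,i}!} \prod_k \theta_{k,i}^{y_{k,i}} = \prod_k \frac{e^{-\lambda_i \theta_{k,i}} (\lambda_i \theta_{k,i})^{y_{k,i}}}{y_{k,i}!}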

I believe the underlying process can be conceived as individuals transitioning along a sequence of states (3 or 7, depending on the data set). Each individual has the opportunity to stay in its current state or transition to the next; after a certain amount of time, the number of individuals in each state is counted, and these counts form \vec y_i.

The multinomial proportions \vec \theta are calculated from the transition rates \vec p as

\theta_{k,i} = (1 - p_{k,i}) \prod_{j=1}^{k-1} p_{j,i}

with the first and last states being \theta_{1,i} = 1 - p_{1,i} and \theta_{K,i} = \prod_{j=1}^{K-1} p_{j,i}, respectively.
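In code, computing \vec \theta from \vec p for one observation looks roughly like this (a sketch; theta_from_p is a made-up name, and I assume K states with K - 1 transition rates):

```stan
vector theta_from_p(vector p) {
  int K = num_elements(p) + 1;  // K states, K - 1 transition rates
  vector[K] theta;
  real cum = 1;                 // running product p[1] * ... * p[k-1]
  for (k in 1:(K - 1)) {
    theta[k] = (1 - p[k]) * cum;
    cum *= p[k];
  }
  theta[K] = cum;               // prod_{j=1}^{K-1} p[j]
  return theta;
}
```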

I am trying to estimate how a group of covariates (X) and grouping factors (a,b,c) affect \lambda and \vec p. I’m modeling them as:

\text{logit}(p_{k,i}) = \mu_k + X_{[i,]} \vec \beta_k + r_{1,k}[a_i] + r_{2,k}[b_i] + r_{3,k}[c_i]

with a similar formula (on the log scale) for \lambda. r indicates a partially pooled parameter (i.e., a random effect) for one of the grouping factors.
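In Stan, the predictor for one transition rate k looks roughly like this (a sketch; aa, bb, cc are the observation-level group index arrays):

```stan
// r_1, r_2, r_3 are arrays of vectors, one vector per transition rate;
// aa, bb, cc are array[N] int mapping observations to group levels.
vector[N] eta_k = mu[k] + X * beta[k]
                  + r_1[k][aa] + r_2[k][bb] + r_3[k][cc];
vector[N] p_k = inv_logit(eta_k);
```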

Profiling

I used Stan’s profiler to benchmark this with a relatively small subset of the data; the two parts of the model that accounted for the bulk of the runtime were defining the linear predictor (\vec \theta as a function of \vec p) and the Poisson likelihood. The likelihood was implemented with reduce_sum() on a 48-core machine, and took about 5 times longer to run than the linear predictor. Within each reduce_sum() thread, I’m using a single vectorized poisson_log_lupmf(); switching to _lpmf made little difference to performance.
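For reference, the partial-sum function is essentially the pattern from the user's guide (simplified; partial_poisson_lpmf and log_rate are stand-in names):

```stan
functions {
  // Slicing is over observations; the per-observation log rates
  // (log lambda_i + log theta_{k,i}, flattened) are a shared argument.
  real partial_poisson_lpmf(array[] int y_slice, int start, int end,
                            vector log_rate) {
    return poisson_log_lupmf(y_slice | log_rate[start:end]);
  }
}
```

called in the model block as target += reduce_sum(partial_poisson_lupmf, y, grainsize, log_rate);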

I am not an expert on parallelization, but since nobody answered, I’ll give it a try.

If there are just three grouping factors, I’d expect the difference between splitting the three associated parameters and sharing them across all slices to be minimal: the main overhead in parallelization is passing the parameters to each slice and then getting the gradients back, and passing two more values in each direction should have a small impact.

With a Poisson likelihood and large datasets, there might also be savings to be made with the “sufficient statistics” trick: if some K rows in your dataset have identical values for all predictors, you can collapse them into a single row, multiply the Poisson rate by K (i.e., add \log K to the linear predictor on the log scale), and treat the sum of the K outcomes as the single outcome (see Sufficient statistic - Wikipedia). A sketch is below.
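For example, assuming the duplicate rows have been grouped in advance (all names here are made up):

```stan
// y_sum[g]: summed outcome over the n_dup[g] identical rows in group g;
// eta[g]:  their shared linear predictor.
// The sum of n iid Poisson(lambda) draws is Poisson(n * lambda),
// i.e. add log(n) to the linear predictor on the log scale:
target += poisson_log_lpmf(y_sum | eta + log(to_vector(n_dup)));
```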

Hope that helps at least a little bit.
