Reducing parameter space when predicting random effect covariances

I am working on a model where I want to predict how the covariances among individuals’ random intercepts across multiple outcomes change as a function of a continuous predictor. I’m going to ignore various efficiency gains in the model (e.g. predicting canonical partial correlations) to focus on the heart of my question. The specific issue I’d like to resolve is how to reduce the number of random effects required to estimate the model.

Consider outcome measures y_1,...,y_p with repeated individual measurements, where I am interested in estimating how the covariance matrix P between individuals’ random intercepts changes as a function of some contextual covariate x (e.g. the age of the subject during the measurement, their sex, the number of training trials, etc.). The same individual may be measured under multiple values of x.

My basic modeling strategy is to express the components of the context-specific correlation matrix as linear predictors, such that e.g. r_{(x=0.5)1,p} = tanh(\beta_{0_r}+\beta_r *0.5) is the predicted correlation between random intercepts for outcome 1 and p at value x=0.5., with \beta_r describing the fixed effect of x on the transformed scale. This approach results in vectors of length c for each non-redundant element of the correlation matrix, where c is the number of unique measured values of x (e.g. all subjects may be measured at time 1, 2, and 3). The same approach is taken for predicting changes in the random intercept standard deviations, \sigma_{(x=0.5)p} = sqrt(exp(\beta_{0_\sigma} + \beta_\sigma * 0.5)).

I then combine these vectors together into c distinct Cholesky factorized correlation matrices R_{(x)} and vectors of standard deviations s_{(x)}, which can be used to generate an array of c distinct Cholesky factorized covariance matrices P_{(x)} for each measured value of the covariate.

matrix[p,p] Px[c];
for(n in 1:c){ Px[n] = diag_pre_multiply(sx[n], Rx[n]);}

This strategy works very well because P_{(x)} are predicted from a much smaller set of parameters and so can easily scale up regardless of the number of unique covariates values (large c). However, the issue now arises when using these matrices to scale the random effects for inclusion in the likelihood function. It would be straightforward if I was only interested in predicting residual (co)variances, but for repeated measurements the individual-level random effects have to be explicitly parameterized. The standard approach to estimating multivariate random effects is to multiply an i x p matrix Z of unscaled z-scores, for i individuals across the p outcomes, by the transpose of the Cholesky factorized covariance matrix P. For the context-specific case, this looks something like (assuming independent residuals)

matrix[i,p] Z[c];

matrix[p,p] Px[c];
matrix[i,p] Zx[c];
for(n in 1:c){ 
   Px[n] = diag_pre_multiply(sx[n], Rx[n]);}
   Zx[n] = Z[n] * Px[n]';}
for(n in 1:N){
  y1[n] ~ normal( col(Zx[context_id[n]],1)[subject_id[n]] + ...);
  y2[n] ~ normal( col(Zx[context_id[n]],2)[subject_id[n]] + ...);

I’ve found this model fits relatively efficiently and recovers accurate parameter values when the number of unique covariate values c or the number of individuals i is very small. However, due to the random effect matrices Z_1,...,Z_c being specified as parameters, it scales very poorly with increasing c and i. E.g. if we measure 1000 subjects across 100 distinct covariate values, then the model needs to estimate 100 matrices of 1000 x p size. For continuous predictors, the value of c could of course be easily much larger for a more complex model.

I am not sure if there is any real way out of this problem, but any advice on how I might reparameterize the random effects to avoid specifying matrix[i,p] Z[c] in the parameter block or otherwise increase the efficiency of its sampling would be much appreciated.

This gives you the multivariate non-centered parameterization. I found it very challenging to write that up in the User’s Guide.

Unfortunately, the part of the code you’re showing here is reasonably tight. The only way to speed it up would be figuring out if there are higher-order operations that can be carried out, but Stan doesn’t support the grouped tensor operations you’d need for the loop over c. (As an aside, I find the loop from n in 1:c and then another loop n in 1:n to be confusing as to why loops aren’t over the same range—we generally recommend something like for c in 1:C and for n in 1:N to make them easier to interpret.)

I think it may be possible to clean up the indexing into Zx. You can definitely save at least a transpose by storing the transpose Px_t as a value instead of Px (which I think means transposing Rx[n]). I don’t immediately see how to make it more efficient.

I’m a little confused on why there are only two columns of Zx[context_id[n]] being indexed here.

P.S. To make your code easier to read for others, I’d recommend following our formatting guidelines (space around operators, space after commas, space after “for”, no space after opening parens, closing braces of loops on their own lines, etc.).