Example of reparameterisation of multivariate normal parameters in a mixed effects model

Hi,
I'm coding up quite a complex model in Stan, and it is running incredibly slowly. I was advised to reparameterise the model to speed up fitting. So I'm trying to work out the correct way to reparameterise a much simpler three-level hierarchical longitudinal mixed effects model; if I know how to do that, I'm confident I can apply it to the much more complex model. (I'm aware that you can fit this particular model in other packages, but this is an exercise to help me understand what I need to do with the more complex model!)

I think there are two general areas to reparameterise: the fixed effects (labelled with betas), and the multivariate normal random effects (labelled with b, where b2 are the level 2 random effects and b3 the level 3 random effects).

For the fixed effects, I'm not sure if the model below is correct: I sample the unit-scale parameters beta_L_all_unit, but use the non-unit-scale fixed effect parameters beta_L_all to build the longitudinal linear predictor in the calc_Linpred_y function. Have I scaled those parameters correctly?

Secondly, I have two sets of random effects, b2_L and b3_L, at different levels. Have I managed to scale these correctly as well?

In the manual I was reading about the difference between reparameterising and transforming, and that transforming requires a Jacobian adjustment. Am I correct that I don't need one here? (Apologies, that's more of a mathematical question.)

Lastly, the estimation of the random effects seems to take ages. In my data example, M > 300,000, N is roughly 30,000, and K = 5; those proportions would hold for most applications, with M much larger than N and N much, much larger than K. This is the number of longitudinal measurements M nested within the number of individuals N at the second level, nested within the number of studies K at the third level. Any advice on how to speed up their estimation, or the model in general, would be amazing. I need b2_L and b3_L to calculate the longitudinal linear predictor, but I don't necessarily need their values saved or output, as long as I have D_L and A_L.

Any advice gratefully appreciated

functions{
  //expand matrix X so that row i is repeated m[i] times (long format, one row per observation)
  matrix matrixlongver(matrix X, array[] int m){
    matrix[sum(m),cols(X)] Xout;
    int next_row = 1;
    for(i in 1:rows(X)){
      Xout[next_row:(next_row + m[i] - 1),] = rep_matrix(X[i,], m[i]);
      next_row = next_row + m[i];
    }
    return Xout;
  }

  //repeat a single row vector m times (equivalent to rep_matrix(X, m))
  matrix matrixlongver2(row_vector X, int m){
    matrix[m,cols(X)] Xout;
    for(i in 1:m){
      Xout[i,] = X;
    }
    return Xout;
  }
  //calculate longitudinal linear predictor
  vector calc_Linpred_y(vector beta_L_all,
                        matrix b2_L,
                        matrix b3_L,
                        matrix X_L,
                        matrix Z2_L,
                        matrix Z3_L,
                        int M,
                        int ql,
                        int rl,
                        array[] int m_bystud,
                        array[] int m_byind,
                        vector ones_b2_L,
                        vector ones_b3_L){
    vector[M] linpred_y1;

    //expand random effects so there is one row per observation
    matrix[M,ql] b2_L_long = matrixlongver(b2_L, m_byind);
    matrix[M,rl] b3_L_long = matrixlongver(b3_L, m_bystud);

    linpred_y1 = (X_L*beta_L_all) + ((Z2_L.*b2_L_long)*ones_b2_L) + ((Z3_L.*b3_L_long)*ones_b3_L);

    //return longitudinal linear predictor
    return linpred_y1;
  }
}
data {
  //dimensions of data
  int<lower=1> M;      //number of observations
  int<lower=1> N;      //number of individuals
  int<lower=1> K;      //number of studies
  array[K] int<lower=1> m_bystud;  //number of longitudinal measurements in each study
  array[N] int<lower=1> m_byind;  //number of longitudinal measurements for each individual
  int<lower=1> pl; //max number long fixed = pl_nontreat + (ntreat_minus1 * ntreatsets_Long_pl)
  int<lower=1> ql;     //dimension of master D matrix
  int<lower=1> rl;     //dimension of master A matrix

  //longitudinal outcome information
  vector[M] y1;         //longitudinal outcome

  //longitudinal fixed information
  matrix[M,pl] X_L;
  matrix[pl,pl] beta_L_sig;    //prior covariance matrix for fixed effects
  real<lower=0> sigma_e_sig;   //involved in prior for sigma_e - sig for longerror
  real<lower=0> sigma_e_nu;    //involved in prior for sigma_e - sig for longerror

  //longitudinal indrand information
  matrix[M,ql] Z2_L;
  real<lower=0> tau_Dl_sig;   //involved in prior for tau_Dl - sd for indrand
  real<lower=0> tau_Dl_nu;    //involved in prior for tau_Dl - sd for indrand
  real Omega_Dl_eta;          //involved in prior for Omega_Dl - corr mat for indrand

  //longitudinal studrand information
  matrix[M,rl] Z3_L;
  real<lower=0> tau_Al_sig;   //involved in prior for tau_Al - sd for studrand
  real<lower=0> tau_Al_nu;    //involved in prior for tau_Al - sd for studrand
  real Omega_Al_eta;          //involved in prior for Omega_Al - corr mat for studrand
  
  int do_likelihood;   //flag: include the likelihood contribution
  int do_prior;        //flag: include the prior contributions
  
}
transformed data{
  vector[ql] ones_b2_L = rep_vector(1,ql);  //used for matrix multiplication for long indrand
  vector[rl] ones_b3_L = rep_vector(1,rl);  //used for matrix multiplication for long studrand

  matrix[pl,pl] beta_L_sig_chol = cholesky_decompose(beta_L_sig);

}
parameters{
  //longitudinal parameters
  vector[pl] beta_L_all_unit;
  real<lower=0> sigma_e;  //longitudinal error term (sd)

  matrix[ql,N] z_b2_L;               //unit-scale longitudinal indrand effects.  First columns relate to non-treatment fixed effects, later ones are treatment related; which treatment varies by study.  Label in R afterwards
  cholesky_factor_corr[ql] Omega_Dl;    //Correlation Matrix master long indrand
  vector<lower=0>[ql] tau_Dl;           //standard deviation terms master long indrand

  matrix[rl,K] z_b3_L;                  //longitudinal studrand effects
  cholesky_factor_corr[rl] Omega_Al;   //Correlation Matrix master long studrand
  vector<lower=0>[rl] tau_Al;          //standard deviation terms master long studrand
}
transformed parameters{
  matrix[ql,ql] Dl;     //Cholesky factor of covariance matrix for long indrand
  matrix[rl,rl] Al;     //Cholesky factor of covariance matrix for long studrand
  matrix[N,ql] b2_L;    //longitudinal indrand effects.  First columns relate to non-treatment fixed effects, later ones are treatment related; which treatment varies by study.  Label in R afterwards
  matrix[K,rl] b3_L;    //longitudinal studrand effects
  vector[pl] beta_L_all = beta_L_sig_chol * beta_L_all_unit;  //fixed effects on the original scale

  Dl = diag_pre_multiply(tau_Dl, Omega_Dl);
  Al = diag_pre_multiply(tau_Al, Omega_Al);

  b2_L = transpose(diag_pre_multiply(tau_Dl, Omega_Dl) * z_b2_L);
  b3_L = transpose(diag_pre_multiply(tau_Al, Omega_Al) * z_b3_L);
}
model{

  if (do_likelihood){
    vector[M] linpred_y1 = calc_Linpred_y(beta_L_all,
                      b2_L,
                      b3_L,
                      X_L,
                      Z2_L,
                      Z3_L,
                      M, 
                      ql,
                      rl,
                      m_bystud,
                      m_byind,
                      ones_b2_L,
                      ones_b3_L);
    y1 ~ normal(linpred_y1,sigma_e);     //update longitudinal outcome
  }
  //Priors
  if (do_prior){
    //longitudinal parameter priors
    beta_L_all_unit ~ std_normal();                  //prior for longitudinal fixed effects (unit scale)
    to_vector(z_b2_L) ~ std_normal();                //unit-scale long indrand effects
    to_vector(z_b3_L) ~ std_normal();                //unit-scale long studrand effects
    sigma_e ~ student_t(sigma_e_nu, 0, sigma_e_sig); //prior for longitudinal error term

    tau_Dl ~ student_t(tau_Dl_nu, 0, tau_Dl_sig);    //prior for sd for long indrand
    Omega_Dl ~ lkj_corr_cholesky(Omega_Dl_eta);      //prior for corr matrix for long indrand

    tau_Al ~ student_t(tau_Al_nu, 0, tau_Al_sig);    //prior for sd for long studrand
    Omega_Al ~ lkj_corr_cholesky(Omega_Al_eta);      //prior for corr matrix for long studrand
  }
}

This is a lot of data for Stan.

Great! We always recommend people start as simply as possible and build up. It’s how we code our own models.

There are a bunch of things you can do to make the code a bit faster, but nothing that's going to change the order of the run time much. For example, you define Dl and Al but never use them. If you really want them in the output but aren't going to use them in the model, move their definitions to generated quantities and you won't pay the autodiff overhead for unused variables. You can also inline definitions, so it's

matrix[N, ql] b2_L = (diag_pre_multiply(tau_Dl, Omega_Dl) * z_b2_L)';

(I switched to the transpose operator.) Given that you're transposing, which is expensive for large matrices, you probably want to use the identity

(diag_pre_multiply(A, B) * Z)'
    = Z' * diag_pre_multiply(A, B)'
    = Z' * diag_post_multiply(B', A')

So if you transpose all the variables Z, B, A in their declarations, this gets simpler. It may show up elsewhere; I didn't try to track through other uses of these variables that may have other requirements.
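For example, with a hypothetical z_b2_L_t declared as matrix[N, ql] (the transpose of your z_b2_L), the level-2 effects could be computed as

  //only the small ql x ql Cholesky factor gets transposed, not the N x ql result
  matrix[N, ql] b2_L = z_b2_L_t * diag_post_multiply(Omega_Dl', tau_Dl);

since Omega_Dl is a cholesky_factor_corr and can't itself be declared transposed, transposing the small factor inline is the cheap part.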

Yes, you're correct that you don't need a Jacobian here: you never apply non-constant, non-linear functions to parameters and then put the results on the left-hand side of sampling statements.

You only need a Jacobian adjustment when you apply a function f() to a parameter and then sample the result. Here's the simplest case that needs a Jacobian adjustment, where I'm assuming a function f with derivative function df_dx.

  real y = f(x);
  target += log(abs(df_dx(x)));
  ...
  y ~ foo(theta);  // Jacobian needed
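
For a concrete instance, suppose f is log, x is a positive parameter, and mu and sigma are hypothetical hyperparameters:

  real<lower=0> x;        // parameter
  real y = log(x);        // transform
  target += -log(x);      // log|d/dx log(x)| = log(1/x) = -log(x)
  y ~ normal(mu, sigma);  // well-defined given the adjustment above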

On the other hand, if you only use the transformed variable on the right-hand side, no adjustment is needed:

  real theta = f(phi);
  ...
  y ~ bar(theta);  // no Jacobian needed

Things like this cause a lot of memory pressure from duplicated entries. Are you sure you need a matrix here? Also, if you do need it, matrixlongver2 can be replaced with the built-in rep_matrix (it duplicates rows given a row vector and duplicates columns given a column vector).

  matrix[m, cols(X)] Xout = rep_matrix(X, m);

Sorry, that's not helping with the reparameterization. I wasn't clear on what you thought needed to be reparameterized here. With the amount of data you have, the non-centered parameterization may be worse than the centered one; when there is a lot of data informing each group, the centered form often mixes better. It comes down to trial and error, I'm afraid.
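For reference, a minimal sketch of the centered version for the level-2 effects, using your names (the hypothetical b2_L_arr replaces z_b2_L; the level-3 effects would be analogous):

  parameters {
    array[N] vector[ql] b2_L_arr;  // centered: sampled on the natural scale
  }
  model {
    // random effects drawn directly from their multivariate normal prior
    b2_L_arr ~ multi_normal_cholesky(rep_vector(0, ql),
                                     diag_pre_multiply(tau_Dl, Omega_Dl));
  }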

Hi, thank you so much for the response; it is really helpful.

Al and Dl are in the model because the random effects b3 and b2 respectively follow zero-mean multivariate normal distributions with those matrices as covariance matrices. I need estimates of them in the output, but I think in this version of the file they are not used directly in model fitting (they were in previous versions). I will shift them to generated quantities if they're not used in the model.
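Something like the following sketch is what I have in mind, recovering the full covariance matrices from the Cholesky factors (since diag_pre_multiply gives the Cholesky factor L rather than the covariance L * L' itself):

  generated quantities {
    //full covariance matrices, for output only
    matrix[ql, ql] Dl = multiply_lower_tri_self_transpose(diag_pre_multiply(tau_Dl, Omega_Dl));
    matrix[rl, rl] Al = multiply_lower_tri_self_transpose(diag_pre_multiply(tau_Al, Omega_Al));
  }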

Is this size of dataset towards the extreme of what Stan can handle? Are there any examples of increasing the size of the dataset over time, by which I mean starting the model fits with a subset of the data and then updating the fits with more and more data? (I'm aware that is more a statistical question than a how-does-Stan-work question.)

I had been reading about other reparameterisation alternatives and came across the QR decomposition. Are there any recommendations on using the reparameterisation I tried here vs the QR approach?
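For reference, the QR version I had seen (roughly following the QR reparameterisation in the Stan User's Guide, applied to my X_L) would look something like:

  transformed data {
    //thin QR decomposition of the fixed-effects design matrix
    matrix[M, pl] Q_ast = qr_thin_Q(X_L) * sqrt(M - 1);
    matrix[pl, pl] R_ast_inverse = inverse(qr_thin_R(X_L) / sqrt(M - 1));
  }
  parameters {
    vector[pl] theta;  //coefficients on the Q_ast scale
  }
  transformed parameters {
    vector[pl] beta_L_all = R_ast_inverse * theta;  //back to the original scale
  }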

So this was because I needed Z^{(2)} b^{(2)} (the design matrix for the level 2 random effects multiplied by the level 2 random effects), and similarly Z^{(3)} b^{(3)}. The Z matrices have M rows, and a number of columns equal to the number of random effects at that level. So the function was to expand out the b^{(2)} terms so that they had the same dimensions as Z^{(2)}, and then perform elementwise multiplication (from what you've said, I'm assuming this is quite a slow approach). I had also coded it up using a for loop, with an index calculating each row in turn, but that did not seem faster.
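In other words, the (Z .* b_long) * ones pattern in calc_Linpred_y computes one dot product per row, which I believe could equally be written with rows_dot_product:

  //equivalent to (Z2_L .* b2_L_long) * ones_b2_L etc., one dot product per observation
  linpred_y1 = X_L * beta_L_all
               + rows_dot_product(Z2_L, b2_L_long)
               + rows_dot_product(Z3_L, b3_L_long);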