R2D2M2 prior for non-GLMMs

Hi,

I’ve been intrigued by the R2D2M2 paper and am interested in applying it to some ecological models. I’m currently working on a dynamic occupancy model with sites nested in regions, where each site has random region-level and site-level contributions to the parameters of the dynamic occupancy model, namely the colonisation, emigration, and detection rates.

I’m sure the answer is “it depends”, but would I be able to specify the R2D2M2 prior for this model on all of the region- and site-level random effects, or would it be more complicated for a model like this?

Thanks,

Matt

Update: I think I was able to apply the R2D2 formula to the region-level parameters of this hidden Markov model, which is parameterised with an overall occupancy probability (psi), colonisation and emigration rates (gamma and epsilon), and call rates (mu, for the detection model). The region-level contributions are drawn from a multivariate normal. Below, in the transformed parameters block, I first show the standard non-centered parameterisation of multivariate random effects. Then I show what I believe is the correct formula for R2D2.

parameters {
  real<lower=0, upper=1> psi_a;
  real<lower=0> gamma_a, epsilon_a, mu_a;
  vector<lower=0>[4] theta_t;
  cholesky_factor_corr[4] theta_O_L;
  matrix[4, n_region] theta_z;
  real<lower=0, upper=1> r2d2_r2;
  simplex[4 * n_region] r2d2_phi;
}

transformed parameters {
  real r2d2_tau2 = r2d2_r2 / (1 - r2d2_r2);
  matrix[n_region, 4] theta;  // region-level deviations on the link scale
  vector[n_region] psi_r;
  vector[n_region] gamma_r;
  vector[n_region] epsilon_r;
  vector[n_region] mu_r;
  if (!r2d2) {  // r2d2 is an int<lower=0, upper=1> switch declared in data
    // standard non-centered parameterisation of the multivariate random effects
    theta = (diag_pre_multiply(theta_t, theta_O_L) * theta_z)';
  } else {
    // R2D2 scaling of the random effects
    theta = (diag_pre_multiply(theta_t, theta_O_L) * theta_z
             .* sqrt(r2d2_tau2 * to_matrix(r2d2_phi, 4, n_region)))';
  }
  psi_r = inv_logit(logit(psi_a) + theta[:, 1]);
  gamma_r = exp(log(gamma_a) + theta[:, 2]);
  epsilon_r = exp(log(epsilon_a) + theta[:, 3]);
  mu_r = exp(log(mu_a) + theta[:, 4]);
}
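
For completeness, the model block would then just need priors on the R2D2 quantities and the remaining non-centred pieces, along these lines (a sketch only; the hyperparameter values are illustrative and the rest of the model is omitted):

model {
  // R2D2 hyperpriors (illustrative concentrations)
  target += beta_lpdf(r2d2_r2 | 1, 1);
  target += dirichlet_lpdf(r2d2_phi | rep_vector(1, 4 * n_region));
  // non-centred z-scores and the correlation Cholesky factor
  target += std_normal_lpdf(to_vector(theta_z));
  target += lkj_corr_cholesky_lpdf(theta_O_L | 2);
  // ... priors on psi_a, gamma_a, epsilon_a, mu_a, theta_t and the
  // dynamic occupancy likelihood would follow here
}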

I suppose my question is, does that make any sense when there’s no residual variance? Looking at the formula, it seems like you’re somehow scaling the total variance by the parameter-level contributions to that variance, so maybe you don’t need the residual? Anyway, I think I’m starting to understand it better but could probably use some guidance.

Thanks!

For the Gaussian family, R2D2(M2) scales with sigma. In the case of a non-Gaussian exponential family, the regularized horseshoe paper, Sparsity information and regularization in the horseshoe and other shrinkage priors, recommends scaling with a pseudovariance, and the same approach would make sense for R2D2(M2), too. It is possible that there are other ways for R2-type priors, and we’re investigating this, but at this point the pseudovariance approach is what I would use.
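
For instance, with a Bernoulli-logit observation model the Gaussian approximation to the likelihood gives a pseudo-variance of 1 / (mu_hat * (1 - mu_hat)), which at a reference probability of 0.5 yields pseudo-sigma = 2. A minimal sketch of how that could slot into the R2D2 scale (variable names are just illustrative):

data {
  int<lower=1> N;
  int<lower=1> X;
  matrix[N, X] x;                    // standardised covariates
  array[N] int<lower=0, upper=1> y;  // binary outcome
}

transformed data {
  // pseudo-variance for a Bernoulli-logit likelihood, evaluated at a
  // reference success probability of 0.5:
  // sigma_tilde^2 = 1 / (0.5 * (1 - 0.5)) = 4, so sigma_tilde = 2
  real sigma_tilde = 2;
}

parameters {
  real alpha;                     // intercept
  real<lower=0, upper=1> R_sq;    // R-squared
  simplex[X] phi;                 // partitions
  vector[X] z;                    // coefficient z-scores
}

model {
  real tau_sq = R_sq / (1 - R_sq);
  // pseudo-sigma takes the place of the residual SD in the R2D2 scale
  vector[X] beta = z .* sqrt(phi * square(sigma_tilde) * tau_sq);

  target += bernoulli_logit_lpmf(y | alpha + x * beta);

  target += normal_lpdf(alpha | 0, 10);
  target += beta_lpdf(R_sq | 1, 1);
  target += dirichlet_lpdf(phi | rep_vector(1, X));
  target += std_normal_lpdf(z);
}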

OK, thank you! Just checking, instead of dividing by the SDs of the covariates in the typical R2D2(M2) approach, if the “effects” are non-centered random effects, would the estimated SDs of the random effects take the place of the SDs of the covariates?

Oops, sorry, I read your question wrong. I was replying about the prior for a non-Gaussian model and just about the change due to a different data model. Maybe the R2D2M2 authors @javier.aguilar and @paul.buerkner can comment on the applicability to your hierarchical model.

Hello Matt

I am glad that you have found our work interesting. We are currently trying to develop an extension of this prior to other families and models.

Looking at the formula, it seems like you’re somehow scaling the total variance by the parameter-level contributions to that variance, so maybe you don’t need the residual?

The total variance is the sum of the individual variances that correspond to the overall and varying coefficients. The paper proposes to scale the variance of each coefficient by the residual variance so that we can factor it out and have a one-to-one relationship between R2 and the total variance.
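
In symbols, for standardised predictors the construction is roughly:

$$
\beta_j \mid \sigma^2, \tau^2, \phi_j \sim \mathcal{N}\!\left(0,\; \sigma^2 \tau^2 \phi_j\right), \qquad \sum_j \phi_j = 1,
$$

$$
R^2 = \frac{\sigma^2 \tau^2}{\sigma^2 \tau^2 + \sigma^2} = \frac{\tau^2}{1 + \tau^2}
\quad \Longleftrightarrow \quad
\tau^2 = \frac{R^2}{1 - R^2}.
$$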

I am not familiar with the specifics of your model, but if it allows a decomposition of the linear predictor in the same way as we present in the paper, then I would not see an issue with applying the prior.

The R2D2M2 prior is available within brms so you could also try to run your model with it. See here.

Just checking, instead of dividing by the SDs of the covariates in the typical R2D2(M2) approach, if the “effects” are non-centered random effects, would the estimated SDs of the random effects take the place of the SDs of the covariates?

If I understand your question correctly, you are wondering whether it is necessary to scale by the standard deviations of your covariates, as we suggest in the paper? I don’t believe you could scale by the SDs of the random effects, since the R2D2(M2) prior places a prior on them and is trying to perform inference on them. In the worst case you could just ignore the scaling. This could change the performance of the prior, though.
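
Roughly, with $s_j$ denoting the SD of covariate $j$, the point of that scaling is to keep the explained-variance decomposition intact for unstandardised predictors:

$$
\beta_j \sim \mathcal{N}\!\left(0,\; \frac{\sigma^2 \tau^2 \phi_j}{s_j^2}\right)
\quad \Rightarrow \quad
\operatorname{Var}(x_j \beta_j) = s_j^2 \cdot \frac{\sigma^2 \tau^2 \phi_j}{s_j^2} = \sigma^2 \tau^2 \phi_j .
$$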

If you would like to have a further discussion via a video call or email, feel free to send me a message. I would be interested in making it work.

Hi Javier,

Thanks for getting back to me.

I think I had a few fundamental misunderstandings about this prior and paper. Because of the “M2” part, I thought it referred to hierarchical random effects, which I’m now not sure about. So I thought that instead of dividing by the SDs of observed covariates, you could divide by the SDs of the estimated random effects. I will definitely keep R2D2 in mind for models with many coefficients, but I have moved on to a different prior for this project.

Thanks!

Hey @javier.aguilar, sorry to rehash this old topic, but I thought it would be preferable to starting a new one.

Has the R2D2 prior been extended to multivariate responses yet?

Edit: I wanted to add how far I got myself. If this is a simple example of one-dimensional normal regression with R2D2, where the covariates were standardised beforehand (note: I haven’t run this model, just wrote it out in Stan code):

data {
  int N, X;  // number of observations and covariates
  matrix[N, X] x;  // covariates
  vector[N] y;  // data
}

parameters {
  real alpha;  // intercept
  real<lower=0> sigma;  // residual SD
  real<lower=0, upper=1> R_sq;  // R-squared
  simplex[X] phi;  // partitions
  vector[X] z;  // coefficient z-scores
}

model {
  // R2D2
  real tau_sq = R_sq / (1 - R_sq);
  vector[X] beta = z .* sqrt(phi * square(sigma) * tau_sq);
  
  // linear predictor
  vector[N] mu = alpha + x * beta;
  
  // likelihood
  target += normal_lpdf(y | mu, sigma);
  
  // priors
  target += normal_lpdf(alpha | 0, 10);
  target += exponential_lpdf(sigma | 1);
  target += beta_lpdf(R_sq | 1, 1);
  target += dirichlet_lpdf(phi | rep_vector(1, X));
  target += std_normal_lpdf(z);
}

Then I thought it would generalise to the multivariate case like this, where in the model block I only show the computations for the R2D2 part. This involves matrix-multiplying the partitions by the covariance matrix and then by tau_sq.

data {
  int N, D, X;  // number of observations, dimensions, and covariates
  matrix[N, X] x;  // covariates
  array[N] vector[D] y;  // data
}

parameters {
  vector[D] alpha;  // intercepts
  vector<lower=0>[D] sigma;  // residual SDs
  cholesky_factor_corr[D] Omega_L;  // Cholesky factor of correlation matrix
  real<lower=0, upper=1> R_sq;  // R-squared
  matrix[X, D] phi;    // partitions
  matrix[X, D] z;  // coefficient z-scores
}

model {
  // R2D2
  real tau_sq = R_sq / (1 - R_sq);
  matrix[D, D] Sigma_L = diag_pre_multiply(sigma, Omega_L);  // Cholesky factor of covariance matrix
  matrix[X, D] beta = z .* cholesky_decompose(phi * multiply_lower_tri_self_transpose(Sigma_L) * tau_sq);
}

I would love to hear your thoughts.

Hello! Sorry for the late reply.

The first block that you have is the R2D2 prior for the univariate case, if I understand correctly. The R2D2 prior is implemented within brms in case you would like to try it out; see here. Your implementation seems correct.

We have discussed before with @paul.buerkner how to approach the multivariate case (multiple responses?); however, I believe we need to think about a proper R2 measure in this case. The univariate R2D2 prior is built by placing a prior over the population R2 and then distributing the uncertainty; however, in this case I would not know which definition of R2 we are using. For instance, we could ask the following: should we partition tau over the responses (and perhaps express some dependency)? Should the variance partitions be considered only within each response variable, or should we also allow for dependencies across the different responses?

The idea sounds interesting and I would be happy to discuss more about it! I am also curious if this would offer any benefits over standard shrinkage priors.

Hey Javier, thanks for checking in.

I suppose I was just trying to generalise to the multivariate case, where the analogue of sqrt is cholesky_decompose and the analogue of the residual SD sigma is the residual Cholesky factor of the covariance matrix, Sigma_L.

So the univariate case:

z .* sqrt(phi * square(sigma) * tau_sq);

Should become:

z .* cholesky_decompose(phi * multiply_lower_tri_self_transpose(Sigma_L) * tau_sq);

And indeed the remaining quantity to determine is tau_sq. My intuition was that dealing with a multivariate response doesn’t mean the proportion of explained (co)variance can’t still be a scalar. And because the covariance matrix is included in the R2D2 prior, you’re already sort of dealing with the dependencies there. But I’m working off intuition and inductive reasoning here, so I’d love to get some other eyes on it.

Very interesting discussion. @javier.aguilar, I was wondering, for the multivariate case, would you envisage a separate R2 for each of the outcomes, or would there be some kind of combined R2?

I recently did some work with multivariate longitudinal mixed models, and setting priors was a challenge. If I were taking on such a model again, I’d be very interested in using R2D2M2 priors, though I have no clue how I would achieve that at present 😅 (I’d probably also be interested in extending to copula models, but that seems like a bigger discussion.)