R2D2 prior and Gaussian Processes

According to the documentation for the R2D2() prior function in brms, it can be applied across a variety of classes.

However, I’m having trouble finding explanations or examples of the R2D2 prior in use with GPs, and I’m struggling a bit to understand some subtle points. I’m also unclear on the main argument. I made some example models to explore a bit:

dat <- mgcv::gamSim(1, n = 30, scale = 2)

# fit some GP models
fit1 <- brm(y ~ x1 + gp(x2), dat, chains = 2)  # default priors
fit2 <- brm(y ~ x1 + gp(x2), dat, chains = 2,  # R2D2 on the GP magnitude (sdgp) only
            prior = set_prior(R2D2(mean_R2 = 0.8, prec_R2 = 2, cons_D2 = 0.9), class = "sdgp"))
fit3 <- brm(y ~ x1 + gp(x2), dat, chains = 2,  # R2D2 on both sdgp and the class b coefficients
            prior = c(set_prior(R2D2(mean_R2 = 0.8, prec_R2 = 2, main = TRUE), class = "sdgp"),
                      set_prior(R2D2(mean_R2 = 0.8, prec_R2 = 2), class = "b")))

OK, so if I look at the prior_summary() output:

I can see the length scale for the GP doesn’t change. In the second model I see R2D2 on the sdgp term - I guess the shrinkage applies only to the sd term - do I interpret that correctly? Then in fit3 I have R2D2 priors on the class b coefficients too. Here I have questions. I set the main argument under guidance of error messages and documentation - but I don’t really grok it.

  1. Does it just mean there is a common cons_D2 parameter for all R2D2 terms on the b and sdgp class priors, set by the prior with main = TRUE?
  2. I can set different mean_R2 and prec_R2 for the b and sdgp priors, but I struggle with what this means. For example, if I set mean_R2 for the b class to 0.8 but 0.4 for the sdgp class, does that mean b is given priority in the model? In other words, am I saying I think the b class variables are twice as likely to explain variation compared to the sdgp class?
  3. The length scale isn’t given an R2D2 prior - is there a reason for this? (Side question - where do the values 1.494197 and 0.056607 come from?)

I was also confused, but running make_stancode() is enlightening. I think the decomposition happens at the level of the scales, whereas I thought it happened at the level of the coefficients.

1 Like

When R2D2 is used with smooths and hierarchical components, brms actually uses R2D2M2, as explained in Intuitive joint priors for Bayesian linear multilevel models: The R2D2M2 prior

2 Likes

Yes, I have been looking at the code already anyway, but I don’t find it that easy to follow. However, it seems that the parameter values set for the class with main = TRUE are the only ones used. So that kind of answers my first 2 questions above. I still don’t know why the length scale prior is not included in the shrinkage, or where those prior values come from.

Oh ok thanks, I didn’t know that.

What I actually want to do is use R2D2 on a GP with automatic relevance determination. So instead of one rho variable I would have a rho for each column of data in the gp predictor. Do you know how the R2D2M2 prior can be extended to cover this situation?

Not supported in brms except for additive GPs (a separate gp term for each covariate).

If you tell me more about your modelling goal, I can possibly tell whether it even makes sense to try to do it with Stan, as it has some limitations with respect to efficient GP computation.

1 Like

I’m possibly a bit out of my depth here, but the length-scale is more of a correlation parameter than a variance component. So the variance component of the GP is what’s subjected to the variance partition done by R2D2. I think the main argument has to do with the fact that the R2D2 is a joint prior over potentially lots of model components, so you can only specify the hyperparameters for that prior once. Once you add more components to the R2D2 prior, you’re just adding additional partitions to the variance decomposition, but not specifying new priors over R^2 for instance.
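To make that concrete, here is a rough sketch of the generic R2D2 construction at the level of the scales (my own paraphrase, not the exact code make_stancode() generates; the data names mean_R2, prec_R2 and cons_D2 just mirror the brms arguments):

data {
  int<lower=1> K;                      // number of population-level coefficients
  real<lower=0, upper=1> mean_R2;      // R2D2 hyperparameters (as in brms)
  real<lower=0> prec_R2;
  vector<lower=0>[K + 1] cons_D2;      // Dirichlet concentrations: K b-terms + sdgp
}
parameters {
  real<lower=0, upper=1> R2;           // proportion of explained variance
  simplex[K + 1] phi;                  // split of the signal variance across components
  vector[K] z_b;                       // standardized coefficients
  real<lower=0> sigma;                 // residual sd
}
transformed parameters {
  real tau2 = square(sigma) * R2 / (1 - R2);      // total signal variance
  vector[K] b = z_b .* sqrt(phi[1:K] * tau2);     // per-coefficient scales
  real sdgp = sqrt(phi[K + 1] * tau2);            // the GP magnitude gets one slot
}
model {
  R2 ~ beta(mean_R2 * prec_R2, (1 - mean_R2) * prec_R2);
  phi ~ dirichlet(cons_D2);
  z_b ~ std_normal();
  sigma ~ student_t(3, 0, 1);
  // ... likelihood using b and a gp(...) with magnitude sdgp omitted
}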

2 Likes

Oh yes, I know it’s not supported by brms. However, there don’t seem to be many tutorials or worked examples of R2D2 in action yet, so I was studying the brms-generated code to try to improve my understanding.

Ah thanks, that’s helpful. I’ve been working on an implementation of a model called bkmr. This model includes a GP regression with ARD applied to a distance matrix. I have a version that I’m relatively happy with and that fits the datasets I’m interested in. But when doing some prior predictive checking, setting priors is really tricky, in much the same way as setting priors for, say, a mixed effects regression model: even tight priors on individual variables don’t much restrict the range of prior-predicted values. So this set me to thinking about R2D2 priors and studying the brms code. But instead of the single sdgp variable in the example fits above, I have a vector of sdgp’s, analogous to the ARD example in the Stan documentation:

So if one applied R2D2 to this ARD model, would the multiple rhos (sdgp’s) simply extend the number of parameters included in the R2D2 vector of scales, or would it be more complicated?

Thanks. Yes, I have come to the same conclusions as you on the function of the main argument, and on reading the help file for the function again this is clear to me now.

Interesting comment on the length scale, thanks!

1 Like

Hi again @avehtari. I needed to tidy up some bits of code before sharing but below is my model.

It’s a version of BKMR, which is an MCMC model written in R for flexibly modelling the effects of mixtures of exposures. The model includes an ARD component using inverse length scales to select the most important exposures (with selection currently imposed by a simple Cauchy prior on r). The model below works, but for the binomial outcome it needs a large n to detect signal in simulated data. So I was trying to tune the priors via prior predictive checks and concluded that the priors as currently specified are not well thought out. In particular, as the (inverse) length scale priors go to zero, one would need a more tightly defined prior on lambda. So this set me to thinking about R2D2 or similar priors. However, I don’t know how to set R2D2 on a model like this, with a binomial outcome and a mix of linear regression and a GP, and also whether lambda should be included in the R2D2 scales or not (it is not included in the R2D2 scales in the brms R2D2 implementation for GPs, but I don’t know why).

functions {
    // Weighted squared-distance matrix for the exposures (used inside exp(-...)) - checked against bkmr code, matches
    matrix cov_Kzr(matrix Z, vector r) {
        // Define variables
        int n_obs = dims(Z)[1];
        int n_exp = dims(Z)[2];
        matrix[n_obs, n_exp] Zr = Z .* rep_matrix(sqrt(r'), n_obs);
        matrix[n_obs, n_obs] K = rep_matrix(0.0, n_obs, n_obs);

        // calculate K
        for (i in 1:n_obs){
            for (j in 1:n_obs){
                if (i < j){
                    for (k in 1:n_exp){
                        K[i, j] += square( Zr[i, k] - Zr[j, k] );
                    }
                    K[j, i] = K[i, j];
                }
            }
        }
        return K;
    }
}

data {
    int n_obs;
    int n_exp;
    int n_cov;
    
    array[n_obs] int y;
    matrix[n_obs, n_exp] Z;
    matrix[n_obs, n_cov] X;
    
    int prior_only;
}

transformed data {
    matrix[n_obs, n_cov] Xc;  // centered version of X without an intercept
    vector[n_cov] means_X;  // column means of X before centering
    for (i in 1:n_cov) {
        means_X[i] = mean(X[, i]);
        Xc[, i] = X[, i] - means_X[i];
    }
}

parameters {
    real<lower=0.0> sigma;
    real<lower=0.0> lambda;
    vector[n_cov] beta;
    real Intcpt;
    
    // Vector of r values for each exposure
    vector<lower=0.0>[n_exp] r;
    
    vector[n_obs] H;
}

model {
    // local params
    matrix[n_obs, n_obs] V;
    V = add_diag( lambda * exp(-cov_Kzr(Z, r)) , 1.0 ); // add_diag( , 1.0) as per bkmr code

    // Priors
    sigma ~ inv_gamma(0.001, 0.001);
    lambda ~ inv_gamma(1, 1);
    r ~ cauchy(0, 0.1);
    beta ~ normal(0, 0.5);
    Intcpt ~ normal(0, 0.5);

    // GP prior on the latent values; kept outside the prior_only guard so that
    // prior-only runs still give H a proper distribution
    H ~ multi_normal_cholesky(rep_vector(0, n_obs), sigma * cholesky_decompose(V));

    // Likelihood
    if (!prior_only) {
        y ~ bernoulli_logit(Intcpt + Xc * beta + H);
    }
}

Any suggestions/insights you might have would be appreciated. Thank you!

Length scales are known to be weakly informed by data even with a normal observation model, and the Bernoulli is much less informative still, so it’s likely that the posteriors for the length scales are wide (see, e.g., BNPNIPS_2016_paper_23.pdf and references therein).

The exponentiated quadratic covariance function assumes full interaction between covariates, which makes the prior so flexible that it further reduces what can be learned from a finite number of observations. If you don’t need interactions, or assume e.g. pairwise interactions are sufficient, you could use an additive model (without interactions or up to the desired order of interactions).

We have some papers illustrating the weakness of ARD for model selection and better approaches for GP variable selection

Last time I replied quickly between two vacation weeks and made a mistake about R2D2 and lengthscales, which @mhollanders pointed out. R2D2 is not applicable to lengthscales. You could use some other prior with a lot of mass near 0 and a thick tail for the inverse length scales; a regularized horseshoe prior might be applicable (with the slab part regularizing so that the inverse length scales don’t get too big). If you can assume additivity, then you can use a Dirichlet (as inside R2D2) for the magnitudes of the additive components.
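For example, a regularized horseshoe on the inverse length scales could look roughly like this (just a sketch; the parameter names and the scales of the hyperpriors are placeholders that would need tuning for your problem):

data {
  int<lower=1> n_exp;                  // number of exposures
}
parameters {
  vector<lower=0>[n_exp] z_r;          // half-normal raw values
  vector<lower=0>[n_exp] hs_local;     // local shrinkage scales
  real<lower=0> hs_global;             // global shrinkage scale
  real<lower=0> hs_slab;               // slab scale, caps how large r can get
}
transformed parameters {
  // regularized (soft-truncated) local scales, as in the regularized horseshoe
  vector<lower=0>[n_exp] lambda_tilde
    = sqrt(square(hs_slab) * square(hs_local)
           ./ (square(hs_slab) + square(hs_global) * square(hs_local)));
  vector<lower=0>[n_exp] r = z_r .* lambda_tilde * hs_global;   // inverse length scales
}
model {
  z_r ~ std_normal();                  // half-normal because of the lower bound
  hs_local ~ student_t(1, 0, 1);       // half-Cauchy
  hs_global ~ student_t(1, 0, 0.1);    // half-Cauchy with a small scale
  hs_slab ~ student_t(4, 0, 1);        // regularizes the largest inverse length scales
}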

Centering GP on top of linear model is often a good choice if there is some trend in the data, and coefficients of that linear model could use R2D2 prior.

When you have different types of components, you can look at the R2D2M2 prior paper for how the contribution is divided hierarchically for different components. I’m not aware of paper providing a good solution for having a joint prior for length scales and magnitudes of different components.

1 Like

Thanks @avehtari, that’s a really useful answer!!

Thanks for this paper, I had not seen it before. I have a question about the following section:

I can’t quite follow here - what is the interval [0,T] - are these values of x? What if x does not start at 0? And what is u? (I’m guessing a realised value of f(x) for given l and alpha?) I ask because I have a strong prior opinion on, say, the number of zero crossings for a given exposure; I could set a prior on E[Cu] if I understand the fine details.

Yeah the bkmr algorithm was developed to allow for any possible relationship between variables, specifically including interactions, because this can happen with exposure mixtures.

Thank you for the ARD paper links I will check them out. And thanks for R2D2 horseshoe and dirichlet suggestions. I have explored these a bit to no great advantage but may revisit.

Just for clarity here, when you say centering the GP, do you mean centering the data fed to the kernel, i.e. centering the Z matrix in my code above? Or did you just mean centering the linear covariates?
I will look again too at the R2D2M2 paper.

Thanks again for all tips!

Yes.

As the exponentiated quadratic covariance function is stationary, only the length of the interval matters here.

The level which f may cross (you can set it to 0 to get expected number of zero-crossings)
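(For a stationary GP with exponentiated quadratic covariance, magnitude alpha and lengthscale l, Rice’s formula gives the expected number of u-crossings on an interval of length T as E[Cu] = T / (2*pi*l) * exp(-u^2 / (2*alpha^2)) - if I’m remembering it right, please double-check against the paper.)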

If the model is additive, f = f_1 + f_2, where f_1 is a linear model and f_2 is a GP, it is common to say that the GP is centered on the linear model. If the covariance matrix computation works well for you, you could integrate out the linear model coefficients and combine the resulting GP (which has a dot-product covariance function) with your GP with the exponentiated quadratic covariance function (see, e.g., Gaussian Processes for Machine Learning: Book webpage, chapters 2, 4, and 5).
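A sketch of what that combined covariance might look like in Stan (assuming the linear coefficients have a normal(0, sigma_b) prior; the function and argument names here are just placeholders):

functions {
  // Sketch: covariance of f = X * beta + g, where beta ~ normal(0, sigma_b)
  // has been integrated out (dot-product kernel) and g is a GP with an
  // exponentiated quadratic covariance; z holds the GP inputs.
  matrix lin_plus_gp_cov(matrix X, array[] vector z,
                         real sigma_b, real alpha, real rho, real jitter) {
    int n = rows(X);
    matrix[n, n] K = square(sigma_b) * (X * X')        // integrated-out linear part
                     + gp_exp_quad_cov(z, alpha, rho); // exponentiated quadratic part
    return add_diag(K, jitter);                        // small jitter for stability
  }
}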

However, due to how Stan autodiff currently works, the sampling speed may drop with this approach, and the theoretically slower approach of explicitly representing the linear model coefficients may actually be faster.

If you also have a monotonicity assumption for some predictors, that would be very strong additional information (but not trivial to include if you want to allow all possible interactions).

Allowing all possible higher order interactions means the amount of data needed to learn anything beyond a linear model grows exponentially with respect to the number of dimensions. Even if you know higher order interactions are possible, you may want to test a model with pairwise or triplewise interactions only.

1 Like

Thanks @avehtari, those are really useful answers, some things for me to digest there.

I can’t quite assume monotonicity, as U- or J-shaped curves are sometimes observed as relationships between exposures and outcomes, but anything more complicated than a U or J seems biologically implausible. I was thinking perhaps I can say that crossing 0 more than 3 times is not plausible.

Interesting. How would I limit the GP to pairwise or triplewise interactions only, or is this moving away from the GP?

Additive Gaussian Processes Revisited (just replace the exponentiated quadratic with the kernel they define and ignore how they do the inference; you may also need to read the older Duvenaud et al. paper to better understand this).
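A rough sketch of what a second-order additive kernel can look like, using the Newton-Girard identity as in Duvenaud et al. (my own summary, not code from the paper; the per-dimension base kernels K_dim are assumed to be computed elsewhere):

functions {
  // Sketch: additive kernel with main effects (order 1) and pairwise
  // interactions (order 2), built from per-dimension base kernels K_dim[d].
  matrix additive_cov_order2(array[] matrix K_dim, real sigma1, real sigma2) {
    int D = size(K_dim);
    int n = rows(K_dim[1]);
    matrix[n, n] e1 = rep_matrix(0.0, n, n);   // 1st symmetric polynomial: sum_d k_d
    matrix[n, n] p2 = rep_matrix(0.0, n, n);   // power sum: sum_d k_d .* k_d
    matrix[n, n] e2;
    for (d in 1:D) {
      e1 += K_dim[d];
      p2 += K_dim[d] .* K_dim[d];
    }
    e2 = (e1 .* e1 - p2) / 2;                  // 2nd symmetric polynomial (pairwise terms)
    return square(sigma1) * e1 + square(sigma2) * e2;
  }
}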

1 Like

@avehtari - so I’ve been thinking about this a lot and I realise it has important implications for my model/application. In data for my typical use case, one might find different exposures measured on different scales. For example, I’m working with serum metal concentrations where one metal might have a range entirely under 1 ng/ml while others are firmly in the μg/ml range. If I fit an ARD-style model on these variables with the same lengthscale prior for each column, this would imply different numbers of expected zero crossings for different columns of the exposures matrix I’m fitting with the GP. Therefore, I think I should possibly either 1) scale the entire matrix first or 2) set a different length scale prior for each column of the matrix depending on the range of values and the equation above. Or have I missed something here, do you think?

Yes, you need to do one of these. I have usually scaled the data.
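For example, something like this in transformed data (or just scale Z in R before passing it in); Zs is a placeholder name:

transformed data {
  // standardize each exposure column so that a common lengthscale prior
  // means roughly the same thing for every exposure
  matrix[n_obs, n_exp] Zs;
  for (k in 1:n_exp) {
    Zs[, k] = (Z[, k] - mean(Z[, k])) / sd(Z[, k]);
  }
}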

Also, remember that the lengthscales are weakly informed by the data, which on the other hand means that changing the prior on a lengthscale may have only a small effect on the posterior of the functions. Thus, it is possible that even thinking a lot about the lengthscale priors does not improve inference on the functions, while on the other hand the posterior of the lengthscales can be close to your chosen priors.

2 Likes

Hi @avehtari. Just to come back to this point. If this is my likelihood:

        y ~ bernoulli_logit(Intcpt + Xc * beta + H);

where H is the GP component. Do you mean simply to apply R2D2 to the beta coefficients and omit H from the R2D2 prior scales? I had been playing with this before making this thread and it worked well enough, but I wasn’t sure it was something I could/should do. Does it change the meaning/interpretation of the R2D2 prior? Would it perhaps be a partial R2D2?

It would be good to use a joint prior, so that you would have a hierarchical prior which would first divide the total signal variance between the linear model and the GP model, and then divide the variance of the linear model between the beta parameters. This is related to the R2D2M2 prior. It is also possible that a simpler approach with independent priors for beta and for the magnitude of the GP would work sufficiently well.
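A rough sketch of that kind of hierarchical split (my notation, assuming standardized covariates; tau, psi and phi are placeholder names):

parameters {
  real<lower=0> tau;            // total signal sd
  simplex[2] psi;               // split between the linear part and the GP part
  simplex[n_cov] phi;           // split of the linear variance among the betas
  vector[n_cov] z_beta;
}
transformed parameters {
  real<lower=0> sd_gp = tau * sqrt(psi[2]);                    // GP magnitude, e.g. sqrt(lambda)
  vector[n_cov] beta  = z_beta .* (tau * sqrt(psi[1] * phi));  // per-coefficient scales
}
model {
  tau ~ student_t(3, 0, 1);
  psi ~ dirichlet(rep_vector(1.0, 2));
  phi ~ dirichlet(rep_vector(1.0, n_cov));
  z_beta ~ std_normal();
}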

The R2D2 prior is a joint prior on the residual variance and the signal variances. If you don’t include all the signal variances, it is not a prior for R2 anymore, and thus the meaning and interpretation change. Without the R2 interpretation, it would just be a global-local shrinkage prior (Dirichlet something…).

2 Likes

Ah OK, this is super helpful. I had not thought about using a hierarchical approach like that, but I’ll explore it (might take me a while 😅).

One other thing I’m wondering that’s related - I’m a bit confused about whether you can apply R2D2(M2) when the outcome variable is not Gaussian. For example, from my code above I have a Bernoulli outcome variable:

    // Likelihood
    if (!prior_only) {
        H ~ multi_normal_cholesky(rep_vector(0, n_obs), sigma * cholesky_decompose(V));
        y ~ bernoulli_logit(Intcpt + Xc * beta + H);
    }

Can you shed any light on this topic?

In the case of non-normal observation models, you can use a pseudo-variance in place of the variance. Section 3.5 in Sparsity information and regularization in the horseshoe and other shrinkage priors discusses the pseudo-variance in the context of the regularized horseshoe prior, but the idea is the same for R2D2. Based on our experiments, pseudo-variance R2D2 seems to work fine, and we’re working on a paper with more details (and brms has an open issue suggesting adding support for the pseudo-variance).
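For example, for the Bernoulli-logit model the plug-in pseudo-variance is 1 / (mu * (1 - mu)); with the plug-in value mu = 0.5 this gives a pseudo standard deviation of 2, which then plays the role of sigma in the R2D2 construction:

transformed data {
  // plug-in pseudo-variance for a Bernoulli-logit observation model
  // (Piironen & Vehtari 2017, Sec. 3.5)
  real mu_tilde = 0.5;
  real sigma_pseudo = 1 / sqrt(mu_tilde * (1 - mu_tilde));  // = 2
}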

2 Likes