Ah great, thanks @avehtari - indeed I got the impression it was an evolving topic, but thanks for the pointers in the right direction! I look forward to seeing the paper you're working on!
Hey @avehtari, would the idea be to use the intercept for the pseudo-variance? E.g. if you have a Bernoulli model with the back-transformed intercept alpha on the probability scale, pseudo_var = pow(alpha * (1 - alpha), -1)?
If your alpha includes all components affecting the probability, then yes (it's not clear what you mean by intercept). For example, in @jroon's code there was
bernoulli_logit(Intcpt + Xc * beta + H)
so all of these should be included in the back transformation.
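For concreteness, a rough sketch using the variable names from that code (N, mu, and pseudo_var are my placeholders; how you reduce this to a single representative value, e.g. by averaging, is a separate choice):

  // sketch: per-observation pseudo-variances for the Bernoulli-logit model,
  // with the full linear predictor back-transformed to the probability scale
  vector[N] mu = inv_logit(Intcpt + Xc * beta + H);
  vector[N] pseudo_var = inv(mu .* (1 - mu));  // 1 / (mu * (1 - mu))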
I wouldn't have thought that would work, because the scales for the R2D2 coefficients and random effects are constructed using the pseudo-variance, but if you use the entire linear predictor to determine the pseudo-variance, doesn't that create a circular graph?
Hi @avehtari. I finally found some more time to play with this. I was hoping you could clarify something for me, however. Suppose I use brms to fit a model with an R2D2 prior on a GP and then look at the code:
library(brms)
dat <- mgcv::gamSim(1, n = 30, scale = 2)
# fit GP model
fit1 <- brm(y ~ x1 + gp(x2), dat, chains = 2,
            prior = set_prior(R2D2(mean_R2 = 0.8, prec_R2 = 2, cons_D2 = 0.9), class = "sdgp"))
stancode(fit1)
Relevant code for R2D2/my question:
transformed parameters {
  vector<lower=0>[Kgp_1] sdgp_1;  // GP standard deviation parameters
  real R2D2_tau2;  // global R2D2 scale parameter
  vector<lower=0>[Kscales] scales;  // local R2D2 scale parameters
  real lprior = 0;  // prior contributions to the log posterior
  // compute R2D2 scale parameters
  R2D2_tau2 = sigma^2 * R2D2_R2 / (1 - R2D2_R2);
  scales = scales_R2D2(R2D2_phi, R2D2_tau2);
  sdgp_1 = scales[(1):(Kgp_1)];
  lprior += student_t_lpdf(Intercept | 3, 7.5, 4.8);
  lprior += inv_gamma_lpdf(lscale_1[1][1] | 1.494197, 0.056607);
  lprior += beta_lpdf(R2D2_R2 | R2D2_mean_R2 * R2D2_prec_R2, (1 - R2D2_mean_R2) * R2D2_prec_R2);
  lprior += student_t_lpdf(sigma | 3, 0, 4.8)
            - 1 * student_t_lccdf(0 | 3, 0, 4.8);
}
…and the call to the GP function elsewhere…
cov = gp_exp_quad_cov(x, sdgp, lscale[1]);
My question is - since this is a function for covariance, why is the R2D2 prior set only on the magnitude sdgp term of the GP, and not also on the lscale parameter? Would it be wrong to set the R2D2 prior on the lscale instead? Or include both terms in the R2D2 prior? I ask mainly because in my model I want to set it on the equivalent of the lscale (or both).
The R2D2 prior was originally derived as a prior for normal linear model coefficients, so that each coefficient has its own prior and the scales of these priors have a joint prior. In the case of a GP we don't have the coefficients, but we can still use the R2D2 approach to create a joint prior for a collection of scales (magnitudes). In the code you attached, sdgp_1 = scales[(1):(Kgp_1)]; is doing the part of assigning sdgp_1 to be one of the R2D2 local scales. In your example, you are setting the prior only on a single scale (magnitude) parameter, so you don't get much benefit from that. You can set up R2D2 in brms so that it makes a joint prior, and in your case you could use the joint R2D2 also for the x1 coefficient.

R2D2 uses a Dirichlet (the D in R2D2) to share the signal variance between different local scales. The GP lengthscale affects the prior variance, but it is not directly connected, and thus using the R2D2 prior jointly for scales (magnitudes) and lengthscales does not make sense. I don't remember seeing any proposal for a joint prior for the magnitude and several lengthscales. In brms you could set a separate prior for the lengthscales, but R2D2 would not be theoretically justified. I don't have a recommendation for an R2D2-style joint prior for lengthscales, but you could use some thick-tailed distribution with a lot of mass near zero for the inverse lengthscale to favour long lengthscales (no non-linearity), as in the sketch below.
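A minimal sketch of that last idea (parameter names are placeholders, and the half-Cauchy is just one thick-tailed choice that fits the description):

  parameters {
    real<lower=0> inv_lscale;  // inverse lengthscale, 1 / lscale
  }
  transformed parameters {
    real<lower=0> lscale = inv(inv_lscale);
  }
  model {
    // thick-tailed prior with a lot of mass near zero on the inverse
    // lengthscale: small inv_lscale means a long lengthscale, i.e. a
    // nearly linear (or constant) function
    inv_lscale ~ cauchy(0, 1);
  }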
Ah ok, I revised the math and I believe I get it now. For the prior variance at a given point along x, the (x_i - x_j) component reduces to 0, so the prior variance is just the magnitude parameter (squared), right?
But what I find myself wondering is this: I agree the lengthscale has little importance for the prior variance. But thinking about multiple lengthscales, and in particular the bkmr model with inverted lengthscales, is that still true? My motivation for wanting to use R2D2 on the lengthscales was variable selection, as that is an aim. So take the example above (R2D2 prior and Gaussian Processes - #9 by jroon), where I'm using a Cauchy for selection on r, which enters the covariance function as:
matrix[n_obs, n_exp] Zr = Z .* rep_matrix(sqrt(r'), n_obs);
matrix[n_obs, n_obs] K = rep_matrix(0.0, n_obs, n_obs);
// calculate K
for (i in 1:n_obs) {
  for (j in 1:n_obs) {
    if (i < j) {
      for (k in 1:n_exp) {
        K[i, j] += square(Zr[i, k] - Zr[j, k]);
      }
      K[j, i] = K[i, j];
    }
  }
}
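(Roughly speaking, K then enters the covariance along the lines below - sigma_h and eta are placeholders here, not my exact code - which also shows that the diagonal of exp(-K) is always 1 regardless of r:)

  // sketch: the weighted squared-distance matrix K becomes a covariance
  matrix[n_obs, n_obs] Sigma = square(sigma_h) * exp(-K)
                               + diag_matrix(rep_vector(1e-8, n_obs));
  vector[n_obs] H = cholesky_decompose(Sigma) * eta;  // non-centred GP draw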
So if a column/dimension of Z is "deselected", then the corresponding r[i] would go to zero. I guess r still can't really affect the prior variance. But in my experiments a model like that seemed to produce the best prior predictive checks, and I don't know why (possibly errors, I need to recheck it).
Oh, why didn't you say that in the first place? You should not mix prior information with the decision task of variable selection. The prior should be chosen based on prior information, and then decision analysis should be used to make the variable selection, which is much easier and more efficient. For GPs specifically, see Variable selection for Gaussian processes via sensitivity analysis of the posterior predictive distribution for why looking at the lengthscale is bad for variable selection. Other relevant papers for variable selection for GPs:
The variable selection method in the referenced paper - I followed the link to the Python code, but I'm not that strong in Python. Is this decision analysis done after fitting the model, i.e. is it something you apply to a fitted model object, for example? Has it been implemented in Stan?
EDIT: OK, I understand the Python code a bit better now; I see the variable selection is done after fitting the model. Is there a version in R, do you know?
"Fitting the model" is Bayesian inference producing draws from the posterior. Variable selection is making decisions given the posterior and a utility/loss.
That paper specifically had an illustration of why you should not try to do variable selection by looking at the lengthscales. You can use the methods from the other papers, too. Unfortunately, we don't have R code for these. Stan is not very good for GPs with many lengthscales, so there hasn't been a need to make an illustration with Stan. Of the four approaches I listed, feature collapsing is probably the easiest to implement: implement sampling from the predictive distribution in the generated quantities block, run the standalone generated quantities with different features collapsed, and measure the distances to the original predictive distribution. Unfortunately I don't have time to make you a full code illustration, but a bare skeleton of the generated quantities idea is sketched below, and if you try to fill it in for your model, I can continue answering questions.
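Here is that skeleton, using a plain GP regression with a Gaussian likelihood just to keep it short. All names are placeholders and I have not run this, so treat it only as an outline of the steps:

  // Skeleton of feature collapsing: generated quantities draws from the
  // predictive distribution at inputs X_eval. Run once with X_eval equal to
  // the original inputs, and once per feature with that column of X_eval
  // replaced by e.g. its mean, then measure the distance between each
  // collapsed predictive distribution and the original one (outside Stan).
  data {
    int<lower=1> N;
    int<lower=1> D;
    array[N] vector[D] X;        // training inputs
    vector[N] y;
    array[N] vector[D] X_eval;   // original or feature-collapsed inputs
  }
  parameters {
    real<lower=0> sdgp;          // GP magnitude
    vector<lower=0>[D] lscale;   // one lengthscale per input dimension
    real<lower=0> sigma;         // residual standard deviation
  }
  model {
    matrix[N, N] K = gp_exp_quad_cov(X, sdgp, lscale)
                     + diag_matrix(rep_vector(square(sigma), N));
    // placeholder priors
    sdgp ~ normal(0, 1);
    lscale ~ inv_gamma(5, 5);
    sigma ~ normal(0, 1);
    y ~ multi_normal(rep_vector(0, N), K);
  }
  generated quantities {
    vector[N] y_rep;             // predictive draw at X_eval
    {
      matrix[N, N] K = gp_exp_quad_cov(X, sdgp, lscale)
                       + diag_matrix(rep_vector(square(sigma), N));
      matrix[N, N] K_cross = gp_exp_quad_cov(X, X_eval, sdgp, lscale);
      matrix[N, N] K_eval = gp_exp_quad_cov(X_eval, sdgp, lscale);
      matrix[N, N] L = cholesky_decompose(K);
      vector[N] K_inv_y = mdivide_right_tri_low(mdivide_left_tri_low(L, y)', L)';
      vector[N] mu = K_cross' * K_inv_y;
      matrix[N, N] v = mdivide_left_tri_low(L, K_cross);
      matrix[N, N] cov = K_eval - v' * v
                         + diag_matrix(rep_vector(square(sigma) + 1e-8, N));
      y_rep = multi_normal_rng(mu, cov);
    }
  }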
Hi both. Just looking at this topic some more. It's not totally clear to me where in the code to make this transformation. The following code is generated by brms for a gaussian outcome, from the transformed parameters block:
// compute R2D2 scale parameters
R2D2_tau2 = sigma^2 * R2D2_R2 / (1 - R2D2_R2);
Am I correct that the transformation should be used here instead of sigma^2?
Edit: So in my case I would do this (as these variables are vectors, I took the mean over them):
// back-transform the (averaged) linear predictor to the probability scale
// before computing the pseudo-variance 1 / (mu * (1 - mu))
R2D2_tau2 = pow(inv_logit(mean(Intcpt + Xc * beta + H))
                * (1 - inv_logit(mean(Intcpt + Xc * beta + H))), -1)
            * R2D2_R2 / (1 - R2D2_R2);
Aki, I was also wondering if there is a pseudo-variance transformation for the Weibull family, do you know? Thanks
Yes, replace the variance (sigma^2) with the pseudo-variance (and I would not call it a transformation; every time you wrote "transformation" you should have written "pseudo-variance").
Sure, you can compute the pseudo-variance using the equations given, for example, in Appendix B of Sparsity information and regularization in the horseshoe and other shrinkage priors.
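Roughly, the construction (as I recall it) takes a local Gaussian approximation to the non-Gaussian observation model, so the pseudo-variance is the inverse of the negative second derivative of the log-likelihood with respect to the linear predictor, evaluated at some representative value:

$$\tilde{\sigma}^2 = \left( -\frac{\partial^2}{\partial \eta^2} \log p(y \mid \eta) \right)^{-1}$$

For the Bernoulli-logit model this gives the 1 / (mu * (1 - mu)) discussed above, and the Weibull case would follow by working out the corresponding derivative for the Weibull log-likelihood.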
@avehtari I tried to adapt my model to survival outcomes via a weibull family outcome and have learned that Gaussian processes and Weibull models don't seem to be very tractable together. I get a lot of maximum treedepth reached warnings no matter what I set the GP lengthscale parameters to. I'm coming around to thinking that the generally weakly informed lengthscales are even less well informed when you wrap them in a Weibull likelihood. I was wondering if you knew of good approaches to using GP regression in the context of survival analysis?
The good thing is that this is just an indication of inefficiency and does not invalidate the inference.
Yes.
I've had success with GPs in survival analysis, but I did not care about the posterior for the lengthscales, knowing they are not well informed. Can you tell me more about what you are trying to achieve?
Ok, so the whole exercise is to make a version of bkmr (Introduction to Bayesian kernel machine regression and the bkmr R package) in Stan. Very short version: the GP is applied across multiple exposure variables, and the posteriors of the lengthscales are visualised to capture the relationships between exposures and the outcome (Introduction to Bayesian kernel machine regression and the bkmr R package). So, with a model working well enough for gaussian and binomial outcomes, I'm trying to extend it to Weibull. The original does not have a survival version, though someone has applied it to survival data by splitting the cohort at each event, creating pseudo-observations, and using a probit version (The association of urine metals and metal mixtures with cardiovascular incidence in an adult population from Spain: the Hortega Follow-Up Study - PMC). I can probably do this, but I'm unsure how valid this approach is, so I was trying Weibull.
I think using the lengthscale posteriors is the weakest way of estimating the relevance of the variables. The link you provided did not show lengthscale posteriors, but the marginal predictions. Looking at the marginal and conditional predictions (shown later in the bkmr intro) makes more sense than looking at the lengthscale posteriors. If you have many variables, looking at the marginal and conditional predictions can be difficult, and then methods providing a scalar quantity for the relevance of each variable are useful. How elaborate an approach you need depends on how many variables you have and how weak the effects are that you want to see. The posterior probability approach can work well with independent variables (although even then it can be sensitive to prior choices) but is not easy to interpret in the case of dependent variables. The bkmr intro discusses a hierarchical selection approach, but that is not automatic, as the groups need to be decided by the user. I think feature collapsing (I linked the paper earlier) is probably the easiest approach to implement with Stan. The projection predictive approach would be more efficient but more complicated to implement.
I have used GPs with all these variants. How big is your data? Can you tell me more about the structure of the data and the phenomenon you are modeling?
Oh yes, sorry, I did not mean to mislead here. I am not using lengthscales to estimate relevance. I am also making graphs of conditional predictions. My previous questions were more from the point of view of understanding the priors for the lengthscales, as I'm not doing this the way bkmr does.
I don’t follow you here, can you elaborate?
Ah yes, I should have said - I'm not interested in this hierarchical selection feature. It is an optional feature in bkmr for when you have many exposures from different chemical families, and I'm not working with such data. I have some crude code I mashed together to implement FC ranking for this model, but it's currently not working, as I changed something in the model and I've yet to update it. I guess I'm jumping between subtopics a bit from time to time as my understanding improves. I'm learning as I go, sorry if my questions seem a bit chaotic!
I have toxic metal measurements in blood for case-control data for ALS patients and healthy controls, with case-control status as one outcome (so binomial N ~350), and survival in patients only (N ~ 200) as another outcome.
In the case of dependent covariates, you can drop any single one without a big drop in performance, and thus the marginal probabilities will be diluted.

I have toxic metal measurements in blood for case-control data for ALS patients and healthy controls
How many different toxic metal measurements? Do you expect interactions between the toxic metals?

(so binomial N ~350), and survival in patients only (N ~ 200) as another outcome.
Seems like small data, which:
- probably would work with brms, making the model building easier
- with many covariates (that’s why I keep asking this), you can’t get good results with all interactions.

In the case of dependent covariates, you can drop any single one without a big drop in performance, and thus the marginal probabilities will be diluted.
Ah makes sense.

How many different toxic metal measurements? Do you expect interactions between the toxic metals?
Sorry - 9 metals. Yes, I would expect interactions between the metals, and possibly also an interaction with gender.

with many covariates (that’s why I keep asking this), you can’t get good results with all interactions.
So 9 exposures (i.e. in the GP), and no more than 5 X variables for the linear component.
Ah, note also that the N's above are cases with complete data. There are others with missing values, but for now I've been ignoring that, as it has seemed complicated enough :D

probably would work with brms, making the model building easier
So I think brms cannot do the individual lengthscales (unless they can be added via stanvars or something?). I do want to use a different lengthscale per exposure - not for selection, but to allow a different number of zero crossings per exposure. I have a version for the binary outcome that works, but I'm in the process of revising the code (there are quite a few R functions for the posterior predictions, so it takes me a bit). For the Weibull I'm still getting garbage out and haven't figured out why, but over the coming week or two I'm going to try to fix up my code.
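(By individual lengthscales I mean something like the following - a simplified sketch using Stan's built-in ARD covariance, with placeholder names rather than my actual code:)

  data {
    int<lower=1> n_obs;
    int<lower=1> n_exp;
    array[n_obs] vector[n_exp] Z_arr;  // exposures, one vector per observation
  }
  parameters {
    real<lower=0> sdgp;                // shared GP magnitude
    vector<lower=0>[n_exp] lscale;     // one lengthscale per exposure
  }
  transformed parameters {
    matrix[n_obs, n_obs] K = gp_exp_quad_cov(Z_arr, sdgp, lscale)
                             + diag_matrix(rep_vector(1e-8, n_obs));
  }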