Ah great, thanks @avehtari - indeed I got the impression it was an evolving topic, but thanks for the pointers in the right direction! I look forward to seeing the paper you're working on!
Hey @avehtari, would the idea be to use the intercept for the pseudo-variance? E.g. if you have a Bernoulli model with the back-transformed intercept alpha on the probability scale, pseudo_var = pow(alpha * (1 - alpha), -1)?
If your alpha includes all components affecting the probability, then yes (it's not clear what you mean by intercept). For example, in @jroon's code there was
bernoulli_logit(Intcpt + Xc * beta + H)
so all of these should be included in the back transformation.
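For concreteness, a rough sketch using the variable names from that code (N, mu, and pseudo_var are my placeholders; how you reduce this to a single representative value, e.g. by averaging, is a separate choice):

  // sketch: per-observation pseudo-variances for the Bernoulli-logit model,
  // with the full linear predictor back-transformed to the probability scale
  vector[N] mu = inv_logit(Intcpt + Xc * beta + H);
  vector[N] pseudo_var = inv(mu .* (1 - mu));  // 1 / (mu * (1 - mu))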
I wouldn't have thought that would work, because the scales for the R2D2 coefficients and random effects are constructed using the pseudo-variance, but if you use the entire linear predictor to determine the pseudo-variance, doesn't that create a circular graph?
Hi @avehtari. I finally found some more time to play with this. I was hoping you could clarify something for me, however. Suppose I use brms to fit a model with an R2D2 prior on a GP and then look at the code:
library(brms)
dat <- mgcv::gamSim(1, n = 30, scale = 2)
# fit GP model
fit1 <- brm(y ~ x1 + gp(x2), dat, chains = 2,
            prior = set_prior(R2D2(mean_R2 = 0.8, prec_R2 = 2, cons_D2 = 0.9), class = "sdgp"))
stancode(fit1)
Relevant code for R2D2/my question:
transformed parameters {
  vector<lower=0>[Kgp_1] sdgp_1;  // GP standard deviation parameters
  real R2D2_tau2;  // global R2D2 scale parameter
  vector<lower=0>[Kscales] scales;  // local R2D2 scale parameters
  real lprior = 0;  // prior contributions to the log posterior
  // compute R2D2 scale parameters
  R2D2_tau2 = sigma^2 * R2D2_R2 / (1 - R2D2_R2);
  scales = scales_R2D2(R2D2_phi, R2D2_tau2);
  sdgp_1 = scales[(1):(Kgp_1)];
  lprior += student_t_lpdf(Intercept | 3, 7.5, 4.8);
  lprior += inv_gamma_lpdf(lscale_1[1][1] | 1.494197, 0.056607);
  lprior += beta_lpdf(R2D2_R2 | R2D2_mean_R2 * R2D2_prec_R2, (1 - R2D2_mean_R2) * R2D2_prec_R2);
  lprior += student_t_lpdf(sigma | 3, 0, 4.8)
            - 1 * student_t_lccdf(0 | 3, 0, 4.8);
}
…and the call to the GP function elsewhere…
cov = gp_exp_quad_cov(x, sdgp, lscale[1]);
My question is - since this is a function for covariance, why is the R2D2 prior set only on the magnitude sdgp term of the GP, and not also on the lscale parameter? Would it be wrong to set the R2D2 prior on the lscale instead? Or include both terms in the R2D2 prior? I ask mainly because in my model I want to set it on the equivalent of the lscale (or both).
The R2D2 prior was originally derived as a prior for normal linear model coefficients, so that each coefficient has its own prior and the scales of these priors have a joint prior. In the case of a GP we don't have the coefficients, but we can still use the R2D2 approach to create a joint prior for a collection of scales (magnitudes). In the code you attached, sdgp_1 = scales[(1):(Kgp_1)]; is doing the part of assigning sdgp_1 to be one of the R2D2 local scales. In your example, you are setting the prior only on a single scale (magnitude) parameter, so you don't get much benefit from that. You can set up R2D2 in brms so that it makes a joint prior, and in your case you could use the joint R2D2 also for the x1 coefficient.

R2D2 uses a Dirichlet (the D in R2D2) to share the signal variance between different local scales. The GP lengthscale affects the prior variance, but it is not directly connected, and thus using the R2D2 prior jointly for scales (magnitudes) and lengthscales does not make sense. I don't remember seeing any proposal for a joint prior for the magnitude and several lengthscales. In brms you could set a separate prior for the lengthscales, but R2D2 would not be theoretically justified. I don't have a recommendation for an R2D2-style joint prior for lengthscales, but you could use some thick-tailed distribution with a lot of mass near zero for the inverse lengthscale to favour long lengthscales (no non-linearity), as in the sketch below.
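A minimal sketch of that last idea (parameter names are placeholders, and the half-Cauchy is just one thick-tailed choice that fits the description):

  parameters {
    real<lower=0> inv_lscale;  // inverse lengthscale, 1 / lscale
  }
  transformed parameters {
    real<lower=0> lscale = inv(inv_lscale);
  }
  model {
    // thick-tailed prior with a lot of mass near zero on the inverse
    // lengthscale: small inv_lscale means a long lengthscale, i.e. a
    // nearly linear (or constant) function
    inv_lscale ~ cauchy(0, 1);
  }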
Ah ok, I revised the math and I believe I get it now. For the prior variance at a given point along x, the (x_i - x_j) component reduces to 0, so the prior variance is just the magnitude parameter (squared), right?
But what I find myself wondering is this: I agree the lengthscale has little importance for the prior variance. But thinking about multiple lengthscales, and in particular the bkmr model with inverted lengthscales, is that still true? My motivation for wanting to use R2D2 on the lengthscales was variable selection, as that is an aim. So take the example above (R2D2 prior and Gaussian Processes - #9 by jroon), where I'm using a Cauchy for selection on r, which enters the covariance function as:
matrix[n_obs, n_exp] Zr = Z .* rep_matrix(sqrt(r'), n_obs);
matrix[n_obs, n_obs] K = rep_matrix(0.0, n_obs, n_obs);
// calculate K
for (i in 1:n_obs) {
  for (j in 1:n_obs) {
    if (i < j) {
      for (k in 1:n_exp) {
        K[i, j] += square(Zr[i, k] - Zr[j, k]);
      }
      K[j, i] = K[i, j];
    }
  }
}
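(Roughly speaking, K then enters the covariance along the lines below - sigma_h and eta are placeholders here, not my exact code - which also shows that the diagonal of exp(-K) is always 1 regardless of r:)

  // sketch: the weighted squared-distance matrix K becomes a covariance
  matrix[n_obs, n_obs] Sigma = square(sigma_h) * exp(-K)
                               + diag_matrix(rep_vector(1e-8, n_obs));
  vector[n_obs] H = cholesky_decompose(Sigma) * eta;  // non-centred GP draw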
So if a column/dimension of Z is "deselected", then the corresponding r[i] would go to zero. I guess r still can't really affect the prior variance. But in my experiments a model like that seemed to produce the best prior predictive checks, and I don't know why (possibly errors, I need to recheck it).
Oh, why didn't you say that in the first place? You should not mix prior information with the decision task of variable selection. The prior should be chosen based on prior information, and then decision analysis should be used to make the variable selection, which is much easier and more efficient. For GPs specifically, see Variable selection for Gaussian processes via sensitivity analysis of the posterior predictive distribution for why looking at the lengthscale is bad for variable selection. Other relevant papers for variable selection for GPs:
The variable selection method in the referenced paper - I followed the link to the Python code, but I'm not that strong in Python. Is this decision analysis done after fitting the model, i.e. is it something you apply to a fitted model object, for example? Has it been implemented in Stan?
EDIT: OK, I understand the Python code a bit better now; I see the variable selection is done after fitting the model. Is there a version in R, do you know?
"Fitting the model" is Bayesian inference producing draws from the posterior. Variable selection is making decisions given the posterior and a utility/loss.
That paper specifically had an illustration of why you should not try to do variable selection by looking at the lengthscales. You can use the methods from the other papers, too. Unfortunately, we don't have R code for these. Stan is not very good for GPs with many lengthscales, so there hasn't been a need to make an illustration with Stan. Of the four approaches I listed, feature collapsing is probably the easiest to implement: implement sampling from the predictive distribution in the generated quantities block, run the standalone generated quantities with different features collapsed, and measure the distances to the original predictive distribution. Unfortunately I don't have time to make you a full code illustration, but a bare skeleton of the generated quantities idea is sketched below, and if you try to fill it in for your model, I can continue answering questions.
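Here is that skeleton, using a plain GP regression with a Gaussian likelihood just to keep it short. All names are placeholders and I have not run this, so treat it only as an outline of the steps:

  // Skeleton of feature collapsing: generated quantities draws from the
  // predictive distribution at inputs X_eval. Run once with X_eval equal to
  // the original inputs, and once per feature with that column of X_eval
  // replaced by e.g. its mean, then measure the distance between each
  // collapsed predictive distribution and the original one (outside Stan).
  data {
    int<lower=1> N;
    int<lower=1> D;
    array[N] vector[D] X;        // training inputs
    vector[N] y;
    array[N] vector[D] X_eval;   // original or feature-collapsed inputs
  }
  parameters {
    real<lower=0> sdgp;          // GP magnitude
    vector<lower=0>[D] lscale;   // one lengthscale per input dimension
    real<lower=0> sigma;         // residual standard deviation
  }
  model {
    matrix[N, N] K = gp_exp_quad_cov(X, sdgp, lscale)
                     + diag_matrix(rep_vector(square(sigma), N));
    // placeholder priors
    sdgp ~ normal(0, 1);
    lscale ~ inv_gamma(5, 5);
    sigma ~ normal(0, 1);
    y ~ multi_normal(rep_vector(0, N), K);
  }
  generated quantities {
    vector[N] y_rep;             // predictive draw at X_eval
    {
      matrix[N, N] K = gp_exp_quad_cov(X, sdgp, lscale)
                       + diag_matrix(rep_vector(square(sigma), N));
      matrix[N, N] K_cross = gp_exp_quad_cov(X, X_eval, sdgp, lscale);
      matrix[N, N] K_eval = gp_exp_quad_cov(X_eval, sdgp, lscale);
      matrix[N, N] L = cholesky_decompose(K);
      vector[N] K_inv_y = mdivide_right_tri_low(mdivide_left_tri_low(L, y)', L)';
      vector[N] mu = K_cross' * K_inv_y;
      matrix[N, N] v = mdivide_left_tri_low(L, K_cross);
      matrix[N, N] cov = K_eval - v' * v
                         + diag_matrix(rep_vector(square(sigma) + 1e-8, N));
      y_rep = multi_normal_rng(mu, cov);
    }
  }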
Hi both. Just looking at this topic some more. It's not totally clear to me where in the code to make this transformation. The following code is generated by brms for a gaussian outcome, from the transformed parameters block:
// compute R2D2 scale parameters
R2D2_tau2 = sigma^2 * R2D2_R2 / (1 - R2D2_R2);
Am I correct that the transformation should be used here instead of sigma^2?
Edit: So in my case I would do this (as these variables are vectors, I took the mean over them):
// back-transform the (averaged) linear predictor to the probability scale
// before computing the pseudo-variance 1 / (mu * (1 - mu))
R2D2_tau2 = pow(inv_logit(mean(Intcpt + Xc * beta + H))
                * (1 - inv_logit(mean(Intcpt + Xc * beta + H))), -1)
            * R2D2_R2 / (1 - R2D2_R2);
Aki, I was also wondering if there is a pseudo-variance transformation for the Weibull family, do you know? Thanks
Yes, replace the variance (sigma^2) with the pseudo-variance (and I would not call it a transformation; every time you wrote "transformation" you should have written "pseudo-variance").
Sure, you can compute the pseudo-variance using the equations given, for example, in Appendix B of Sparsity information and regularization in the horseshoe and other shrinkage priors.
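Roughly, the construction (as I recall it) takes a local Gaussian approximation to the non-Gaussian observation model, so the pseudo-variance is the inverse of the negative second derivative of the log-likelihood with respect to the linear predictor, evaluated at some representative value:

$$\tilde{\sigma}^2 = \left( -\frac{\partial^2}{\partial \eta^2} \log p(y \mid \eta) \right)^{-1}$$

For the Bernoulli-logit model this gives the 1 / (mu * (1 - mu)) discussed above, and the Weibull case would follow by working out the corresponding derivative for the Weibull log-likelihood.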
@avehtari I tried to adapt my model to survival outcomes via a weibull family outcome and have learned that Gaussian processes and Weibull models don't seem to be very tractable together. I get a lot of maximum treedepth reached warnings no matter what I set the GP lengthscale parameters to. I'm coming around to thinking that the generally weakly informed lengthscales are even less well informed when you wrap them in a Weibull likelihood. I was wondering if you knew of good approaches to using GP regression in the context of survival analysis?
The good thing is that this is just an indication of inefficiency and does not invalidate the inference.
Yes.
I've had success with GPs in survival analysis, but I did not care about the posterior for the lengthscales, knowing they are not well informed. Can you tell me more about what you are trying to achieve?
Ok, so the whole exercise is to make a version of bkmr (Introduction to Bayesian kernel machine regression and the bkmr R package) in Stan. Very short version: the GP is applied across multiple exposure variables, and the posteriors of the lengthscales are visualised to capture the relationships between exposures and the outcome (Introduction to Bayesian kernel machine regression and the bkmr R package). So, with a model working well enough for gaussian and binomial outcomes, I'm trying to extend it to Weibull. The original does not have a survival version, though someone has applied it to survival data by splitting the cohort at each event, creating pseudo-observations, and using a probit version (The association of urine metals and metal mixtures with cardiovascular incidence in an adult population from Spain: the Hortega Follow-Up Study - PMC). I can probably do this, but I'm unsure how valid this approach is, so I was trying Weibull.
I think using the lengthscale posteriors is the weakest way of estimating the relevance of the variables. The link you provided did not show lengthscale posteriors, but the marginal predictions. Looking at the marginal and conditional predictions (shown later in the bkmr intro) makes more sense than looking at the lengthscale posteriors. If you have many variables, looking at the marginal and conditional predictions can be difficult, and then methods providing a scalar quantity for the relevance of each variable are useful. How elaborate an approach you need depends on how many variables you have and how weak the effects are that you want to see. The posterior probability approach can work well with independent variables (although even then it can be sensitive to prior choices) but is not easy to interpret in the case of dependent variables. The bkmr intro discusses a hierarchical selection approach, but that is not automatic, as the groups need to be decided by the user. I think feature collapsing (I linked the paper earlier) is probably the easiest approach to implement with Stan. The projection predictive approach would be more efficient but more complicated to implement.
I have used GPs with all these variants. How big is your data? Can you tell me more about the structure of the data and the phenomenon you are modeling?
Oh yes, sorry, I did not mean to mislead here. I am not using lengthscales to estimate relevance. I am also making graphs of conditional predictions. My previous questions were more from the point of view of understanding the priors for the lengthscales, as I'm not doing this the way bkmr does.
I don’t follow you here, can you elaborate?
Ah yes, I should have said - I'm not interested in this hierarchical selection feature. It is an optional feature in bkmr for when you have many exposures from different chemical families, and I'm not working with such data. I have some crude code I mashed together to implement FC ranking for this model, but it's currently not working, as I changed something in the model and I've yet to update it. I guess I'm jumping between subtopics a bit from time to time as my understanding improves. I'm learning as I go, sorry if my questions seem a bit chaotic!
I have toxic metal measurements in blood for case-control data for ALS patients and healthy controls, with case-control status as one outcome (so binomial N ~350), and survival in patients only (N ~ 200) as another outcome.
In the case of dependent covariates, you can drop any single one without a big drop in performance, and thus the marginal probabilities will be diluted.

I have toxic metal measurements in blood for case-control data for ALS patients and healthy controls
How many different toxic metal measurements? Do you expect interactions between the toxic metals?

(so binomial N ~350), and survival in patients only (N ~ 200) as another outcome.
Seems like small data, which:
- probably would work with brms, making the model building easier
- with many covariates (that’s why I keep asking this), you can’t get good results with all interactions.

In the case of dependent covariates, you can drop any single one without a big drop in performance, and thus the marginal probabilities will be diluted.
Ah makes sense.

How many different toxic metal measurements? Do you expect interactions between the toxic metals?
Sorry - 9 metals. Yes, I would expect interactions between the metals, and possibly also an interaction with gender.

with many covariates (that’s why I keep asking this), you can’t get good results with all interactions.
So 9 exposures (i.e. in the GP), and no more than 5 X variables for the linear component.
Ah, note also that the N's above are cases with complete data. There are others with missing values, but for now I've been ignoring that, as it has seemed complicated enough :D

probably would work with brms, making the model building easier
So I think brms cannot do the individual lengthscales (unless they can be added via stanvars or something?). I do want to use a different lengthscale per exposure - not for selection, but to allow a different number of zero crossings per exposure. I have a version for the binary outcome that works, but I'm in the process of revising the code (there are quite a few R functions for the posterior predictions, so it takes me a bit). For the Weibull I'm still getting garbage out and haven't figured out why, but over the coming week or two I'm going to try to fix up my code.
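(By individual lengthscales I mean something like the following - a simplified sketch using Stan's built-in ARD covariance, with placeholder names rather than my actual code:)

  data {
    int<lower=1> n_obs;
    int<lower=1> n_exp;
    array[n_obs] vector[n_exp] Z_arr;  // exposures, one vector per observation
  }
  parameters {
    real<lower=0> sdgp;                // shared GP magnitude
    vector<lower=0>[n_exp] lscale;     // one lengthscale per exposure
  }
  transformed parameters {
    matrix[n_obs, n_obs] K = gp_exp_quad_cov(Z_arr, sdgp, lscale)
                             + diag_matrix(rep_vector(1e-8, n_obs));
  }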