R2D2 prior and Gaussian Processes

  • Years ago I successfully used a GP with an exponentiated quadratic covariance function with all interactions on similarly sized data, but now I would also recommend checking additive GPs, where you can include the desired level of interactions, which increases the efficiency for main and low-order interactions; see Additive Gaussian Processes Revisited
  • Stan is not great for GPs with the covariance matrix presentation due to how the autodiff has been implemented (there is ongoing work that may eventually lead to much improved performance). Assuming N observations and K covariates, with a separate length scale for each covariate the autodiff will create K*N^2 nodes, which can make things slow. Your N and K seem small enough that it is feasible.
  • MCMC is not great for GPs with non-normal observation models, as the joint posterior of the latent values and covariance function parameters has a challenging shape. GP-specific software integrates out the latent values using Laplace, variational, or expectation propagation algorithms, but these are not available in Stan. Your data might be small enough that sampling would not be terribly slow. Laplace inside Stan models is coming, maybe early next year, which is probably too late for you.
  • It is good to start with a simpler model / code. Thus starting with brms-generated code is a good idea even if it doesn’t support separate length scales, as you would then get an idea of how efficient the inference is before making it slower by adding more parameters.
  • With GP-specific software, your data would be considered small, and the inference would be really fast. Stan provides more flexibility in model building, and I hope at some point Stan will be even faster.
  • Weibull should be an easy distribution, so I would expect a mistake in your code.
1 Like

Thanks @avehtari .

  • I’ve tried reading the additive GP paper, but it’s a bit over my head honestly. I don’t know how I could apply it.
  • I agree my N and K are practically feasible - I have tractable models already, but the results are a bit wrong.
  • Interesting point about Laplace - too late for this project but I’m sure there will be another :D
  • I don’t think complexity is my current issue; I found and fixed some mistakes in the code since my last post.
  • I didn’t know there was GP-specific software. Is INLA good for this purpose? (I’ve used it for spatial mapping.)
  • I’ve fixed some mistakes; there may be more.

Actually, thinking about it, I have a math issue too. The clips below are an extract from one of the bkmr papers. Given this likelihood:

… where they define the equation for the conditional posteriors of the GP functions:

So although they say this can be derived following routine calculations, I am lost as to how they get there. The other point is that this is for a Gaussian outcome. I’ve implemented it for a Gaussian GP regression model and it works well. However, I’m unclear how to modify equation (3) if the outcome, y, is binary and the model is logistic. Do you have any pointers here @avehtari?

Hi again @avehtari. I’ve made a lot of progress with this, but I’ve run into an irksome problem. I have a model for a Bernoulli outcome with the GP and R2D2 using the pseudo-variance as you advised, and a centered parameterisation for the betas on the X parameters. But that sometimes causes divergences when beta approaches zero. So I’m trying to implement a non-centered parameterisation on the betas. However, that leads me to an issue where there is a circular definition between the value of beta and the pseudo-variance. I think it’s easiest to explain with some code extracts:

For the centered parameterisation I have this:

parameters {
    vector[n_cov] beta;

    <..snip...>

    // parameters of the R2D2 prior
    real<lower=0, upper=1> R2D2_R2;
    simplex[2] R2D2_phi;
}

transformed parameters {
    real<lower=0> sd_b;  // Hierarchical SD of the beta coefficients

    real R2D2_tau2;  // global R2D2 scale parameter
    vector<lower=0>[2] scales;  // level1 R2D2 scale parameters
    real pseudovar;

    // compute R2D2 scale parameters
    vector[n_obs] mu_star = Intcpt + Xc * beta + F;
    pseudovar = (1/mean(inv_logit(mu_star))) * (1/(1 - mean(inv_logit(mu_star))));
    R2D2_tau2 = pseudovar * R2D2_R2 / (1 - R2D2_R2);
    scales = scales_R2D2(R2D2_phi, R2D2_tau2);
    sd_b = scales[1];

}
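As an aside, the arithmetic in that block is easy to sanity-check outside Stan. A minimal Python sketch (all numbers hypothetical, just tracing the same computation):

```python
import math

def scales_r2d2(phi, tau2):
    # mirrors the Stan helper scales_R2D2: element-wise sqrt(phi * tau2)
    return [math.sqrt(p * tau2) for p in phi]

# hypothetical draws, purely to trace the arithmetic
p_bar = 0.5        # mean(inv_logit(mu_star))
R2 = 0.5           # R2D2_R2
phi = [0.3, 0.7]   # R2D2_phi simplex

pseudovar = (1 / p_bar) * (1 / (1 - p_bar))  # 1 / (p * (1 - p))
tau2 = pseudovar * R2 / (1 - R2)             # global R2D2 scale
scales = scales_r2d2(phi, tau2)
sd_b = scales[0]   # hierarchical SD of the beta coefficients
```

Note that 1/(p(1-p)) is minimised at p = 0.5 with value 4, so the pseudo-variance can only inflate tau2 relative to the plain R2/(1-R2) factor.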

So this works without a problem. For the non-centred version I have:

parameters {
    vector[n_cov] zb;  // unscaled coefficients

    <..snip...>

    // parameters of the R2D2 prior
    real<lower=0, upper=1> R2D2_R2;
    simplex[2] R2D2_phi;

}

transformed parameters {
    vector[n_cov] beta;

    real<lower=0> sd_b;  // Hierarchical SD of the beta coefficients

    real R2D2_tau2;  // global R2D2 scale parameter
    vector<lower=0>[2] scales;  // level1 R2D2 scale parameters
    real pseudovar;

    // compute R2D2 scale parameters
    vector[n_obs] mu_star = Intcpt + Xc * beta + F;
    pseudovar = (1/mean(inv_logit(mu_star))) * (1/(1 - mean(inv_logit(mu_star))));
    R2D2_tau2 = pseudovar * R2D2_R2 / (1 - R2D2_R2);
    scales = scales_R2D2(R2D2_phi, R2D2_tau2);
    sd_b = scales[1];

    beta = zb .* sd_b;  // scale coefficients
}

With this code I get the error:

And thinking about it, this makes sense, because in this version pseudovar depends on a parameter defined later in the same block. I’ve tried moving things between blocks in various combinations but can’t find something that works, since one way or another it ends up with the same issue. Do you have any suggestions?

I’ve brought this up a couple of times, so I’m keen to see what the answer is. My solution has been to just use the intercept to produce the pseudo-variance.

1 Like

I am a bit out of sync here, but this sounds very related to my hack here:

The trick is to declare a sigma as part of the brmsfamily.

I hope this helps.

1 Like

Ah, so I had similar thoughts this morning and tried replacing beta with zb in the mu_star term. So:

My thinking was: OK, I know this is something of a hack, but at least the “pseudo-variance” depending on mu_star now scales with the parameters of the model, even if it is a little off the value it should be. I’ve not had a lot of time to test this, but it seems to work on simple simulated data.

I think we have kind of arrived at the same logic and approach?

Wait @jroon, beta has been declared in transformed parameters but is still empty by the time you use it in mu_star? How does that work? This is how I do it, using a binomial likelihood as an example with the pseudo-variance:

parameters {
  real alpha;  // intercept
  real<lower=0, upper=1> R_sq;
  simplex[X] phi;  // variance partitions
  vector[X] z;  // coefficient z-scores
}
transformed parameters {
  real tau_sq = R_sq / (1 - R_sq);
  real pseudo_var = pseudo_var_binomial_logit(alpha);
  vector[X] scales = sqrt(tau_sq * pseudo_var * phi);
  vector[X] beta = scales .* z;
}

where I wrote some functions to get pseudo variances.

/**
 * Pseudo-variances for GLMs. See Piironen and Vehtari (2020). Sparsity 
 * information and regularization in the horseshoe and other shrinkage priors.
 * 
 * @param   Mean parameter or linear predictor (overloaded per canonical link)
 * 
 * @return  Pseudo-variance
 */
real pseudo_var_binomial(real p) {
  return inv(p * (1 - p));
}
real pseudo_var_binomial_logit(real logit_p) {
  return 2 + exp(logit_p) + exp(-logit_p);
}
real pseudo_var_poisson(real mu) {
  return inv(mu);
}
real pseudo_var_poisson_log(real log_mu) {
  return exp(-log_mu);
}
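In case it helps to double-check those functions: the two binomial versions are algebraically the same, since with p = inv_logit(x) we have 1/(p(1-p)) = (1+e^x)^2/e^x = 2 + e^x + e^(-x). A small Python verification (not part of the model):

```python
import math

def inv_logit(x):
    # standard logistic function
    return 1.0 / (1.0 + math.exp(-x))

def pseudo_var_binomial(p):
    # pseudo-variance on the probability scale
    return 1.0 / (p * (1.0 - p))

def pseudo_var_binomial_logit(logit_p):
    # same quantity parameterised by the linear predictor
    return 2.0 + math.exp(logit_p) + math.exp(-logit_p)

# the two parameterisations agree at any linear-predictor value
for x in (-2.0, 0.0, 0.7, 3.0):
    assert abs(pseudo_var_binomial(inv_logit(x))
               - pseudo_var_binomial_logit(x)) < 1e-9
```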

I hope it’s correct. But I don’t know how to get the pseudo variance from the full linear predictor when the coefficients depend on the pseudo variance.

1 Like

I guess that’s just how Stan works. Also, if you check again, mu_star is defined before beta is assigned. And mu_star depends on beta, while beta depends on things defined from mu_star, so I guess it makes sense that it doesn’t work, since it is circular. Thanks for sharing your approach.

Yes, this is the heart of the issue, and I guess it’s not solvable :/

Pinging @davkoh who has recently run experiments with pseudo-variance and hopefully knows the answer (I would need half an hour focus time to provide a confident answer)

1 Like

Hey! Yes indeed, I did some simulations, and what works really well is just using the data-based approximation to the pseudo-variance presented in the Piironen & Vehtari (2017) paper, Table 1. For the binomial model it’s simply:

pseudo_var = mu^(-1)*(1-mu)^(-1)

where mu is approximated by the sample average of y. With this, it won’t depend on parameters in the linear predictor, and sampling should be easier. We do something similar in the ARR2 paper for the R2 definition for AR-type models.
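Concretely, with made-up binary outcomes, the data-based version is just (my sketch):

```python
# hypothetical binary outcomes; in the real model y comes from the data block
y = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]

mu = sum(y) / len(y)                    # sample average of y
pseudo_var = mu**(-1) * (1 - mu)**(-1)  # Piironen & Vehtari (2017), Table 1
```

Since this uses only the data, it can be computed once up front rather than at every leapfrog step.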

2 Likes

Thanks @davkoh. So since mu = mean(y);, the pseudo_var = mu^(-1)*(1-mu)^(-1) calculation could simply be done once in the transformed data block?

Aye, exactly! I just pass it on as data, but that would work too.

1 Like

Great thanks, I’ll give it a try!

1 Like

Thanks @davkoh for the reminder. Now I remember that for the Piironen & Vehtari (2017) paper I picked up the idea of using an approximation that is independent of the model parameters from Neal (1996), Bayesian Learning for Neural Networks, Appendix A.4. (I think he discussed this somewhere else as well, and others have also used alternative approximations.)

1 Like

And here is the link to the appendix if anybody is interested: https://link.springer.com/content/pdf/bbm:978-1-4612-0745-0/1

Thanks for sharing @davkoh. In something like a zero-inflated model the sample average might be less useful. Have you ever tried using the intercept to compute the pseudo-variance? I believe this can work as well? At least I’ve never noticed any issues when doing SBC.

@mhollanders, I haven’t tried using the intercept for the pseudo-variance. However, Yanchenko et al. (2025) briefly discuss that in Section 3.2.1. Your mileage may vary with that approach.

Cool. That paper is a bit over my head. At the bottom of page 9 they do say that the zero-inflated Poisson has R^2 < 1, which I don’t really understand, as I’d think that in those models you’d have linear predictors for both the zero-inflation component and the count component (i.e. ecological occupancy models), both of which may have R2D2 priors placed on them.

So the replies so far have been really helpful, and I’ve managed to remove divergences from my runs. But I’m consistently getting BFMI warnings, typically across several chains, and sometimes bulk and tail ESS warnings. This is the current model:

functions {
    // Calculate distance/covariance matrix for exposures
    matrix cov_Kzr(matrix Z, vector r) {
        // Define variables
        int n_obs = dims(Z)[1];
        int n_exp = dims(Z)[2];
        matrix[n_obs, n_obs] K = rep_matrix(0.0, n_obs, n_obs);

        // calculate K
        for (i in 1:n_obs){
            for (j in 1:n_obs){
                if (i < j){
                    K[i, j] += dot_self( (Z[i,] - Z[j,] ) .* sqrt(r)');
                    K[j, i] = K[i, j];
                }
            }
        }
        return K;
    }

    // compute scale parameters of the R2D2 prior    //
    vector scales_R2D2(vector phi, real tau2) {
        return sqrt(phi * tau2);
    }
}

data {
    int n_obs;
    int n_exp;
    int n_cov;

    array[n_obs] int y;
    matrix[n_obs, n_exp] Z;
    matrix[n_obs, n_cov] X;

    // data for the R2D2 prior
    real<lower=0> R2D2_mean_R2;         // mean of the R2 prior
    real<lower=0> R2D2_prec_R2;         // precision of the R2 prior
    vector<lower=0>[3] R2D2_cons_D2;    // concentration vector of the D2 prior

    // prior for r
    real rpar2;


    int prior_only;
}

transformed data {
    //Center X matrix
    matrix[n_obs, n_cov] Xc;  // centered version of X without an intercept
    vector[n_cov] means_X;  // column means of X before centering
    for (i in 1:n_cov) {
        means_X[i] = mean(X[, i]);
        Xc[, i] = X[, i] - means_X[i];
    }

    // Scale and center Z matrix
    matrix[n_obs, n_exp] Zc;  // centered version of Z without an intercept
    vector[n_exp] means_Z;  // column means of Z before centering
    for (i in 1:n_exp) {
        means_Z[i] = mean(Z[, i]);
        Zc[, i] = (Z[, i] - means_Z[i])/sd(Z[, i]);
    }

    real pseudovar = (1/mean(y)) * (1/(1 - mean(y)));
}

parameters {
    vector[n_cov] beta;
    real Intcpt;

    //length scales
    vector<lower=0.0>[n_exp] r;

    // parameters of the R2D2 prior
    real<lower=0, upper=1> R2D2_R2;
    simplex[3] R2D2_phi;

    vector[n_obs] F;
}

transformed parameters {
    real<lower=0> sd_b;  // Hierarchical SD of the beta coefficients
    real<lower=0.0> lambda;
    real<lower=0.0> delta;

    real R2D2_tau2;  // global R2D2 scale parameter
    vector<lower=0>[3] scales;  // level1 R2D2 scale parameters

    // compute R2D2 scale parameters
    R2D2_tau2 = pseudovar * R2D2_R2 / (1 - R2D2_R2);
    scales = scales_R2D2(R2D2_phi, R2D2_tau2);
    sd_b = scales[1];

    lambda = scales[2];
    delta = scales[3];
}

model {
    // local params
    matrix[n_obs, n_obs] V;
    V = add_diag( lambda * exp(-cov_Kzr(Zc, r)) , delta );

    // GP
    target += multi_normal_cholesky_lpdf(F | rep_vector(0, n_obs), cholesky_decompose(V));
    // Likelihood
    if (!prior_only) {
        target += bernoulli_logit_glm_lupmf(y | Xc, Intcpt + F, beta);
    }

    // priors including constants
    target += normal_lpdf(beta | 0, sd_b);
    target += normal_lpdf(r | 0, rpar2);
    target += student_t_lpdf(Intcpt | 3, 0, 1);
    target += dirichlet_lpdf(R2D2_phi | R2D2_cons_D2);
    target += beta_lpdf(R2D2_R2 | R2D2_mean_R2 * R2D2_prec_R2, (1 - R2D2_mean_R2) * R2D2_prec_R2);
}

Now, through trial and error, I’ve figured out that the problem occurs when I have R2D2_phi values close to 0 or 1. Typically the variables with poor ESS are lambda and delta, which are closely related to phi. I can prevent the BFMI warnings by putting strong priors on the R2D2 concentration vector - for example, setting R2D2_cons_D2 to c(20, 20, 20), forcing all the R2D2_phi values into the middle range. And while I could run such models, it seems an extreme solution. I think conceptually the issue may be the competitive nature of the R2D2_phi values (i.e. my model does not allow for low values of all 3 elements of R2D2_phi), and I was wondering therefore if I should perhaps try the GDR2 prior (although I don’t currently understand it well enough). Any wise thoughts, anyone?

Do you still get them if you run more iterations?

There is just not much information in the data about some parameters, which leads to challenging posterior shape.

You could include the linear model (Intcpt + Xc*beta) in the GP covariance matrix (using dot product and constant covariance functions) to remove one identifiability issue and reduce the number of unknown parameters.
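A sketch of the covariance algebra behind this suggestion (mine, with hypothetical numbers): if a priori Intcpt ~ normal(0, s0) and beta ~ normal(0, sd_b), then Intcpt + Xc*beta + F has prior covariance s0^2 * 11' + sd_b^2 * Xc Xc' + K, i.e. a constant kernel plus a dot-product (linear) kernel added to the original GP kernel:

```python
# combined kernel = constant + dot-product (linear) + original GP covariance
def combined_kernel(Xc, K, s0_sq, sd_b_sq):
    n = len(Xc)
    p = len(Xc[0])
    return [[s0_sq
             + sd_b_sq * sum(Xc[i][k] * Xc[j][k] for k in range(p))
             + K[i][j]
             for j in range(n)]
            for i in range(n)]

# tiny hypothetical example: 3 observations, 2 centered covariates
Xc = [[1.0, -0.5], [-1.0, 0.0], [0.0, 0.5]]
K = [[1.0, 0.2, 0.1],
     [0.2, 1.0, 0.3],
     [0.1, 0.3, 1.0]]
V = combined_kernel(Xc, K, s0_sq=1.0, sd_b_sq=0.25)
```

With everything in one kernel, a single latent vector drawn from multi_normal(0, V) replaces Intcpt, beta, and the old F, which removes the additive non-identifiability between the linear part and the GP.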

After having everything in the GP covariance matrix, you could also integrate out F (the new one including everything) using the new laplace_marginal_bernoulli_logit_lpmf() function.

1 Like