Bayesian kernel machine regression in Stan

Hi,

I am trying to model the effects of multiple pollutants on cancer incidence at the county level with a Poisson regression model that also accounts for spatial correlation. I was planning to do this with Bayesian kernel machine regression (BKMR) to account for non-linear and multiplicative effects of the pollutants. Is there an implementation of Bayesian kernel machine regression in Stan?

I know that a Gaussian process is a specific form of Bayesian kernel machine regression and that it is implemented in brms. I have seen an example where the gp() function is used to model spatial correlation. Can it also be used to model the effects of the correlated pollutants, as in BKMR, and are there any examples of this?

Hi, @Theresa_U: I haven’t heard of Bayesian kernel machine regression, and as far as I know, there’s no discussion of it in our doc or case studies. We do have a lot of doc in Stan around Gaussian processes and also several tutorials outside of our doc.

If all you need to do is introduce some spatial smoothing, you can use a much more efficient ICAR model. Here’s a case study with Poisson observations:

It’s not nearly as general as a GP, but it scales better. For GPs, you can start with our user’s guide:

and then check out a bunch of the case studies:


Thank you, @Bob_Carpenter !

I will take a look at the literature and examples and see if I can adapt them to my case.

The BKMR also adds group-wise variable selection to the model and models complex interactions. In my dataset I want to model the relationships of many highly correlated variables and also do a group-wise variable selection in which I identify groups of variables relevant for the outcome. Which approach would you suggest for this?
I read that projection predictive feature selection is generally a good method for variable selection, but does it also work for many correlated variables, and would you include all two-way interactions as a starting point?

Hi @Theresa_U. Just saw this thread now. Were you able to make any progress here? I tried once to implement bkmr in Stan but didn’t get very far :/

Edit:

@Bob_Carpenter bkmr is a GP-based regression method for estimating the health effects of multi-pollutant mixtures (Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures - PMC). It’s implemented in R and has almost become the default model in the environmental epidemiology field for modelling multiple exposures.

In terms of the GP element, the distinguishing feature is probably that it treats the matrix of exposure data (rows = individuals, columns = exposures) as points in a multidimensional space and computes Euclidean distances between them. So, for example, say there are 20 individuals with 50 exposures: it uses the fields::rdist() R command to reduce that to a 20 x 20 matrix of distances, which is then modelled via GP regression written in bespoke R code. This distance matrix is recalculated at each MCMC iteration because, as @Theresa_U mentioned, it also does variable selection, which happens via a vector of weights multiplied in before the Euclidean distances are computed. However, this is where my understanding breaks down, because I could not reconcile the literature description of the variable selection method with the bkmr R code on GitHub.
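
If I’m reading Bobb et al. right, the kernel entry for individuals i and j works out to roughly

K_{ij} = \exp\left( -\sum_{m=1}^{M} r_m (z_{im} - z_{jm})^2 \right),

where the r_m \ge 0 are the variable-selection weights, so an exposure with r_m = 0 drops out of the distance entirely.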

@Theresa_U even if you could recreate the full bkmr variable selection in Stan, I don’t know how one would integrate the method with spatial autoregression.

Relevant bkmr references:

https://jenfb.github.io/bkmr/overview.html

BKMR exactly as published cannot be implemented in Stan because the variable selection layer is based on spike-and-slab priors. In principle they could be replaced with a horseshoe prior. Below is my attempt from a year ago. This was my first venture into both BKMR and the horseshoe, so please use it with caution. I didn’t get interesting inferences, which could be due to my data not being informative enough, or bugs, or misunderstandings, or all of the above.

If you end up relying on the authors’ implementation, I recommend that you use it through the bkmrhat package (https://cran.r-project.org/web/packages/bkmrhat/). It runs multiple chains and calculates the usual diagnostics. You may find, as I did, that the authors’ sampler takes an inordinate amount of tuning to get something that mixes passably. Good luck @Theresa_U and @jroon, and please post feedback; my coworkers are going to rely on BKMR a lot in the next few years.

data {
  int n_obs;                      // number of observations
  int n_exp;                      // number of exposures (pollutants)
  int n_cov;                      // number of covariates
  
  vector[n_obs]        y;         // outcome
  matrix[n_obs, n_exp] Z;         // exposure matrix
  matrix[n_obs, n_cov] X;         // covariate matrix
}

transformed data{
  real<lower=0.0> nu_local = 1.0;   // df of the Student-t prior on the local horseshoe scales
  real<lower=0.0> nu_global = 1.0;  // df of the Student-t prior on the global horseshoe scale
}

parameters {
  real<lower=0.0>              sigma;   // residual standard deviation
  real<lower=0.0>              tau;     // kernel scale (Bobb et al.'s tau)
  vector[n_cov]                beta;    // covariate coefficients
  
  // Horseshoe, following the appendix of https://arxiv.org/pdf/1610.05559.pdf
  vector<lower=0.0>[n_exp] z;
  real<lower=0.0> rho1_global;
  real<lower=0.0> rho2_global;
  vector<lower=0.0>[n_exp] rho1_local;
  vector<lower=0.0>[n_exp] rho2_local;
}

transformed parameters {
  // Recovered horseshoe parameters
  real<lower=0.0> tau_hs = rho1_global * sqrt(rho2_global);
  vector<lower=0.0>[n_exp] lambda_hs = rho1_local .* sqrt(rho2_local);
  
  // Sparse coefficients
  vector[n_exp] r = z .* lambda_hs * tau_hs;
}

model {
  matrix[n_obs, n_obs] K;
  
  // Create the covariance matrix.
  // Bobb et al.'s tau is the alpha squared of the Stan user guide notation.
  // The length-scales are absorbed into the r parameters.
  for (i in 1:n_obs){
    K[i, i] = 1.0;
    for (j in 1:n_obs){
      if (i < j){
        K[i, j] = 0.0;
        for (k in 1:n_exp){
          K[i, j] += r[k]*square(Z[i, k] - Z[j, k]);
        }
        K[i, j] = exp(-K[i, j]);
        K[j, i] = K[i, j];
      }
    }
  }
  K = tau * K; 
  K = add_diag(K, square(sigma)); // "Integrate out" the Gaussian process
  
  // Priors
  sigma ~ cauchy(0, 1);
  tau ~ cauchy(0, 1);
  beta ~ std_normal();
  
  // Horseshoe, with student-t priors for local and global parameters.
  z ~ std_normal();
  rho1_local ~ std_normal();
  rho2_local ~ inv_gamma(0.5*nu_local, 0.5*nu_local);
  rho1_global ~ normal(0, 3.0/(n_exp-3.0)/sqrt(n_obs) * sqrt(tau));
  rho2_global ~ inv_gamma(0.5*nu_global, 0.5*nu_global);
  
  // Likelihood
  y ~ multi_normal_cholesky(X*beta, cholesky_decompose(K));
}


Thank you both, @jroon and @huffyhenry for your replies. Sadly, I did not work further on the method because I used a different approach. If I find time to retry it, I will let you know.


Thanks for sharing the code, @huffyhenry; you got further than I did with my previous attempt. The GP code makes sense to me, but I’ll need to read up on the horseshoe. Yes, I’m aware of bkmrhat but share your frustrations re speed. If I figure any more useful things out I’ll post back, @huffyhenry and @Theresa_U!


It’s not just the speed. In my case, I was getting different results from different chains.


Hi @huffyhenry.
Have you thought about visualising the h functions from the model as bkmr can do?

I tried to do so by moving the K matrix to the transformed parameters block:

transformed parameters {
  // Recovered horseshoe parameters -> _hs
  real<lower=0.0> tau_hs = rho1_global * sqrt(rho2_global);
  vector<lower=0.0>[n_exp] lambda_hs = rho1_local .* sqrt(rho2_local);
  
  // Sparse coefficients
  vector[n_exp] r = z .* lambda_hs * tau_hs;
  
  matrix[n_obs, n_obs] K;
  
  // Create the covariance matrix.
  // Bobb et al.'s tau is the alpha squared of the Stan user guide notation. 
  // See tuning parameters section https://jenfb.github.io/bkmr/overview.html
  // The length-scales are absorbed into the r parameters.
  for (i in 1:n_obs){
    K[i, i] = 1.0;  // 1.0 because the diagonal entries skip the loop below and exp(0) = 1
    for (j in 1:n_obs){
      if (i < j){
        K[i, j] = 0.0;
        for (k in 1:n_exp){
          K[i, j] += r[k]*square(Z[i, k] - Z[j, k]);
        }
        K[i, j] = exp(-K[i, j]);
        K[j, i] = K[i, j];
      }
    }
  }
  K = tau * K; // tau controls the overall smoothness of the exposure-response function
  K = add_diag(K, square(sigma)); // "Integrate out" the Gaussian process

}

and adding a generated quantities block as follows:

generated quantities {
   vector[n_exp] h = multi_normal_rng( rep_vector(0, n_exp), K);
}

But I get this error:

I can’t figure out why this happens just from moving the K calculations to transformed parameters.
Any ideas?

This warning is common for this kind of model and is unrelated to moving the K matrix to transformed parameters. You can ignore it, especially if it only pops up at the beginning of sampling. But please be aware that this whole business of replacing spike-and-slab with the horseshoe is completely untested.

Alas, I can’t ignore it. Without the generated quantities block it will eventually fit, as you say. However, when I also try to include the generated quantities, I get the warnings and the model will not fit. For example, I get this:

Yes usage warnings noted!

The other error is because you have a bug in your generated quantities code. You’re trying to create a vector of length n_exp, but K is n_obs x n_obs. I can’t recall offhand what h was in the original BKMR so can’t advise how to fix it.
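
Purely to make the dimensions agree, something like this would at least run, though whether it is the h that bkmr reports I can’t say:

generated quantities {
  // n_obs-dimensional to match K, which is n_obs x n_obs; this is a draw of
  // the latent function at the observed exposures, not necessarily bkmr's
  // definition of h.
  vector[n_obs] h = multi_normal_rng(rep_vector(0, n_obs), K);
}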


Ok, thanks, I’ll take another look. To be honest, I changed from rstan to cmdstanr and the behaviour changed, so I may have confused myself somewhere here. Thanks for the help in any case.

Hi again. Figured out how h is defined:

They have a term lambda defined as follows:

@huffyhenry do you understand why they have defined lambda like this in terms of tau and sigma? I can’t figure this aspect out.

I’m sorry, but bkmr is not fresh enough in my mind to help you properly here. There is a high chance that I’m going to circle back to this topic over the summer, and if so, I’ll try to follow up here.


Actually, looking at the equations above, I made some progress. I realised the likelihood in your code wasn’t right, so I added a parameter for h in the parameters block:

  vector[n_obs] h;

… and respecified the likelihood as follows:

  h ~ multi_normal_cholesky(rep_vector(0, n_obs), cholesky_decompose(K));
  y ~ normal(h + X*beta, sigma);

With some test data generated using bkmr’s SimData function, this model produces answers very similar to bkmr’s. Though when I tried to silence the likelihood and do a prior predictive check, things were way funky, so more work is needed, I think.

As an aside, I’ve noted that this model is similar to the latent variable GP example in the Stan manual (10.3 Fitting a Gaussian process | Stan User’s Guide), and the variable selection of bkmr (or this version of it) is perhaps similar to the automatic relevance determination (ARD) example (10.3 Fitting a Gaussian process | Stan User’s Guide).
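
If I’ve matched the notation correctly, the r_k here play the role of 1/(2\rho_k^2) in the manual’s ARD parameterisation, so a large r_k means a short length-scale for exposure k, and r_k = 0 removes that exposure from the kernel altogether.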

Your modification is right only if you remove \sigma^2 from the diagonal of K. As it is now, you add residual variability twice. Otherwise it seems fine, but h is probably not what you want, since it is the effect of the exposures for every observation. Inferences on a specific fixed grid of predictors should be of more interest, and this could be set up in generated quantities.

You are 100% right about the connection to GP regression and to ARD. The bkmr authors seemingly not recognizing this fully (and especially not making the connection to ARD) was the first warning flag that made me look deeper into this topic. But note that ARD finds non-linear effects of exposures, which is not the same as determining if they matter at all.


I suspected this was wrong but hadn’t given it a lot of thought yet. Thanks!

Yes, I had been thinking about this. It’s a bit fiddly to prepare the grids, but I think building them in R and passing them in for generated quantities is the right way.
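
Something along these lines is what I have in mind (an untested sketch following the gp_pred_rng pattern in the user’s guide; Z_grid and n_grid are new data I would build in R and pass in, and cov_Kzr_cross is a hypothetical helper that computes the same kernel between two exposure matrices):

functions {
  // Hypothetical helper (not in the model above): the same BKMR-style kernel,
  // but between two exposure matrices, scaled by tau.
  matrix cov_Kzr_cross(matrix Z1, matrix Z2, vector r, real tau) {
    matrix[rows(Z1), rows(Z2)] K;
    for (i in 1:rows(Z1)) {
      for (j in 1:rows(Z2)) {
        real d = 0;
        for (k in 1:cols(Z1)) {
          d += r[k] * square(Z1[i, k] - Z2[j, k]);
        }
        K[i, j] = tau * exp(-d);
      }
    }
    return K;
  }
}
data {
  // ... existing data block, plus the prediction grid built in R:
  int n_grid;
  matrix[n_grid, n_exp] Z_grid;
}
generated quantities {
  // Draw of h at the grid points, conditional on the latent h at the
  // observed exposures (the usual GP conditional).
  vector[n_grid] h_grid;
  {
    matrix[n_obs, n_obs] K_xx = add_diag(cov_Kzr_cross(Z, Z, r, tau), 1e-9);
    matrix[n_grid, n_obs] K_gx = cov_Kzr_cross(Z_grid, Z, r, tau);
    matrix[n_grid, n_grid] K_gg = cov_Kzr_cross(Z_grid, Z_grid, r, tau);
    vector[n_grid] mu_grid = K_gx * mdivide_left_spd(K_xx, h);
    matrix[n_grid, n_grid] cov_grid = K_gg - K_gx * mdivide_left_spd(K_xx, K_gx');
    cov_grid = 0.5 * (cov_grid + cov_grid');  // symmetrise against round-off
    h_grid = multi_normal_rng(mu_grid, add_diag(cov_grid, 1e-9));
  }
}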

So I’ve been trying to tame degeneracies in the model and I’m kind of stuck. To do this I made a simpler model without the variable selection, and even with only a single length-scale parameter shared across all exposure variables. This is the simplified model:

functions {
    matrix cov_Kzr(matrix Z, real r, real tau, real delta) {
        
        int n_obs = dims(Z)[1];
        int n_exp = dims(Z)[2];
    
        matrix[n_obs, n_obs] K;
        
        for (i in 1:n_obs){
            K[i, i] = 1.0;  // 1.0 because the diagonal entries skip the loop below and exp(0) = 1
            for (j in 1:n_obs){
                if (i < j){
                    K[i, j] = 0.0;
                    for (k in 1:n_exp){
                        //K[i, j] += square( Z[i, k] - Z[j, k] )/ r[k];
                        K[i, j] += square( Z[i, k] - Z[j, k] )/ r;

                    }
                    K[i, j] = exp(-K[i, j]);
                    K[j, i] = K[i, j];
                }
            }
        }
        K = tau * K; // tau controls the overall smoothness of the exposure-response function
        K = add_diag(K, delta); // add delta (a small jitter here) to the diagonal for numerical stability
        return K;
    }
    
}

data {
    int n_obs;                     // number of observations
    int n_exp;                     // number of exposures
    int n_cov;                     // number of covariates
    
    vector[n_obs] y;               // outcome
    matrix[n_obs, n_exp] Z;        // exposure matrix
    matrix[n_obs, n_cov] X;        // covariate matrix
}

parameters {
    real<lower=0.0> sigma;
    real<lower=0.0> tau;
    vector[n_cov] beta;
    
    real<lower=0.0> r;

    // latent gp
    vector[n_obs] h;
}


model {

    // Priors
    sigma ~ normal(0, 1);
    tau ~ normal(0, 0.5);
    beta ~ std_normal();
    r ~ inv_gamma(4, 0.2);

    // Likelihood
    h ~ multi_normal_cholesky(rep_vector(0, n_obs), cholesky_decompose(cov_Kzr(Z, r, tau, 1e-6)));
    y ~ normal(h + X*beta, sigma);
}

Even with a pretty restrictive inverse-gamma prior on r to try to prevent length scales that are too long or too short, it doesn’t seem to help much. This is the pairs plot:

Lots of ugliness going on here. I’ve tried putting a few more restrictive priors on r and tau but can’t really make much of an improvement. @huffyhenry, any thoughts here? Perhaps the model is simply cursed 😳

If your data make sense and you’re not getting \hat{R} or divergence warnings, then the fit just might be fine. The one thing I don’t understand is why the posterior of \tau is not more concentrated near 0, where lp__ is largest. Having a single r corresponds to the assumption that the exposures affect the outcome in the same way, and it may lead to problems if it is not true; note that r has large draws that ought to be suppressed by the prior. Again, there is no need to estimate h if you have a normal observational model. Integrating it out like in the user guide or my code above should make debugging easier.
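
Concretely, for your simplified model the marginalised version would be something like this (untested; vector[n_obs] h disappears from the parameters block, and the residual variance goes in through the delta argument of your cov_Kzr):

model {
  // Priors as before
  sigma ~ normal(0, 1);
  tau ~ normal(0, 0.5);
  beta ~ std_normal();
  r ~ inv_gamma(4, 0.2);

  // Marginal likelihood with h integrated out:
  // y ~ N(X * beta, tau * K_Z + sigma^2 * I),
  // so the residual variance sits on the diagonal of the kernel
  // instead of in a separate normal likelihood.
  y ~ multi_normal_cholesky(X * beta,
                            cholesky_decompose(cov_Kzr(Z, r, tau, square(sigma))));
}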