MCMC non-convergence when fitting a latent Gaussian process regression

I am trying to fit the following latent Gaussian process regression. Admittedly, there are a couple of complications: the response data I am working with is presence-absence data, and the classes are imbalanced. I wrote the following model, pieced together from the Stan manual. It is a multivariate probit model with a latent Gaussian process on the beta coefficients, using a squared exponential kernel.

functions
{
  int sum2d(int[,] a) {
    int s = 0;
    for (i in 1:size(a))
      s += sum(a[i]);
    return s;
  }
}

data
{
  int<lower=1> N; // Number of observations
  int<lower=1> J; // Number of dependent variables
  int<lower=0> K; // Number of independent variables
  matrix[N,K] x; // independent variables
  int<lower=0, upper = 1> y[N, J]; // dependent variables
  matrix[J,J] L_dist; // beta distance matrix
  int<lower=1> U; // upper limit for truncated inverse gamma
  real<lower=0> a; // inverse gamma hyper parameter
  real<lower=0> b; // inverse gamma hyper parameter
}

transformed data
{
  int<lower=0> N_pos;
  int<lower=1, upper=N> n_pos[sum2d(y)];
  int<lower=1, upper=J> d_pos[size(n_pos)];
  int<lower=0> N_neg;
  int<lower=1, upper=N> n_neg[(N * J) - size(n_pos)];
  int<lower=1, upper=J> d_neg[size(n_neg)];

  N_pos = size(n_pos);
  N_neg = size(n_neg);
  {
    int i = 1;
    int j = 1;
    for (n in 1:N) {
      for (d in 1:J) {
        if (y[n, d] == 1) {
          n_pos[i] = n;
          d_pos[i] = d;
          i += 1;
        } else {
          n_neg[j] = n;
          d_neg[j] = d;
          j += 1;
        }
      }
    }
  }
}

parameters
{
  matrix[K, J] beta;
  vector<lower=0>[N_pos] z_pos;
  vector<upper=0>[N_neg] z_neg;
  vector<lower=0>[J] y_devs;
  vector<lower=0>[J] st_devs;
  real<lower=0> rho;
  real<lower=0> alpha;
  vector[J] beta_mu; // beta centers
}

transformed parameters
{
  vector[J] z[N];
  for (n in 1:N_pos)
    z[n_pos[n], d_pos[n]] = z_pos[n];
  for (n in 1:N_neg)
    z[n_neg[n], d_neg[n]] = z_neg[n];
}

model
{

  matrix[J, J] L_K;
  matrix[J, J] cov;
  matrix[N, J] xbeta = x * beta;

  for (i in 1:(J - 1)) {
    L_K[i, i] = 1 + 0.1; // for numerical stability
    for (j in (i + 1):J) {
      L_K[i, j] = alpha^2 * exp(-0.5 * (L_dist[i, j])^2 / rho^2);
      L_K[j, i] = L_K[i, j];
    }
  }

  for (n in 1:J)
    L_K[n, n] = 1 + 0.0001;

  cov = cholesky_decompose(L_K);

  rho ~ inv_gamma(a, 0.5) T[,U];
  alpha ~ normal(0, 1);

  for (k in 1:K)
    beta[k] ~ multi_normal_cholesky(beta_mu, diag_pre_multiply(st_devs, cov));

  st_devs ~ cauchy(0, 2.5);
  y_devs ~ cauchy(0, 2.5);
  beta_mu ~ normal(0, 1);

  for (n in 1:J)
    z[n] ~ normal(xbeta[n], y_devs);
}

More on the data: the dimensions of the response data vary from [1000, 4] to [10,000, 14]. If I consolidated the whole response data instead of running it in chunks (as I currently do), it would be on the order of [20,000,000, 265]. The reason I run it in chunks is that it takes a very long time, e.g. up to 3 days on a high-performance cluster using 8 CPUs and up to 100 GB of memory. Ideally it would all run as one large model, but that is a secondary concern. My primary issue is that the chains don't converge: I regularly get R-hats between 1.01 and 2, with bulk and tail ESS below 100. There appears to be very high autocorrelation, and the chains get stuck and don't explore the full parameter space. Here are my MCMC parameters:

stan_specs:
iter: 100000 # number of iterations to run stan model for
warmup: 40000 # burnin
chains: 4 # number of chains for stan model
thin: 100 # thinning - this should be less than number of iterations
delta: 0.9
stepsize: 1 # step size - raising might help stuck chains overcome their block
stepsize_jitter: 0
cores: 1 # number of cores to run on (most efficient if cores = chains), keep 1 for hpc

Any help would be much appreciated! Thank you in advance

  • In general, Gaussian processes with a covariance matrix and direct Cholesky-based computation, combined with MCMC, are too slow for data dimensions over 10,000.
  • Due to how Stan's autodiff is implemented for matrices, its performance is even worse: for Gaussian processes with a covariance matrix and direct Cholesky-based computation combined with MCMC, it is probably too slow already at data dimension 1,000.
  • In general, MCMC is slow for latent GPs with explicit latent values, which makes the feasible dimensionality even lower.
  • For GPs to scale to dimension [20,000,000, 265], specialized GP software with specialized approximate computation is needed. You can't get this to work with Stan on any high-performance cluster.
  • For GPs with J*N latent values, you want to marginalize them out with something faster than MCMC; the usual options are Laplace, variational inference, and expectation propagation algorithms specialized for GPs (i.e. you can't use Stan's variational inference for this).
  • I'm not able to follow your model, but I think it has errors. There are also some ways to speed up the covariance matrix computation, but that doesn't really matter, as I don't think you are going to be able to use Stan for data this big anyway.
  • I think I should add a paragraph to the Stan user guide discussing the limits of GPs in Stan.

Hi, thank you for your reply. Yes, I found that the model could not handle [20,000,000, 265] dimensions, so I have been running it at [5,000, 10]; it runs fine but still doesn't converge. Should I remove the Cholesky-based computation from my model?

I’ve written my model out in mathematical notation. Please let me know if you still think there are errors in the code!

I appreciate your input! Thank you in advance!

@avehtari What special GP software do you recommend nowadays?

Also, I was thinking that discrete-time latent AR(p) processes are pretty cheap to compute in Stan, since each likelihood evaluation only depends on the previous p values, and for p = 3 or 4 they can get pretty smooth. I know Matern kernels basically give you continuous-time analogues of AR(p) processes. Are you aware of anything analogous for continuous-time GPs, where the probability of a single point is conditionally independent of all other points given the p previous points, or the first p derivatives at the previous point? The first page of appendix B in the Rasmussen book mentions something about how the first p derivatives in a continuous-time GP are somehow analogous to the p previous points in a discrete-time AR(p).
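
For concreteness, this is roughly the kind of cheap latent AR(p) prior I have in mind (just a sketch with placeholder names N_t, f, phi, and sigma_ar, not part of the model above):

data {
  int<lower=1> N_t; // number of time points
  int<lower=1> p;   // AR order
}
parameters {
  vector[N_t] f;          // latent AR(p) process
  vector[p] phi;          // AR coefficients (phi[1] multiplies the oldest lag)
  real<lower=0> sigma_ar; // innovation scale
}
model {
  f[1:p] ~ normal(0, 5); // vague prior on the first p values
  for (n in (p + 1):N_t)
    f[n] ~ normal(dot_product(phi, f[(n - p):(n - 1)]), sigma_ar);
  // each term depends only on the previous p values,
  // so one likelihood evaluation is O(N_t * p)
}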

It’s more clear now. With this formulation there are huge dependencies between beta_mu, beta, and z and it’s known that all known MCMC methods are slow which explains your convergence issues.

You can make the posterior easier by analytically integrating out beta_mu and beta. As beta_mu has a normal prior and beta has a MVN prior, you can integrate out beta_mu, so that you get

\beta \sim \mathrm{MVN}(0, \Sigma + \Sigma_\mu)

where \Sigma_\mu is a matrix with all values equal to 1 (the variance from the normal(0, 1) prior on beta_mu). As beta has a MVN prior, xbeta also has a MVN prior, and you can write

z \sim \mathrm{MVN}(0, x(\Sigma+\Sigma_\mu)x^T+I\epsilon^2)

where I is an identity matrix (it has been quite a long time since I regularly worked with combinations of covariance matrices, so there is a small probability of an error here). Because of the positivity constraints on z you can't marginalize z out, but you can write the MVN in Stan as

z ~ multi_normal_cholesky(0, L_K);

where L_K is the Cholesky of the above-mentioned covariance matrix.

I only now realized that it's likely that N > J. In that case integrating out beta_mu still makes sense, but integrating out beta would make the covariance matrix N x N, which would make the Cholesky much slower. Even integrating out beta_mu alone would probably help a bit.
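
As a rough sketch (the names Sigma_beta, Sigma_mu, and L_marg are just illustrative, and the jitter value is arbitrary), integrating out only beta_mu in the model block of the original post might look something like this, where L_K is the J x J kernel covariance before its Cholesky, as in that code:

  // sketch: marginalize beta_mu out of the prior on the rows of beta
  matrix[J, J] Sigma_beta = quad_form_diag(L_K, st_devs);  // diag(st_devs) * L_K * diag(st_devs)
  matrix[J, J] Sigma_mu = rep_matrix(1.0, J, J);           // all-ones: variance of the shared normal(0, 1) mean
  matrix[J, J] L_marg = cholesky_decompose(add_diag(Sigma_beta + Sigma_mu, 1e-6));

  for (k in 1:K)
    beta[k] ~ multi_normal_cholesky(rep_vector(0, J), L_marg);
  // beta_mu and its prior can then be dropped from the model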

The target being multivariate makes this more complicated, which may also affect how much you can integrate out.

I haven’t been following closely the development, but it seems like some of the old ones are not actively developed, and new ones keep popping up, and no single software implements all possible features, so there is no generic recommendation, but you need to know well what features you want and then check if any of the packages have them all.

Yes. See "Kalman filtering and smoothing solutions to temporal Gaussian process regression models".
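
If I remember the Matern \nu = 3/2 case correctly, the idea is that a GP with magnitude \sigma and lengthscale \ell is equivalent to a linear SDE on the state \mathbf{f}(t) = (f(t), f'(t)),

d\mathbf{f}(t) = \begin{pmatrix} 0 & 1 \\ -\lambda^2 & -2\lambda \end{pmatrix} \mathbf{f}(t)\, dt + \begin{pmatrix} 0 \\ 1 \end{pmatrix} dw(t), \qquad \lambda = \frac{\sqrt{3}}{\ell},

with white-noise spectral density q = 4\lambda^3\sigma^2, so conditioning on the current value and first derivative makes the future independent of the past, which is the continuous-time counterpart of an AR(2) process, and Kalman filtering then gives linear-in-time inference.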
