Nystrom approximation slows down Gaussian process regression?

I implemented a Gaussian process model in Stan for modelling the difference of nonlinear data from two groups, for which only data from one group is observed at each time point.

The model should ideally run on data set of 500~1000 data points. Currently, the sampling of 500 HMC samples takes more than a day, so I am hoping to improve the speed…

I did some benchmarking on simulated data (number of data points J = 100). Surprisingly, Nystr"om approximation was slower than the original implementation…

The Nystr"om method approximates the kernel matrix K \in R^{J times J} using a subset of the rows and columns of K:

\tilde{K} = C G^{-1} C

where C \in R^{J \times M} contains a subset of the columns M < J from K, and G \in R^{M \times M} is a subset of M rows and M columns from K.

Time complexity of the original algorithm for evaluating a Gaussian process sample f is O(J^3) due to Cholesky decomposition.
Nystr"om approximation should have time complexity O(J M^2), which is linear in the number of data points.
Of course, sampling complicates matter…

Implementation is available at:

The original implementation (triggered when M >= J) obtains 500 HMC samples in 30 seconds.
The Nystr"om implementation (M = 0.2 J) obtains 500 HMC samples in 3 minutes.
I ran 4 chains each, at least 4 times…

Does subsetting regressors for Gaussian process regression not play well with Hamiltonian Monte Carlo?

Or did I screw up the implementation of Nystr"om reproduced below?

// time complexity is O(J M^2) due to matrix multiplication for B
vector approximate_f(matrix K, vector f_eta, vector u, int J, int M) {
	// subset index
	int idx[M];

	// inverse of subset kernel matrix
	matrix[M, M] W;

	idx = sort_indices_asc(u)[1:M];

	// approximate K as \tilde{K} = C W C^T 
	// where
	//   C \in R^{J \times M} is subset of columns
	//   G \in R^{M \times M} is subset of columns and rows
	//   W = G^{-1}
	W = inverse_spd(K[idx, idx]);

	// define B \in R^{J \times M} s.t. \tilde{K} = B B^T
	//   B = C * cholesky_decompose(W)
	// then
	//   f = B * f_eta
	return K[, idx] * (cholesky_decompose(W) * f_eta);
}

Environmental information:

R version 3.3.1
rstan_2.14.2

When using Nystrom’s algorithm for GP in Stan we have to consider that Stan calculates the derivatives and this is where the bottleneck remains. To efficiently use Nystrom one would have to by-pass Stans autodiff by using an C++ extension.

Andre

1 Like

Firstly often the biggest obstruction to good performance with GPs is the choice of hyperpriors. You have to be very careful with how the kernel hyperparameters are regularized or you’ll get a bad (and slow) fit regardless of which approximation you use. We are actively working on documentation for this.

Moreover, seeing an approximation perform more slowly is not all that unexpected. Approximations can spoil the richness of the statistical model which then induces misfit that causes computational problems. Always heed the Folk Theorem!

The randomization of your indices consumes most of your speed:
// dummy variable for subsampling vector<lower=0,upper=1>[J] u; idx = sort_indices_asc(u)[1:M];

If you wisely pick your indices out of the sampling process, you’ll notice a speed upgrade.
All in all I’m not sure this randomly picking is possible in Stan. It reminds me to integer
sampling.

Andre

@Andre_Pfeuffer

This was the only way to subsample that I could figure out… after some searching, I could not find an alternative way…

This step should be O(n log n) due to sorting … how did you figure out this code segment was the bottleneck?

Hmm… I can’t seem to edit my original post any more, so I will post a follow-up reply.

Stan just completed the sampling for 500 HMC samples on the real data ($J = 500$)… The chains are no where near convergence (Rhat is on the order of 1e13 ~ 1e14, effective sample sizes are ~2, and the separate chains have very different samples.)

For 500 HMC samples (insufficient)…
The original algorithm took ~20 hours.
The Nystr"om version with a 20% subset took ~40 hours.
The Nystr"om version with a 10% subset took ~10 hours.

Comparing these run times to my previous results on fewer data points, I guess the linear time complexity for Nystr"om has a huge constant for reasons described above…

The real data is split into 30 independent parts… even if I were to make the blue-sky assumption that the chains will converge after 1000 HMC samples in total… the wait time estimates for a single pass of the analysis are:

Original algorithm: 20 hours * 2 * 30 = 50 days
Nystr"om 20% subset: 100 days
Nystrom"om 10% subset: 25 days

Actual run time would probably be much, much worse…

While waiting for sampling to finish, I derived the gradients and implemented a basic coordinate ascent to get MAP estimates for the same Gaussian process regression model.

For 500 data points, coordinate ascent wait time: < 1 second.

And the MAP estimates look reasonable. However, for this derivation, I assumed that the hyperparameters are fixed to make the math easier… I suppose I could also tune them by cross-validation… which isn’t very Bayesian or satisfying.

What optimizations could be made to the Gaussian process regression model so that it can be sampled in a more reasonable time frame in Stan?

Check out my code here. That example doesn’t have the missing data bit, but should work amidst missing data just fine.

Now that I look at your code more thoroughly, I don’t think any of the tricks I use in the gist I linked will help, they’re more for dealing with repeated measurement per point on the dimension on which the GPs are being fit. But, I do see a few avenues for better model specification in your code. First, you don’t need the mu parameter; just pre-scale the data to have mean=0. You should also look at the GP section of the wiki page on recommended priors, where you’ll see that the inverse-lengthscale parameterization of the GP is suggested (admittedly, by me). If x is scaled to 0-1, I find a cauchy(0,10) prior on inverse-lengthscale effective (esp. if you use the tan trick from the “reparameterizing the cauchy” section of the manual). I also find scaling the data (y) to have an sd of 1 and using weibul(2,1) priors on both the GP amplitude (alpha in your code) and measurement noise scale (sigma in your model) parameters is effective. All that said, even with those modifications you’ll probably be up to a day to sample 500 points. Hopefully the GPU-based cholesky computations arrive soon now that the NIPS deadline has passed ( :

You might also be experiencing poor sampling because your model is misspecified. That is, as constructed, there are two latent functions, one for each group, but you’re only modelling a single latent function and a constant coefficient associated with groups. Put another way, the data are generated with both a non-linear intercept function and a non-linear group-difference function, but your model only does inference on the group-difference function, thereby imposing an assumption that the intercept function is 0, which you know it’s not. The code I linked for the gp_regression_example shows a design-general way of fitting situations like this.

1 Like

@djhshih

Isn’t the whole not require the Sherman–Morrison–Woodbury formula? Can we ignore the jitter?

This step should be O(n log n) due to sorting … how did you figure out this code segment was the bottleneck?

It is not a rigorous math statement know: Stan makes use of derivatives. When you sampling indices,
the values are “similar”, but the derivatives may become noncontinuous. One of the Stan developers
have to verify this.

I changed your model and put the sampling outside of Stan,
R:
data$idx <- sample(1:data$J, data$M)
data:
int<lower=1, upper=J> idx[M];
model:
//vector<lower=0,upper=1>[J] u;
vector approximate_f(matrix K, vector f_eta, int[] idx, int J, int M, matrix P) {

}

@mike-lawrence

Consider the cov-function of a GP:
exp(-rho * Sigma_sq + log_shift)
is usually coded as exp(log_shift) * exp(-rho * Sigma_sq)
If we plot(rho, log_shift) I found some GP where they are in correlated.
Thus I modeled them as Multivariate Normaldistribution, one
could go further and use a copula between them.
Agree: GPU support is we need.

Andre

Sorting is unlikely to be a bottleneck—it’s indeed O(N log N) as we just use the C++ sort function and there’s no autodiff overhead.

That’s a very clever approach to data subsampling. The lower and upper bounds are critical here. Not something we really intended to let slip through, because if you subsample the data, you’re not going to converge. Michael also wrote an arXiv paper on why it’s a bad idea in general to subsample for HMC.

The problem one normally runs into with this kind of hack to get discrete sampling is that there is no information flow from idx back to u (it’s like a “cut” in BUGS). But here you don’t care about that info flow, you really want random sampling, which this should do.

If we include the borders, we still get divergent transitions.

idx[1] = 1;
idx[2:(M-1)] = sort_indices_asc(u)[1:(M-2)] ;
idx[M] = J;

Andre