Nystrom approximation slows down Gaussian process regression?

djhshih · May 23, 2017, 12:13am

I implemented a Gaussian process model in Stan for modelling the difference of nonlinear data from two groups, for which only data from one group is observed at each time point.

The model should ideally run on data sets of 500 to 1000 data points. Currently, drawing 500 HMC samples takes more than a day, so I am hoping to improve the speed…

I did some benchmarking on simulated data (number of data points J = 100). Surprisingly, the Nyström approximation was slower than the original implementation…

The Nyström method approximates the kernel matrix K \in R^{J \times J} using a subset of the rows and columns of K:

\tilde{K} = C G^{-1} C^T

where C \in R^{J \times M} contains a subset of M < J columns of K, and G \in R^{M \times M} is the corresponding subset of M rows and M columns of K.

Time complexity of the original algorithm for evaluating a Gaussian process sample f is O(J^3) due to Cholesky decomposition.
The Nyström approximation should have time complexity O(J M^2), which is linear in the number of data points.
Of course, sampling complicates matters…
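
To make the definition above concrete, here is a minimal pure-Python sketch (not the Stan implementation from this thread) that builds \tilde{K} = C G^{-1} C^T for a small squared-exponential kernel. The kernel, the length scale of 0.3, the evenly spaced inputs, the hand-picked indices, and the helper names rbf_kernel, solve and nystrom are all assumptions made for illustration. It only shows the linear algebra and says nothing about how autodiff or HMC behave.

```python
import math

def rbf_kernel(x, lengthscale=0.3):
    """Squared-exponential kernel matrix (an assumed kernel, for illustration)."""
    return [[math.exp(-0.5 * ((a - b) / lengthscale) ** 2) for b in x] for a in x]

def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def nystrom(K, idx):
    """Return C G^{-1} C^T, where C = K[:, idx] and G = K[idx, idx]."""
    J, M = len(K), len(idx)
    C = [[K[i][j] for j in idx] for i in range(J)]        # J x M
    G = [[K[i][j] for j in idx] for i in idx]             # M x M
    Ct = [[C[i][m] for i in range(J)] for m in range(M)]  # M x J
    X = solve(G, Ct)                                      # G^{-1} C^T, M x J
    return [[sum(C[i][m] * X[m][j] for m in range(M)) for j in range(J)]
            for i in range(J)]

x = [i / 9 for i in range(10)]   # J = 10 evenly spaced inputs (assumed)
K = rbf_kernel(x)
idx = [0, 3, 6, 9]               # M = 4 hand-picked 0-based indices (assumed)
K_tilde = nystrom(K, idx)

# Largest absolute difference between K and its approximation.
max_err = max(abs(K[i][j] - K_tilde[i][j]) for i in range(10) for j in range(10))
print(max_err)
```

The Nyström approximation reproduces the selected rows and columns of K exactly (up to rounding), so any error is confined to entries where neither index is in the subset. max_err reports the largest such error for this toy setup.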

Implementation is available at:

The original implementation (triggered when M >= J) obtains 500 HMC samples in 30 seconds.
The Nyström implementation (M = 0.2 J) obtains 500 HMC samples in 3 minutes.
I ran 4 chains each, at least 4 times…

Does subsetting regressors for Gaussian process regression not play well with Hamiltonian Monte Carlo?

Or did I screw up the implementation of Nyström reproduced below?

// time complexity is O(J M^2) due to matrix multiplication for B
vector approximate_f(matrix K, vector f_eta, vector u, int J, int M) {
	// subset index
	int idx[M];

	// inverse of subset kernel matrix
	matrix[M, M] W;

	idx = sort_indices_asc(u)[1:M];

	// approximate K as \tilde{K} = C W C^T 
	// where
	//   C \in R^{J \times M} is subset of columns
	//   G \in R^{M \times M} is subset of columns and rows
	//   W = G^{-1}
	W = inverse_spd(K[idx, idx]);

	// define B \in R^{J \times M} s.t. \tilde{K} = B B^T
	//   B = C * cholesky_decompose(W)
	// then
	//   f = B * f_eta
	return K[, idx] * (cholesky_decompose(W) * f_eta);
}
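
One possible reformulation of the function above, offered as a sketch rather than a tested fix: the explicit inverse_spd call followed by a second Cholesky factorization could be replaced by a single Cholesky factor of G plus a triangular solve. The symbol L_G is notation introduced here, not something from the original code, and whether this changes the run time under autodiff has not been benchmarked.

```latex
\begin{aligned}
G &= L_G L_G^\top, \qquad B = C L_G^{-\top}, \\
B B^\top &= C L_G^{-\top} L_G^{-1} C^\top
          = C \left( L_G L_G^\top \right)^{-1} C^\top
          = C G^{-1} C^\top = \tilde{K}, \\
f &= B f_\eta = C \left( L_G^{-\top} f_\eta \right).
\end{aligned}
```

This B differs from C * cholesky_decompose(W) only by an orthogonal factor, so for a given f_eta the value of f changes, but if f_eta has a standard normal prior (an assumption, since the prior is not shown in the snippet) the implied covariance of f is \tilde{K} in both versions.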

Environmental information:

R version 3.3.1
rstan_2.14.2

Andre_Pfeuffer · May 23, 2017, 6:40am

When using the Nyström approximation for GPs in Stan, we have to consider that Stan calculates the derivatives, and this is where the bottleneck remains. To use Nyström efficiently, one would have to bypass Stan’s autodiff by using a C++ extension.

Andre

betanalpha · May 23, 2017, 7:07pm

Firstly, the biggest obstruction to good performance with GPs is often the choice of hyperpriors. You have to be very careful with how the kernel hyperparameters are regularized or you’ll get a bad (and slow) fit regardless of which approximation you use. We are actively working on documentation for this.

Moreover, seeing an approximation perform more slowly is not all that unexpected. Approximations can spoil the richness of the statistical model which then induces misfit that causes computational problems. Always heed the Folk Theorem!

Andre_Pfeuffer · May 24, 2017, 5:34am

The randomization of your indices consumes most of your speed:
// dummy variable for subsampling
vector<lower=0,upper=1>[J] u;
idx = sort_indices_asc(u)[1:M];

If you pick your indices outside of the sampling process, you’ll notice a speed-up.
All in all, I’m not sure this kind of random picking is possible in Stan. It reminds me of integer sampling.

Andre

djhshih · May 24, 2017, 11:42pm

@Andre_Pfeuffer

This was the only way to subsample that I could figure out… after some searching, I could not find an alternative way…

This step should be O(n log n) due to sorting … how did you figure out this code segment was the bottleneck?

djhshih · May 25, 2017, 12:11am

Hmm… I can’t seem to edit my original post any more, so I will post a follow-up reply.

Stan just completed the sampling for 500 HMC samples on the real data ($J = 500$)… The chains are nowhere near convergence (Rhat is on the order of 1e13 ~ 1e14, effective sample sizes are ~2, and the separate chains have very different samples).

For 500 HMC samples (insufficient)…
The original algorithm took ~20 hours.
The Nyström version with a 20% subset took ~40 hours.
The Nyström version with a 10% subset took ~10 hours.

Comparing these run times to my previous results on fewer data points, I guess the linear time complexity for Nyström has a huge constant for reasons described above…

The real data is split into 30 independent parts… even if I were to make the blue-sky assumption that the chains will converge after 1000 HMC samples in total… the wait time estimates for a single pass of the analysis are:

Original algorithm: 20 hours * 2 * 30 = 50 days
Nyström 20% subset: 40 hours * 2 * 30 = 100 days
Nyström 10% subset: 10 hours * 2 * 30 = 25 days

Actual run time would probably be much, much worse…

While waiting for sampling to finish, I derived the gradients and implemented a basic coordinate ascent to get MAP estimates for the same Gaussian process regression model.

For 500 data points, coordinate ascent wait time: < 1 second.

And the MAP estimates look reasonable. However, for this derivation, I assumed that the hyperparameters are fixed to make the math easier… I suppose I could also tune them by cross-validation… which isn’t very Bayesian or satisfying.

What optimizations could be made to the Gaussian process regression model so that it can be sampled in a more reasonable time frame in Stan?

mike-lawrence · May 25, 2017, 12:24am

Check out my code here. That example doesn’t have the missing data bit, but should work amidst missing data just fine.

mike-lawrence · May 25, 2017, 12:47am

Now that I look at your code more thoroughly, I don’t think any of the tricks I use in the gist I linked will help; they’re more for dealing with repeated measurements per point on the dimension on which the GPs are being fit. But I do see a few avenues for better model specification in your code. First, you don’t need the mu parameter; just pre-scale the data to have mean=0. You should also look at the GP section of the wiki page on recommended priors, where you’ll see that the inverse-lengthscale parameterization of the GP is suggested (admittedly, by me). If x is scaled to 0-1, I find a cauchy(0,10) prior on the inverse-lengthscale effective (esp. if you use the tan trick from the “reparameterizing the cauchy” section of the manual). I also find that scaling the data (y) to have an sd of 1 and using weibull(2,1) priors on both the GP amplitude (alpha in your code) and the measurement noise scale (sigma in your model) is effective. All that said, even with those modifications you’ll probably still need up to a day to sample 500 points. Hopefully the GPU-based cholesky computations arrive soon now that the NIPS deadline has passed ( :
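
For reference, the tan trick mentioned above can be sketched outside Stan. The snippet below is plain Python, not model code, and assumes the inverse-lengthscale is constrained to be positive so that cauchy(0,10) acts as a half-Cauchy; the function names are made up for illustration. It maps a uniform variable on (0, pi/2) through scale * tan(u) and compares the result against the half-Cauchy CDF.

```python
import math

def half_cauchy_from_uniform(u, scale=10.0):
    """Map u in (0, pi/2) to a half-Cauchy(0, scale) value via the tan transform."""
    return scale * math.tan(u)

def half_cauchy_cdf(x, scale=10.0):
    """CDF of a half-Cauchy(0, scale) distribution for x >= 0."""
    return (2.0 / math.pi) * math.atan(x / scale)

# Evenly chosen probabilities p correspond to u = p * pi / 2 on (0, pi/2).
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
quantiles = [half_cauchy_from_uniform(p * math.pi / 2) for p in probs]

for p, q in zip(probs, quantiles):
    # The CDF evaluated at the transformed value should recover p.
    print(p, q, half_cauchy_cdf(q))
```

The sampler then works with u, which lives on a bounded interval, instead of directly with the heavy-tailed parameter.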

mike-lawrence · May 25, 2017, 1:19am

You might also be experiencing poor sampling because your model is misspecified. That is, as constructed, there are two latent functions, one for each group, but you’re only modelling a single latent function and a constant coefficient associated with groups. Put another way, the data are generated with both a non-linear intercept function and a non-linear group-difference function, but your model only does inference on the group-difference function, thereby imposing an assumption that the intercept function is 0, which you know it’s not. The code I linked for the gp_regression_example shows a design-general way of fitting situations like this.

Andre_Pfeuffer · May 25, 2017, 1:46am

@djhshih

Doesn’t the whole thing require the Sherman–Morrison–Woodbury formula? Can we ignore the jitter?

> This step should be O(n log n) due to sorting … how did you figure out this code segment was the bottleneck?

This is not a rigorous mathematical statement: Stan makes use of derivatives. When you sample indices, the values are “similar”, but the derivatives may become discontinuous. One of the Stan developers would have to verify this.

I changed your model and put the sampling outside of Stan:
R:
data$idx <- sample(1:data$J, data$M)
data:
int<lower=1, upper=J> idx[M];
model:
//vector<lower=0,upper=1>[J] u;
vector approximate_f(matrix K, vector f_eta, int[] idx, int J, int M, matrix P) {
…
}

@mike-lawrence

Consider the covariance function of a GP:
exp(-rho * Sigma_sq + log_shift)
which is usually coded as exp(log_shift) * exp(-rho * Sigma_sq).
When plotting rho against log_shift, I found some GPs where the two are correlated.
Thus I modeled them with a multivariate normal distribution; one
could go further and use a copula between them.
Agreed: GPU support is what we need.

Andre

Bob_Carpenter · May 26, 2017, 4:16am

Sorting is unlikely to be a bottleneck—it’s indeed O(N log N) as we just use the C++ sort function and there’s no autodiff overhead.

That’s a very clever approach to data subsampling. The lower and upper bounds are critical here. Not something we really intended to let slip through, because if you subsample the data, you’re not going to converge. Michael also wrote an arXiv paper on why it’s a bad idea in general to subsample for HMC.

The problem one normally runs into with this kind of hack to get discrete sampling is that there is no information flow from idx back to u (it’s like a “cut” in BUGS). But here you don’t care about that info flow, you really want random sampling, which this should do.
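
A small plain-Python illustration of why no information flows from idx back to u. The helper below only mimics Stan's sort_indices_asc with 1-based indices, and the values of u are made up. The selected indices are a piecewise-constant function of u, so a small change in u either leaves idx untouched or swaps an index abruptly.

```python
def sort_indices_asc(u):
    """1-based indices that sort u in ascending order (mimics the Stan function)."""
    return [i + 1 for i in sorted(range(len(u)), key=lambda i: u[i])]

u = [0.42, 0.10, 0.77, 0.43, 0.95]   # made-up values of the dummy variable
M = 2
idx = sort_indices_asc(u)[:M]

# A small nudge that keeps the ordering: the subset does not change,
# so the output carries no gradient with respect to u.
u_small = list(u)
u_small[0] += 0.005
idx_small = sort_indices_asc(u_small)[:M]

# A slightly larger nudge that crosses u[3]: the subset jumps discontinuously.
u_cross = list(u)
u_cross[0] += 0.02
idx_cross = sort_indices_asc(u_cross)[:M]

print(idx, idx_small, idx_cross)   # [2, 1] [2, 1] [2, 4]
```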

Andre_Pfeuffer · May 26, 2017, 6:17am

If we include the borders, we still get divergent transitions.

idx[1] = 1;
idx[2:(M-1)] = sort_indices_asc(u)[1:(M-2)] ;
idx[M] = J;

Andre
