I don’t see any obvious routes to speedups. Presumably you’ve already tried on a GPU?
One thing that's unlikely to help much, but worth a try if you're curious, is to precompute the set of unique differences in X. On its own this saves only a small amount of compute during sampling, but if the set of unique differences is much smaller than the total number of pairs (e.g. on a regular grid), the savings could be more substantial. In my tests long ago the speedup from this didn't match simply using cov_exp_quad, but since you're not using cov_exp_quad yourself, it might come in handy.
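To make the idea concrete, here's a minimal NumPy sketch of the precomputation, assuming 1-D inputs and a squared-exponential kernel (the grid, hyperparameter values, and variable names are all illustrative, not your model):

```python
import numpy as np

# Illustrative 1-D input locations: a regular grid, where many
# pairwise differences coincide.
x = np.arange(10.0)

# All pairwise differences, flattened, plus the reduced set of
# unique values and the index map back into the full matrix.
diffs = x[:, None] - x[None, :]
unique_diffs, inverse = np.unique(diffs.ravel(), return_inverse=True)

# Evaluate the kernel only on the unique differences, then scatter
# the results back into the full covariance matrix.
alpha, rho = 1.0, 2.0  # illustrative hyperparameters
k_unique = alpha**2 * np.exp(-0.5 * (unique_diffs / rho) ** 2)
K = k_unique[inverse].reshape(diffs.shape)

# On a regular grid of n points there are only 2n - 1 unique
# differences instead of n**2.
print(unique_diffs.size)  # 19 here, vs. 100 total pairs
```

The index map (`inverse` here) is the part worth precomputing in the transformed data block, since it depends only on X; during sampling you then evaluate the kernel on the small unique set each iteration.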