Approximate GPs with Spectral Stuff

Just replace sigma with sigma[i], and put a horseshoe on the jump size if you want a process with jumps (this is type-G).

Jump times form a point process, which is a high-dimensional discrete parameter, so you’ve got to sum it out somehow if you want that sort of process.

Replying from the fuuuuuutuuuuure … :)

Anyway, I tried to make sense of https://github.com/bbbales2/basketball/blob/master/models/rff_bernoulli.stan in light of the papers “Random Features for Large-Scale Kernel Machines” and “On the Error of Random Fourier Features,” and I think there may be a mistake.

I noticed that the factor scale is given as sqrt(2.0 / N), where N is the length of the input data. Judging from the paper “On the Error of Random Fourier Features”, scale should be 1/sqrt(K), where K is the number of draws of omega. Notice that in equation (1) of that paper, the variable D is twice the number of draws K, since the subscript for omega there only goes to D/2, so sqrt(2/D) corresponds to 1/sqrt(K).
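To check the 1/sqrt(K) claim numerically, here is a small numpy sketch (my own illustration, not from the repo; names and values are made up): with K frequency draws from the spectral density of a squared-exponential kernel, the cos/sin features with the 1/sqrt(K) scaling reproduce the kernel value.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5000   # number of frequency draws omega_1 .. omega_K
bw = 1.0   # bandwidth (inverse length scale) of the squared-exponential kernel

# For k(x, y) = exp(-bw^2 * (x - y)^2 / 2) the spectral density is N(0, bw^2),
# so the omegas are drawn from a normal with sd = bw.
omega = rng.normal(0.0, bw, size=K)

def z(x):
    # Feature map with the 1/sqrt(K) scaling, so that z(x) . z(y) ~ k(x, y).
    return np.concatenate([np.cos(omega * x), np.sin(omega * x)]) / np.sqrt(K)

x, y = 0.3, 1.1
approx = z(x) @ z(y)
exact = np.exp(-bw**2 * (x - y) ** 2 / 2)
assert abs(approx - exact) < 0.05   # Monte Carlo error is O(1/sqrt(K))
```

With sqrt(2/N) instead (N the data size), the dot product would be off by a constant factor whenever N differs from 2K.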

Also, I’m still unsure how fhat in your Stan file is an approximate sample from a Gaussian process. It looks like fhat[i] is the dot product of z(x[i]*bw) and w = [beta1[1] beta2[1] ... beta1[K] beta2[K]], where z(…) is the vector from Algorithm 1 of “Random Features for Large-Scale Kernel Machines” and beta1 and beta2 are from your Stan file. Judging from the 4th or 5th paragraph of the paper “Random Features for Large-Scale Kernel Machines”, that dot product is supposed to be an approximation to a draw from a Gaussian process (I think?), but the paper seems to gloss over how that dot product and the draw relate. Maybe I’m just unfamiliar enough with Gaussian processes to not get the connection.

Just vaguely looking at the paper, I think you’re right on the scaling. I just stole this from @flaxter’s repo linked at the top of the file, so I think we probably both need to patch things. The parameters would probably just accommodate this scaling and it wouldn’t matter too much, but I think you’re right.

For the fhat, the things to stare at are Theorem 1 in the paper + Algorithm 1 + eq. https://wikimedia.org/api/rest_v1/media/math/render/svg/2ab83236f578bfbe15cf763fe25ab338c7af8aa5 on https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_space#Integral_operators_and_Mercer.27s_theorem.

So the thing is, we want to build a Mercer expansion of our kernel (the Wikipedia link). Instead of building the kernel itself, we build a square root of it, which the paper calls a feature map (Algorithm 1: we’re building z, where z * z’ = k, instead of k itself). When you do things this way, the code looks like we’re fitting stuff with basis functions instead of doing any GP thing (there are no Choleskys to compute, since we directly construct the square root). The last thing is that we have implicitly decided that whatever kernel we’re going to work with is stationary, i.e. its covariance looks like k(x - z), since we’re writing it as a Fourier expansion (Theorem 1).
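A minimal numpy sketch of this reading (my own illustration, assuming a squared-exponential kernel): the feature matrix Z acts as a square root of the kernel matrix, so Z @ w with iid standard-normal weights is an approximate GP draw, since its covariance is Z Zᵀ ≈ k.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 1000   # number of frequency draws
bw = 1.0
omega = rng.normal(0.0, bw, size=K)
x = np.array([0.0, 0.5, 2.0])   # a few input locations

# Feature matrix Z: row i is z(x[i]). Z @ Z.T approximates the kernel matrix,
# so Z is a direct square root of k and no Cholesky is ever computed.
Z = np.hstack([np.cos(np.outer(x, omega)),
               np.sin(np.outer(x, omega))]) / np.sqrt(K)

# "fhat = Z @ w" with iid standard-normal weights (the beta1/beta2 of the Stan
# model) then has covariance Z @ Z.T ~ k(x_i - x_j): an approximate GP draw.
w = rng.standard_normal((2 * K, 5000))
f = Z @ w                         # 5000 approximate GP draws at the 3 inputs

emp_cov = np.cov(f)
true_cov = np.exp(-bw**2 * (x[:, None] - x[None, :]) ** 2 / 2)
assert np.allclose(Z @ Z.T, true_cov, atol=0.1)
assert np.allclose(emp_cov, Z @ Z.T, atol=0.1)
```

This is why the Stan code looks like plain basis-function regression: the GP-ness lives entirely in where the omegas come from and in the standard-normal priors on the weights.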

I think this is the way to interpret things at least… Haha

I ended up getting quite a bit out of the Fasshauer/McCourt Matlab book @flaxter linked too way back at the beginning of this. Took a while though. There’s a lotta stuff.

Btw, I think it’d be fine to make a new thread with this question and just link back to old stuff. This one is starting to get really long so it’d be hard for newcomers to the forums to make heads or tails of what is happening in it.

Hi, digging up this old thread. I’ve been working with these to do some Poisson time series modeling for some gamma-ray data. In general, it all works fine and I’ve got a reduce_sum version that really speeds things up when there are a lot of observations:

real partial_log_like_bw_multi_scale(int[] counts_slice, int start, int end,
                                     vector time, vector exposure,
                                     row_vector omega1, row_vector omega2,
                                     vector beta1, vector beta2, real dt,
                                     real bkg, real scale1, real scale2,
                                     real amplitude, int k) {

  int N = size(counts_slice);

  vector[N] time_slice = time[start:end] - dt;
  vector[N] expected_counts_log;
  vector[N] expected_counts;

  // precompute the time x frequency products once per bandwidth
  matrix[N, k] tw1 = time_slice * omega1;
  matrix[N, k] tw2 = time_slice * omega2;

  // random Fourier feature expansion, one set of features per bandwidth,
  // each with its own scale
  expected_counts_log = scale1 * (cos(tw1) * beta1 + sin(tw1) * beta2)
                        + scale2 * (cos(tw2) * beta1 + sin(tw2) * beta2);

  // exp(.) keeps the signal positive; bkg is a constant background rate
  expected_counts = exposure[start:end] .* (exp(expected_counts_log) * amplitude + bkg);

  return poisson_lpmf(counts_slice | expected_counts);
}
You can ignore some of the specific things here for the problem I’m working on (dt is a time delay in the signal, and amplitude/bkg are just values for scaling around the observations). Also, the exp(…) is there to keep the signal positive.

This can be fed to reduce_sum for slicing over x, y (time/counts), and the performance improves. Here I am doing the non-stationary version of the kernel, and instead of sampling omega I’m fitting for it, which really stabilizes things.

Basically there is a bandwidth (inverse length scale) parameter that is sampled for and it sets the distribution for omega:

parameters {
  vector[k] beta1; // the amplitude along the cos basis
  vector[k] beta2; // the amplitude along the sin basis

  row_vector[k] omega_var[2];
  vector[2] log_scale;
  ordered[2] log_bw;
}

transformed parameters {
  vector[2] scale = exp(log_scale) * inv_sqrt(k);
  vector[2] bw = exp(log_bw);

  row_vector[k] omega[2];

  // non-centered parameterization
  omega[1] = omega_var[1] * bw[1];
  omega[2] = omega_var[2] * bw[2];
}


model {

  // priors

  beta1 ~ std_normal();
  beta2 ~ std_normal();

  log_scale ~ std_normal();

  log_bw ~ normal(0, .5);

 
  omega_var[1] ~ std_normal();
  omega_var[2] ~ std_normal();

  target += reduce_sum(partial_log_like_bw_multi_scale, counts[1], grainsize[1],
                       time[1], exposure[1],
                       omega[1], omega[2], beta1, beta2,
                       0., bkg, scale[1], scale[2], 1., k);

}

You can see I’ve non-centered omega, since it forms a one-way normal. As I said, this is pretty fast and works well for what I’m trying to do… but there are divergences everywhere, not concentrated in any particular region, and I can’t seem to figure out what is causing them. Originally it was the tension between omega and bw, but that is dealt with. Setting the step size via adapt_delta helps some… but there must be another issue. Perhaps the priors for the length scale are too wide, but I’ve not been able to correct for that.

Nevertheless, I wanted to share this parallel version if anyone is interested. It is possible there are further ways to speed it up.


@martinmodrak posted a guide on how to deal with divergences… have a look at that.

On a different note (not related): you should reformulate the log-likelihood to use poisson_log_lpmf instead, i.e. keep your predictor in log space (and just offset it by log_exposure). Moreover, you can drop the normalisation constants with poisson_log_propto_lpmf if I am not mistaken.


Yes, I’ve had a look at the guide. It’s very nice.

So I’m aware of poisson_log, but I have to add a term to the signal, so I would need to do a loop and use log_sum_exp. Wouldn’t this add a lot of computation vs. the vectorized version? Or is exp much more expensive?
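For reference, the sum in question is the log-sum-exp identity log(a*exp(s) + b) = logaddexp(s + log a, log b), applied elementwise. A numpy sketch (my own, with made-up values) checking it against the linear-space version used in the partial likelihood above:

```python
import numpy as np

rng = np.random.default_rng(2)

f = rng.normal(size=5)           # latent log-signal at a few time bins
exposure = np.full(5, 2.0)
amplitude, bkg = 3.0, 0.1        # signal amplitude and constant background

# Linear-space version, as in the Stan function above:
expected = exposure * (amplitude * np.exp(f) + bkg)

# Log-space version, suitable for a poisson_log_lpmf-style likelihood.
# In Stan this is log_sum_exp applied per element, hence the loop.
log_expected = np.log(exposure) + np.logaddexp(f + np.log(amplitude),
                                               np.log(bkg))

assert np.allclose(np.exp(log_expected), expected)
```

numpy vectorizes logaddexp for free; in Stan the elementwise log_sum_exp is what forces the loop being discussed here.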

Ah… having to use a loop is not nice. I would always try the log formulation to get numerically more stable inference, but it may actually slow things down. However, it sounds like you are not having issues in this area of the model, so you should probably put your time budget somewhere else (I just personally have a preference for the log formulation, due to empirical experience).

No, I agree. I’ve been frustrated with this, as there is no nice way to sum things in Poisson space. Though it may be only a tiny hit thanks to the speed of the reduce_sum. I will investigate and report back.

Yup… the multi-core brute force of reduce_sum may balance the additional cost here. In any case, use the _propto_ statements (poisson_propto_lpmf or poisson_log_propto_lpmf) for the log-likelihood if you do not need proper normalisation.

For the other problem with divergences: pairs plots, diagnostics, check the mass matrix the warmup arrived at; the bayesplot package has more vignettes on this, …


I was not aware of this. So this is the exposed part of the “~” syntax?

The “~” syntax leads to non-normalised densities, but you cannot use it in the partial-sum function needed for reduce_sum. In the partial-sum function you have to use the _propto_ versions to avoid the normalisation (which you really want to avoid for the Poisson, since normalising requires gamma function calls).
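To see why dropping the normalisation is safe, here is a small Python sketch (my own illustration): the log(y!) term that the _propto_ version drops is a constant in the rate parameter, so the log density only shifts by a constant, which MCMC never sees.

```python
import numpy as np
from math import lgamma

def poisson_lpmf(y, lam):
    # Full Poisson log-pmf, including the log(y!) normalisation term.
    return y * np.log(lam) - lam - lgamma(y + 1)

def poisson_propto_lpmf(y, lam):
    # Unnormalised version: drops log(y!), which does not depend on lam.
    return y * np.log(lam) - lam

y = 7
# The difference between the two is the same constant for every lambda,
# so sampling (which only needs the log density up to a constant) is unaffected.
diffs = [poisson_lpmf(y, lam) - poisson_propto_lpmf(y, lam)
         for lam in (0.5, 2.0, 10.0)]
assert np.allclose(diffs, -lgamma(y + 1))
```

In a likelihood over many observations those dropped lgamma calls add up, which is where the speedup comes from.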


@wds15 I have tried the poisson_log_propto_lpmf approach with a loop and log_sum_exp. There is a minimal performance hit. Nothing changed re: divergences, but the chains seem to bounce around a bit less, which is likely a result of better numerical stability. I will continue to investigate.

Since we’re in this thread: here is a recent paper on approximate GPs using basis functions: https://arxiv.org/abs/2004.11408 . Since you’re interested in GP basis functions, you’ll probably find something interesting in there.

The lengthscale parameter in a GP can do funnel-like things (it can give you divergences because the curvature changes). I think in GPs the length scale and the output scale (the parameter that determines the magnitude of the output) can be non-identified, which could also lead to divergences. Are you estimating both?


Bit off-topic: there seem to be three ways of incrementing the log probability, i.e.

  1. target +=
  2. target += with the _propto_ versions
  3. ~

so are 2. and 3. equivalent? Is it always better to use the _propto_ versions when writing target += statements, if there is no specific reason to need the actual log probabilities? There was a thread somewhere where people could not figure out why target += and ~ led to slightly different inferences, and I think the conclusion was different rounding errors. Do all of these versions add the invisible Jacobian term the same way when the log probability depends on transformed parameters?

Also, target += with the _propto_ versions isn’t supported, right @wds15?

I think we have plans to formally do something here in the future, but I don’t think this is an intentional feature.


Yes, this is an unsupported feature (also not documented) that we were close to replacing (https://github.com/stan-dev/stanc3/pull/562) in the last release, and it should be gone in the next one. So definitely do not rely on it.


Oops… I did not know that it’s not officially supported… so I will stop writing about it…


Uh oh… I was wondering why I had not seen that. So… should I remove _propto_ from my reduce_sum code?

It works fine, it’s just not supported, meaning that it was never released and will change without deprecation warnings and future support.

Barring any surprises, the next release will officially support foo_lupdf instead of the unsupported foo_propto_lpdf. So you can use it in the sense that inference is not broken if you use _propto_, but be prepared to rewrite once the next release is out.


The relevant thread for that change is here: Request for final feedback: User controlled unnormalized (propto) distribution syntax

It’s a complicated feature that takes time to get right.
