# Help speeding up bernoulli Gaussian process model

Hi all,
I’m fitting a bernoulli Gaussian process model for species distribution modeling, and it’s quite slow. I was wondering if anyone had any tips to help speed it up.

Essentially, I have species occurrences (presence only) downloaded from GBIF. For each species there’s probably 200-500 presences (1’s). To generate ‘pseudo-absences’, I randomly sample 1000 background points from the map (i.e. North America) and define them all as absences (0’s). So I have somewhere between 1200-2000 points for fitting a bernoulli GPM.

I’ve followed the code in the STAN manual, and have my model as:

``````data{
int<lower=1> N;
row_vector[2] x[N];
int<lower=0,upper=1> y[N];
}
transformed data{
real delta = 1e-9;
}
parameters{
real<lower=0> rho;
real<lower=0> alpha;
real a;
vector[N] eta ;
}
transformed parameters{
vector[N] f;
{
matrix[N,N] L_K;
matrix[N,N] K = cov_exp_quad(x, alpha, rho);
for(n in 1:N){
K[n,n] = K[n,n] + delta ;
}
L_K = cholesky_decompose(K);
f = L_K * eta;
}
}
model{
eta ~ normal(0,1);
rho ~ inv_gamma(5, 5);
alpha ~ normal(0,1);
a ~ normal(0,1);

y ~ bernoulli_logit(a+f);
}
``````

It works fine, the area-under-curves are good, etc. But it is extremely slow:

``````Gradient evaluation took 0.910985 seconds
1000 transitions using 10 leapfrog steps per transition would take 9109.85 seconds.
``````

I don’t remember it being this slow when I first ran the models a few years ago (I put this project aside and came back to it). I tried to speed it up by first optimizing the model, then using the optimized parameters as the starting values, but I’m not sure that helps too much. Does anyone have any advice for speeding this one up?

Thanks

FWIW, here’s the optimizer:

``````Initial log joint probability = -1736.59
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
Exception: cholesky_decompose: Matrix m is not positive definite  (in 'unknown file name' at line 22)

19       -704.81      0.492961       41.2802           1           1       28
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
39      -526.872      0.789057       15.5561      0.5514      0.5514       52                                                                                                  [150/527]
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
59      -482.263      0.117319       11.0234           1           1       73
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
79      -467.499     0.0949339       17.2224           1           1       93
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
99      -463.204      0.176057       7.71779           1           1      114
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
119      -459.926      0.280257       8.86097           1           1      138
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
139      -451.633      0.368367       22.6551           1           1      160
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
159      -432.905      0.120548       18.6509      0.4404      0.4404      181
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
179      -413.768      0.179617       22.0664      0.4095           1      204
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
199      -396.866      0.150389       33.9249      0.1559           1      227
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
219      -374.362      0.675426       31.9265           1           1      248

239      -353.261      0.278901       22.4105           1           1      269                                                                                                  [130/527]
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
259       -334.69      0.139741       27.4685      0.7923      0.7923      290
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
279      -315.248      0.111221        19.985           1           1      311
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
299      -295.522      0.192555       32.7448       0.377           1      334
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
319       -280.72     0.0874414       10.7889           1           1      355
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
339      -271.754     0.0669797       6.88682           1           1      375
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
359       -269.35      0.061617       2.52318           1           1      396
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
379      -268.969     0.0416614       1.23482           1           1      420
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
399      -268.936    0.00309279      0.364431           1           1      442
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
419      -268.932    0.00455218       0.22627           1           1      464
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
439      -268.931   0.000199942     0.0384264      0.5265      0.5265      488
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
452      -268.931   0.000113569     0.0131722      0.3661           1      502
``````

New update.
I moved the entire transformed parameters section into the model block, and that speed things up so much. The iterations have been cut down by an order of magnitude, and the leapfrog estimates are half the time:

``````Initial log joint probability = -1839.18
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
19      -698.413      0.829057       13.1064       1.274      0.1274       23
Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
39      -697.594    0.00118772    0.00106796      0.7247      0.7247       50
Optimization terminated normally:
Convergence detected: relative gradient magnitude is below tolerance

1000 transitions using 10 leapfrog steps per transition would take 4361.95 seconds.
``````

How is it that it’s so much faster inside the model block than inside transformed parameters?

2 Likes

I wouldn’t expect there to be that dramatic a speed-up actually, but there might be a reason. @rok_cesnovar do you know if variables declared in the `model` block transpile to `double` or `var` types?

As for the rest of your model, you have some other room for optimising the code. A couple of small things first:

You can simplify your construction of `K` by using the `add_diag` function instead of a loop:

``````matrix[N,N] K = add_diag(cov_exp_quad(x, alpha, rho), delta);
``````

For your priors, you should use `std_normal()` rather than `normal(0,1)`.

For some bigger things, you should look into using the `cmdstanR` package for two reasons: GPU acceleration and within-chain parallelisation. You’ll be able to offload cholesky decomposition and multiplication of`L_K` and `eta` onto the GPU, which should cut the time down. You can then also look into using the `reduce_sum` framework for parallelising the `bernoulli_logit` call, more information here: https://mc-stan.org/docs/2_25/stan-users-guide/reduce-sum.html

1 Like

Thanks for the tips! One issue I bumped into is that I’m using PyStan, which hasn’t been updated since 2.19.1.1. So I don’t have access to the latest StanMath, which includes add_diag, etc.

No worries, in that case you’ll probably want to look at `cmdstanPy`: https://pypi.org/project/cmdstanpy/, which uses the latest version of Stan

1 Like

Another potential explanation is that variables defined in the model block don’t get written to the csv, so if you’re dealing with lots and a slow write medium, then this can make a big difference.

Nice! That’s new to me even as a GP-enthusiast.

Oh yeah, good call with the writing to disk, that could definitely have an impact.

@Nathan_Lemoine After looking at your model again, I’ve realised you can optimise your model even further by using the `bernoulli_logit_glm` distribution. If we look at your likelihood:

``````y ~ bernoulli_logit(a+f);
``````

and note that this is the same as:

``````y ~ bernoulli_logit(a + L_K * eta);
``````

This means that you can treat the model as GLM where `a` is the intercept, `L_K` is the matrix of covariates, `eta` is the vector of coefficients:

``````  y ~ bernoulli_logit_glm(L_K, a, eta);
``````

This `glm` distribution is more optimised and efficient than constructing the multiplication of `L_K * eta` separately. Additionally, as of Stan 2.25 (available through `cmdstanPy`) this can be evaluated on the GPU, providing an even bigger speed-up.

Combining that with my previous suggestions, your model could look like:

``````data{
int<lower=1> N;
row_vector[2] x[N];
int<lower=0,upper=1> y[N];
}
transformed data{
real delta = 1e-9;
}
parameters{
real<lower=0> rho;
real<lower=0> alpha;
real a;
vector[N] eta;
}

model{
matrix[N,N] L_K;

eta ~ std_normal();
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
a ~ std_normal();

delta));

y ~ bernoulli_logit_glm(L_K, a, eta);
}
``````

Let me know if you have any questions about that

5 Likes

Huh, so I guess in fact most GPs should be using the `_glm()` functions, eh? For example, with a gaussian-noise outcome you’d do ` y ~ normal_id_glm(f,0,z,sigma)` where `z` is the `std_normal()` parameter used to enable the cholesky representation of the latent GP `f` and `sigma` is the measurement noise term. Super cool. I wonder what kinds of speedup this has? It’s funny how I consider myself pretty attentive to developments in the language but still manage to miss stuff like this!

Haha it moves pretty fast, not always easy to keep up (for me as well!).

Yeah, in general the `_glm` functions will be more efficient in their construction of both the likelihood and its gradients. But in the past they could only be offloaded onto the GPU when both `x` and `y` were data, not parameters, which limited their applicability to something like this.

Thanks to Tadej’s consistently excellent work, as of 2.25 this restriction was removed so now any input can be a parameter. This means there should be some pretty good speed-ups available for GPs now

1 Like

Thanks all!! I’ll give these a try and see what happens.

The trick is that I need the variable f written to disk since I’m saving it as the posterior prediction for mapping later on, so I don’t know if I can move it. Unless putting it in generated quantities would somehow be faster? Not sure.

Give it a try, I’d be curious to see what your result is. Since the samples are written to file once, it shouldn’t actually make a difference. But in retrospect, my intuition is actually that `f` isn’t nearly large enough to make your likely storage medium speeds a factor, so I’m wondering if maybe you have a RAM bottleneck or something such that keeping the variable around during the main sampling procedure makes everything slower? Vars in the GQ get computed once at the end, once the sample is accepted, while TPs stick around in memory as the algorithm does its HMC stuff (and I think there’s a bunch of “copies” from different points along the trajectory).

I’ve tried using the reduce sum framework and running a chain in parallel. I saw no speedup:

``````Chain 1 -   done: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [43:25<00:00, 52.11s/it]
Model optimization (Bernoulli Logit) took 94.10 seconds
Model optimization took 1.57 minutes
Model sampling (Bernoulli Logit) took 2614.59 seconds
Model sampling took 43.58 minutes
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
Chain 1 -   done: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [44:36<00:00, 53.53s/it]
Model optimization (Bernoulli Logit Reduce Sum) took 107.16 seconds
Model optimization took 1.79 minutes
Model sampling (Bernoulli Logit Reduce Sum) took 2703.18 seconds
Model sampling took 45.05 minutes
``````

I’ve attached the minimal working examples here. Is my reduce-sum function not correct?

X_train.csv (63.8 KB) y_train.csv (31.6 KB) model_optimization.py (2.1 KB) gpm_model_bernoullilogit_reducesum.stan (776 Bytes) gpm_model_bernoullilogit.stan (437 Bytes)

One of the tricky things with reduce sum is that only the first argument `y_slice` gets split into chunks and passed to the separate threads. The other parameters get copied in full. While this is fine for parameters with only a couple of elements, when the parameters are the same size as the outcome the copy cost can outweigh the speed-up.

What you can try is computing `eta * L_k` separately (I.e. `f`) and have that to be the parameter to be sliced, and `y` to be copied. Copying `int` parameters is pretty cheap

1 Like

Hi @andrjohns, I’m working on trying to slice f and running into some trouble:

``````functions {
real partial_sum(vector f,
int start, int end,
int[] y,
vector eta,
matrix x) {
f = x[start:end]*eta[start:end];
return bernoulli_logit_lpmf(y[start:end] | f);
}
}
data{
int<lower=1> N;
row_vector[2] x[N];
int<lower=0,upper=1> y[N];
}
transformed data{
real delta = 1e-9;
}
parameters{
real<lower=0> rho;
real<lower=0> alpha;
real a;
vector[N] eta;
}
model{
int grainsize = 250;

target += reduce_sum(partial_sum,
grainsize,
y,
eta, L_K);

eta ~ std_normal();
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
a ~ std_normal();
}
Semantic error in '/home/lemoinelab2/Documents/server-share/nate/Phenology_Species_Distributions/Code/model_optimization/gpm_model_bernoullilogit_reducesum_f.stan', line 7, column 30 to col
umn 62:
-------------------------------------------------
5:                     vector eta,
6:                     matrix x) {
7:                                f = x[start:end]*eta[start:end];
^
8:                                return bernoulli_logit_lpmf(y[start:end] | f);
9:                                }
-------------------------------------------------

Cannot assign to function argument or loop identifier 'f'.
``````

Also, does x in the function need to be x[start:end,start:end]? I’m a little confused how the matrix indexing works.

One thing I have tried: I hadn’t standardized (N(0,1)) the predictors. I changed that so that both predictors are now N(0,1) and now the reduce sum gives me a huge speedup. The original model takes about 60 minutes to sample, and the reduce sum model does it in 12. So standardization appears to have sped things up in parallel, but not when run singly.

The issue is that the `partial_sum` function is expecting `f` to be passed externally, but you’re computing it internally. Additionally, the arguments are out of order. Try the following:

``````functions {
real partial_sum(real[] f_slice, int start, int end, int[] y) {
return bernoulli_logit_lpmf(y[start:end] | to_vector(f_slice));
}
}
data{
int<lower=1> N;
row_vector[2] x[N];
int<lower=0,upper=1> y[N];
}
transformed data{
real delta = 1e-9;
}
parameters{
real<lower=0> rho;
real<lower=0> alpha;
real a;
vector[N] eta;
}
model{
int grainsize = 1;

eta ~ std_normal();
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
a ~ std_normal();

target += reduce_sum(partial_sum, to_array_1d(L_K*eta), grainsize, y);
}
``````

An issue is that `reduce_sum` can’t natively slice over vector inputs, they need to be arrays. Consequently, we need to convert between array and vector types for the call (@wds15 is there an alternative here?).

Also, if you have access to a GPU I’d strongly recommend getting that setup, as it could accelerate both the cholesky decomposition and matrix multiplication. It would also be worth comparing the speed of the `_glm` function

No alternative here.

I very much doubt that `reduce_sum` is of any use here. The big thing to wrestle in this model right now is very likely the GP bits. I would

1. switch to the glm sampling statement thing (even if not using a GPU)
2. find ways of how to simplify the GP thing (binning, some approximation)
1 Like

I’ve done some more testing. I compared the base model, the base model using reduce_sum, and a reduce_sum model using the _glm function. The code for the first two is above, here is the code for the _glm model:

``````functions {
real partial_sum(int[] y_slice,
int start, int end,
vector eta,
real a,
matrix x) {
return bernoulli_logit_glm_lpmf(y_slice | x[start:end,start:end], a, eta[start:end]);
}
}
data{
int<lower=1> N;
row_vector[2] x[N];
int<lower=0,upper=1> y[N];
}
transformed data{
real delta = 1e-9;
}
parameters{
real<lower=0> rho;
real<lower=0> alpha;
real a;
vector[N] eta;
}
model{
int grainsize = 250;

target += reduce_sum(partial_sum, y,
grainsize,
eta, a, L_K);

eta ~ std_normal();
rho ~ inv_gamma(5, 5);
alpha ~ std_normal();
a ~ std_normal();
}
``````

And here are the timings (this is only one run, but they are fairly consistent).

Model optimization (Bernoulli Logit) took 171.93 seconds or 2.87 minutes
Model sampling (Bernoulli Logit) took 1176.22 seconds or 19.60 minutes

Model optimization (Bernoulli Logit Reduce Sum) took 17.96 seconds or 0.30 minutes
Model sampling (Bernoulli Logit Reduce Sum) took 1136.27 seconds or 18.94 minutes

Model optimization (Bernoulli Logit Reduce Sum (GLM)) took 34.73 second or 0.58 minutes
Model sampling (Bernoulli Logit Reduce Sum (GLM)) took 1487.41 seconds or 24.79 minutes

My conclusions from all this are: reduce some helped by only when the predictors had been standardized (probably because the priors were too far off for non-standardized variables). The _glm function doesn’t seem to be too different from the non _glm function, unless I did that wrong?

Another caveat is I tried the f slice code above, and it took so long that it never even finished one warmup iteration after several hours.

Just a guess, but you could try adjusting the grainsize again for the GLM model, maybe 250 is just too small.

So 2D GP? In that case you might be interested also in basis function approximation https://arxiv.org/abs/2004.11408 with Stan code at https://github.com/gabriuma/basis_functions_approach_to_GP

This is clever!

1 Like

Another caveat is I tried the f slice code above, and it took so long that it never even finished one warmup iteration after several hours

Yeah, once I had to copy between vector and real array types, I didn’t have a lot of hope.

Can you try the GLM function without reduce_sum, but using GPU acceleration?

There is some additional configuration needed to use the GPU. The setup guide for OpenCL is here: http://mc-stan.org/math/opencl_support.html

To the enable GPU acceleration with `cmdstanPy`, you then need to add `"STAN_OPENCL": True` to the `cpp_options` in your `CmdStanModel` section.

@mitzimorris does the user also need to specify `OPENCL_DEVICE_ID` and `OPENCL_PLATFORM_ID` in `cmdstanPy` or does this get set to some default automatically?

I don’t know if my server has GPU capabilities. Here’s the video info:

``````sudo lshw -C display
*-display
description: VGA compatible controller
product: ASPEED Graphics Family
vendor: ASPEED Technology, Inc.
physical id: 0
bus info: pci@0000:02:00.0
version: 41
width: 32 bits
clock: 33MHz
capabilities: pm msi vga_controller cap_list rom
configuration: driver=ast latency=0
resources: irq:19 memory:91000000-91ffffff memory:92000000-9201ffff ioport:2000(size=128) memory:c0000-dffff``````