Help speeding up bernoulli Gaussian process model

Hi all,
I’m fitting a bernoulli Gaussian process model for species distribution modeling, and it’s quite slow. I was wondering if anyone had any tips to help speed it up.

Essentially, I have species occurrences (presence only) downloaded from GBIF. For each species there’s probably 200-500 presences (1’s). To generate ‘pseudo-absences’, I randomly sample 1000 background points from the map (i.e. North America) and define them all as absences (0’s). So I have somewhere between 1200-2000 points for fitting a bernoulli GPM.

I’ve followed the code in the STAN manual, and have my model as:

Please share your Stan program and accompanying data if possible.


data{
	int<lower=1> N;
	row_vector[2] x[N];
	int<lower=0,upper=1> y[N];
}
transformed data{
	real delta = 1e-9;
}
parameters{
	real<lower=0> rho;
	real<lower=0> alpha;
	real a;
	vector[N] eta ;
}
transformed parameters{
	vector[N] f;
	{
		matrix[N,N] L_K;
		matrix[N,N] K = cov_exp_quad(x, alpha, rho);
		for(n in 1:N){
			K[n,n] = K[n,n] + delta ;
		}
		L_K = cholesky_decompose(K);
		f = L_K * eta;
	}
}
model{
	eta ~ normal(0,1);
	rho ~ inv_gamma(5, 5);
	alpha ~ normal(0,1);
	a ~ normal(0,1);

	y ~ bernoulli_logit(a+f);
}

It works fine, the area-under-curves are good, etc. But it is extremely slow:

Gradient evaluation took 0.910985 seconds                                                                                                                                                     
1000 transitions using 10 leapfrog steps per transition would take 9109.85 seconds.                                                                                                           
Adjust your expectations accordingly!  

I don’t remember it being this slow when I first ran the models a few years ago (I put this project aside and came back to it). I tried to speed it up by first optimizing the model, then using the optimized parameters as the starting values, but I’m not sure that helps too much. Does anyone have any advice for speeding this one up?

Thanks

FWIW, here’s the optimizer:

Initial log joint probability = -1736.59                                                                                                                                                      
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes                                                                                                    
Exception: cholesky_decompose: Matrix m is not positive definite  (in 'unknown file name' at line 22)                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                              
      19       -704.81      0.492961       41.2802           1           1       28                                                                                                           
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes             
      39      -526.872      0.789057       15.5561      0.5514      0.5514       52                                                                                                  [150/527]
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes                                                                                                    
      59      -482.263      0.117319       11.0234           1           1       73                                                                                                           
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes                                                                                                    
      79      -467.499     0.0949339       17.2224           1           1       93                                                                                                           
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes                                                                                                    
      99      -463.204      0.176057       7.71779           1           1      114                                                                                                           
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes                                                                                                    
     119      -459.926      0.280257       8.86097           1           1      138                                                                                                           
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes                                                                                                    
     139      -451.633      0.368367       22.6551           1           1      160                                                                                                           
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes                                                                                                    
     159      -432.905      0.120548       18.6509      0.4404      0.4404      181                                                                                                           
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes                                                                                                    
     179      -413.768      0.179617       22.0664      0.4095           1      204
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     199      -396.866      0.150389       33.9249      0.1559           1      227
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     219      -374.362      0.675426       31.9265           1           1      248

     239      -353.261      0.278901       22.4105           1           1      269                                                                                                  [130/527]
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     259       -334.69      0.139741       27.4685      0.7923      0.7923      290
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     279      -315.248      0.111221        19.985           1           1      311
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     299      -295.522      0.192555       32.7448       0.377           1      334
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     319       -280.72     0.0874414       10.7889           1           1      355
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     339      -271.754     0.0669797       6.88682           1           1      375
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     359       -269.35      0.061617       2.52318           1           1      396
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     379      -268.969     0.0416614       1.23482           1           1      420
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     399      -268.936    0.00309279      0.364431           1           1      442
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     419      -268.932    0.00455218       0.22627           1           1      464
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     439      -268.931   0.000199942     0.0384264      0.5265      0.5265      488
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes
     452      -268.931   0.000113569     0.0131722      0.3661           1      502

New update.
I moved the entire transformed parameters section into the model block, and that speed things up so much. The iterations have been cut down by an order of magnitude, and the leapfrog estimates are half the time:

Initial log joint probability = -1839.18
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      19      -698.413      0.829057       13.1064       1.274      0.1274       23   
    Iter      log prob        ||dx||      ||grad||       alpha      alpha0  # evals  Notes 
      39      -697.594    0.00118772    0.00106796      0.7247      0.7247       50   
Optimization terminated normally: 
  Convergence detected: relative gradient magnitude is below tolerance

Gradient evaluation took 0.436195 seconds
1000 transitions using 10 leapfrog steps per transition would take 4361.95 seconds.
Adjust your expectations accordingly!

How is it that it’s so much faster inside the model block than inside transformed parameters?

1 Like

I wouldn’t expect there to be that dramatic a speed-up actually, but there might be a reason. @rok_cesnovar do you know if variables declared in the model block transpile to double or var types?

As for the rest of your model, you have some other room for optimising the code. A couple of small things first:

You can simplify your construction of K by using the add_diag function instead of a loop:

matrix[N,N] K = add_diag(cov_exp_quad(x, alpha, rho), delta);

For your priors, you should use std_normal() rather than normal(0,1).

For some bigger things, you should look into using the cmdstanR package for two reasons: GPU acceleration and within-chain parallelisation. You’ll be able to offload cholesky decomposition and multiplication ofL_K and eta onto the GPU, which should cut the time down. You can then also look into using the reduce_sum framework for parallelising the bernoulli_logit call, more information here: https://mc-stan.org/docs/2_25/stan-users-guide/reduce-sum.html

1 Like

Thanks for the tips! One issue I bumped into is that I’m using PyStan, which hasn’t been updated since 2.19.1.1. So I don’t have access to the latest StanMath, which includes add_diag, etc.

No worries, in that case you’ll probably want to look at cmdstanPy: https://pypi.org/project/cmdstanpy/, which uses the latest version of Stan

Another potential explanation is that variables defined in the model block don’t get written to the csv, so if you’re dealing with lots and a slow write medium, then this can make a big difference.

Nice! That’s new to me even as a GP-enthusiast.

Oh yeah, good call with the writing to disk, that could definitely have an impact.

@Nathan_Lemoine After looking at your model again, I’ve realised you can optimise your model even further by using the bernoulli_logit_glm distribution. If we look at your likelihood:

y ~ bernoulli_logit(a+f);

and note that this is the same as:

y ~ bernoulli_logit(a + L_K * eta);

This means that you can treat the model as GLM where a is the intercept, L_K is the matrix of covariates, eta is the vector of coefficients:

  y ~ bernoulli_logit_glm(L_K, a, eta);

This glm distribution is more optimised and efficient than constructing the multiplication of L_K * eta separately. Additionally, as of Stan 2.25 (available through cmdstanPy) this can be evaluated on the GPU, providing an even bigger speed-up.

Combining that with my previous suggestions, your model could look like:

data{
  int<lower=1> N;
  row_vector[2] x[N];
  int<lower=0,upper=1> y[N];
}
transformed data{
  real delta = 1e-9;
}
parameters{
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;
  vector[N] eta;
}

model{
  matrix[N,N] L_K;

  eta ~ std_normal();
  rho ~ inv_gamma(5, 5);
  alpha ~ std_normal();
  a ~ std_normal();

  L_K = cholesky_decompose(add_diag(cov_exp_quad(x, alpha, rho),
                                    delta));

  y ~ bernoulli_logit_glm(L_K, a, eta);
}

Let me know if you have any questions about that

5 Likes

Huh, so I guess in fact most GPs should be using the _glm() functions, eh? For example, with a gaussian-noise outcome you’d do y ~ normal_id_glm(f,0,z,sigma) where z is the std_normal() parameter used to enable the cholesky representation of the latent GP f and sigma is the measurement noise term. Super cool. I wonder what kinds of speedup this has? It’s funny how I consider myself pretty attentive to developments in the language but still manage to miss stuff like this!

Haha it moves pretty fast, not always easy to keep up (for me as well!).

Yeah, in general the _glm functions will be more efficient in their construction of both the likelihood and its gradients. But in the past they could only be offloaded onto the GPU when both x and y were data, not parameters, which limited their applicability to something like this.

Thanks to Tadej’s consistently excellent work, as of 2.25 this restriction was removed so now any input can be a parameter. This means there should be some pretty good speed-ups available for GPs now

1 Like

Thanks all!! I’ll give these a try and see what happens.

The trick is that I need the variable f written to disk since I’m saving it as the posterior prediction for mapping later on, so I don’t know if I can move it. Unless putting it in generated quantities would somehow be faster? Not sure.

Give it a try, I’d be curious to see what your result is. Since the samples are written to file once, it shouldn’t actually make a difference. But in retrospect, my intuition is actually that f isn’t nearly large enough to make your likely storage medium speeds a factor, so I’m wondering if maybe you have a RAM bottleneck or something such that keeping the variable around during the main sampling procedure makes everything slower? Vars in the GQ get computed once at the end, once the sample is accepted, while TPs stick around in memory as the algorithm does its HMC stuff (and I think there’s a bunch of “copies” from different points along the trajectory).

I’ve tried using the reduce sum framework and running a chain in parallel. I saw no speedup:

Chain 1 -   done: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [43:25<00:00, 52.11s/it]
Model optimization (Bernoulli Logit) took 94.10 seconds
Model optimization took 1.57 minutes
Model sampling (Bernoulli Logit) took 2614.59 seconds
Model sampling took 43.58 minutes
INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1
Chain 1 -   done: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [44:36<00:00, 53.53s/it]
Model optimization (Bernoulli Logit Reduce Sum) took 107.16 seconds
Model optimization took 1.79 minutes
Model sampling (Bernoulli Logit Reduce Sum) took 2703.18 seconds
Model sampling took 45.05 minutes

I’ve attached the minimal working examples here. Is my reduce-sum function not correct?

X_train.csv (63.8 KB) y_train.csv (31.6 KB) model_optimization.py (2.1 KB) gpm_model_bernoullilogit_reducesum.stan (776 Bytes) gpm_model_bernoullilogit.stan (437 Bytes)

One of the tricky things with reduce sum is that only the first argument y_slice gets split into chunks and passed to the separate threads. The other parameters get copied in full. While this is fine for parameters with only a couple of elements, when the parameters are the same size as the outcome the copy cost can outweigh the speed-up.

What you can try is computing eta * L_k separately (I.e. f) and have that to be the parameter to be sliced, and y to be copied. Copying int parameters is pretty cheap

1 Like

Hi @andrjohns, I’m working on trying to slice f and running into some trouble:

functions {
  real partial_sum(vector f,
                   int start, int end,
                   int[] y,
                   vector eta,
                   matrix x) {
                              f = x[start:end]*eta[start:end];
                              return bernoulli_logit_lpmf(y[start:end] | f);
                              }
}
data{
  int<lower=1> N;
  row_vector[2] x[N];
  int<lower=0,upper=1> y[N];
}
transformed data{
  real delta = 1e-9;
}
parameters{
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;
  vector[N] eta;
}
model{
    int grainsize = 250;
    matrix[N,N] L_K = cholesky_decompose(add_diag(cov_exp_quad(x, alpha, rho), delta));
    
    target += reduce_sum(partial_sum,
                         grainsize,
                         y,
                         eta, L_K);

    eta ~ std_normal();
    rho ~ inv_gamma(5, 5);
    alpha ~ std_normal();
    a ~ std_normal();
}
 Semantic error in '/home/lemoinelab2/Documents/server-share/nate/Phenology_Species_Distributions/Code/model_optimization/gpm_model_bernoullilogit_reducesum_f.stan', line 7, column 30 to col
umn 62:                                                                                                                                                                                       
   -------------------------------------------------                                                                                                                                          
     5:                     vector eta,                                                                                                                                                       
     6:                     matrix x) {                                                                                                                                                       
     7:                                f = x[start:end]*eta[start:end];                                                                                                                       
                                       ^                                                                                                                                                      
     8:                                return bernoulli_logit_lpmf(y[start:end] | f);
     9:                                }
   -------------------------------------------------

Cannot assign to function argument or loop identifier 'f'.

Also, does x in the function need to be x[start:end,start:end]? I’m a little confused how the matrix indexing works.

One thing I have tried: I hadn’t standardized (N(0,1)) the predictors. I changed that so that both predictors are now N(0,1) and now the reduce sum gives me a huge speedup. The original model takes about 60 minutes to sample, and the reduce sum model does it in 12. So standardization appears to have sped things up in parallel, but not when run singly.

The issue is that the partial_sum function is expecting f to be passed externally, but you’re computing it internally. Additionally, the arguments are out of order. Try the following:

functions {
  real partial_sum(real[] f_slice, int start, int end, int[] y) {
    return bernoulli_logit_lpmf(y[start:end] | to_vector(f_slice));
  }
}
data{
  int<lower=1> N;
  row_vector[2] x[N];
  int<lower=0,upper=1> y[N];
}
transformed data{
  real delta = 1e-9;
}
parameters{
  real<lower=0> rho;
  real<lower=0> alpha;
  real a;
  vector[N] eta;
}
model{
    int grainsize = 1;
    matrix[N,N] L_K = cholesky_decompose(add_diag(cov_exp_quad(x, alpha, rho), delta));

    eta ~ std_normal();
    rho ~ inv_gamma(5, 5);
    alpha ~ std_normal();
    a ~ std_normal();

    target += reduce_sum(partial_sum, to_array_1d(L_K*eta), grainsize, y);
}

An issue is that reduce_sum can’t natively slice over vector inputs, they need to be arrays. Consequently, we need to convert between array and vector types for the call (@wds15 is there an alternative here?).

Also, if you have access to a GPU I’d strongly recommend getting that setup, as it could accelerate both the cholesky decomposition and matrix multiplication. It would also be worth comparing the speed of the _glm function

No alternative here.

I very much doubt that reduce_sum is of any use here. The big thing to wrestle in this model right now is very likely the GP bits. I would

  1. switch to the glm sampling statement thing (even if not using a GPU)
  2. find ways of how to simplify the GP thing (binning, some approximation)
  3. maybe add reduce_sum
1 Like

I’ve done some more testing. I compared the base model, the base model using reduce_sum, and a reduce_sum model using the _glm function. The code for the first two is above, here is the code for the _glm model:

functions {
      real partial_sum(int[] y_slice,
                       int start, int end,
                       vector eta,
                       real a,
                       matrix x) {
      return bernoulli_logit_glm_lpmf(y_slice | x[start:end,start:end], a, eta[start:end]);
      }
    }
    data{
      int<lower=1> N;
      row_vector[2] x[N];
      int<lower=0,upper=1> y[N];
    }
    transformed data{
      real delta = 1e-9;
    }
    parameters{
      real<lower=0> rho;
      real<lower=0> alpha;
      real a;
      vector[N] eta;
    }
    model{
        int grainsize = 250;
        matrix[N,N] L_K = cholesky_decompose(add_diag(cov_exp_quad(x, alpha, rho), delta));
        
        target += reduce_sum(partial_sum, y,
                             grainsize,
                             eta, a, L_K);

        eta ~ std_normal();
        rho ~ inv_gamma(5, 5);
        alpha ~ std_normal();
        a ~ std_normal();
    }

And here are the timings (this is only one run, but they are fairly consistent).

Model optimization (Bernoulli Logit) took 171.93 seconds or 2.87 minutes
Model sampling (Bernoulli Logit) took 1176.22 seconds or 19.60 minutes

Model optimization (Bernoulli Logit Reduce Sum) took 17.96 seconds or 0.30 minutes
Model sampling (Bernoulli Logit Reduce Sum) took 1136.27 seconds or 18.94 minutes

Model optimization (Bernoulli Logit Reduce Sum (GLM)) took 34.73 second or 0.58 minutes
Model sampling (Bernoulli Logit Reduce Sum (GLM)) took 1487.41 seconds or 24.79 minutes

My conclusions from all this are: reduce some helped by only when the predictors had been standardized (probably because the priors were too far off for non-standardized variables). The _glm function doesn’t seem to be too different from the non _glm function, unless I did that wrong?

Another caveat is I tried the f slice code above, and it took so long that it never even finished one warmup iteration after several hours.

Just a guess, but you could try adjusting the grainsize again for the GLM model, maybe 250 is just too small.

So 2D GP? In that case you might be interested also in basis function approximation https://arxiv.org/abs/2004.11408 with Stan code at https://github.com/gabriuma/basis_functions_approach_to_GP

This is clever!

1 Like

Another caveat is I tried the f slice code above, and it took so long that it never even finished one warmup iteration after several hours

Yeah, once I had to copy between vector and real array types, I didn’t have a lot of hope.

Can you try the GLM function without reduce_sum, but using GPU acceleration?

There is some additional configuration needed to use the GPU. The setup guide for OpenCL is here: http://mc-stan.org/math/opencl_support.html

To the enable GPU acceleration with cmdstanPy, you then need to add "STAN_OPENCL": True to the cpp_options in your CmdStanModel section.

@mitzimorris does the user also need to specify OPENCL_DEVICE_ID and OPENCL_PLATFORM_ID in cmdstanPy or does this get set to some default automatically?

I don’t know if my server has GPU capabilities. Here’s the video info:

sudo lshw -C display            
      *-display                 
           description: VGA compatible controller
           product: ASPEED Graphics Family
           vendor: ASPEED Technology, Inc.
           physical id: 0
           bus info: pci@0000:02:00.0
           version: 41
           width: 32 bits
           clock: 33MHz
           capabilities: pm msi vga_controller cap_list rom
           configuration: driver=ast latency=0
           resources: irq:19 memory:91000000-91ffffff memory:92000000-9201ffff ioport:2000(size=128) memory:c0000-dffff