Using map_rect with multiple Gaussian processes (sampling time)

Dear all,

I’m trying to model multiple realizations of Gaussian processes jointly in Stan. Right now I’m fitting them as independent processes with shared hyper-parameters, but later I would like to do some partial pooling, add a common mean, change the likelihood, and so on.

I’m comparing a simple vectorized version with a version implemented using map_rect, where I map the likelihood across the realizations. My thought was that the latter would be the way to go once I have both shared and individual parameters for the GPs, as it should scale well with the number of realizations.

My implementation based on map_rect is, however, not as fast as I expected. I’m running this on a machine with 128 logical cores, all of which show full utilization under the map_rect model, whereas the vectorized version only uses 4 cores (one per chain).

Is there any reason why my map_rect implementation is so much slower than the vectorized version even though it is utilizing all 128 cores, and can I speed it up somehow?

Here is some reproducible code.


Vectorized Stan program

data {
  int<lower=1> N;   // Number of functions
  int<lower=1> P;   // Number of sampling points
  real t[P];        // Common grid of observation points
  matrix[P, N] y;   // Realizations of the functions
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  matrix[P, N] eta;
}

transformed parameters {
  matrix[P, N] f;
  {
    matrix[P, P] K = cov_exp_quad(t, alpha, rho);
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9; // add jitter for numerical stability
    }
    f = cholesky_decompose(K) * eta;
  }
}

model {
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  to_vector(eta) ~ normal(0, 1);

  to_vector(y) ~ normal(to_vector(f), sigma);
}

Stan program based on map_rect, mapping the likelihood across realizations

functions {
  vector gp_loglik(vector phi, vector theta, real[] x_r, int[] x_i) {
    int P = 30;  // hard-coded for now; P is not passed to the shard
    vector[P] f; // latent Gaussian process
    
    real alpha = phi[1]; // shared parameters
    real rho = phi[2];
    real sigma = phi[3];
    
    real t[P] = x_r[1:P];         // first P elements are the observation times
    real y[P] = x_r[(P+1):(2*P)]; // second P elements are the observed values
    
    matrix[P, P] K = cov_exp_quad(t, alpha, rho);
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9; // add jitter for numerical stability
    }
    f = cholesky_decompose(K) * theta;
    
    return [normal_lpdf(y | f, sigma)]';
  }
}

data {
  int<lower=1> N;   // Number of functions
  int<lower=1> P;   // Number of sampling points
  real t[P];        // Common grid of observation points
  real y[P, N];     // Realizations of the functions
}

transformed data {
  /* Transform data into N shards of 2*P arrays of
     observation times and observed values */
  real x_r[N, 2*P];
  int x_i[N, 2*P];  // unused
  
  for(i in 1:N) {
    x_r[i, 1:P] = t;
    x_r[i, (P+1):(2*P)] = y[,i];
    
    x_i[i,] = rep_array(0, 2*P); // fill it with zeros
  }
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  matrix[P, N] eta;
}

transformed parameters {
  vector[3] phi;       // common parameters
  vector[P] theta[N];  // individual parameters
  
  phi = [alpha, rho, sigma]'; 
  for(i in 1:N) {
    theta[i] = eta[,i];
  }
}

model {
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  to_vector(eta) ~ std_normal();

  target += sum(map_rect(gp_loglik, phi, theta, x_r, x_i));
}

And here is some R code that fits the two Stan implementations to simulated data

rm(list=ls())
library(rstan)

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

simulateProcess <- function(nObs = 50, sigma = 0.1) {
  library(mvtnorm)
  
  covSE <- function(s, t, alpha, rho) {
    alpha^2 * exp(-(s-t)^2 / (2*rho^2))
  }

  #Very close to continuous time approximation
  t <- seq(-0.5, 0.5, length.out = nObs)
  C <- outer(t, t, covSE, alpha=1, rho = 0.1)
  mu <- rep(0, length(t))
  f <- as.numeric(rmvnorm(1, mu, C))
  y <- f + rnorm(nObs, 0, sigma)
  
  list(t = t, f = f, y = y)
}

P <- 30  #number of observation points
N <- 25  #number of GPs

set.seed(12345)
d <- lapply(1:N, function(i) simulateProcess(nObs = P))
t <- d[[1]]$t
y <- sapply(d, function(q) q$y)
f <- sapply(d, function(q) q$f)
sDat <- list(N = N, P = P, t = t, y = y)

m <- stan_model("gp_vectorized.stan")
start_vectorized <- Sys.time()
samp1 <- sampling(m, data = sDat, iter = 10^4, seed=12345)
end_vectorized <- Sys.time()
timing_vectorized <- end_vectorized - start_vectorized

m2 <- stan_model("gp_map.stan")
start_map <- Sys.time()
samp2 <- sampling(m2, data = sDat, iter = 10^4, seed=12345)
end_map <- Sys.time()
timing_map <- end_map - start_map

Here are the timing summaries from running the two implementations on my machine

> timing_vectorized
Time difference of 4.224541 mins
> timing_map
Time difference of 16.89453 mins

Thanks a lot in advance for any advice on this.

To me it looks like there are (at least) two issues:

  • Some work is being done multiple times in the map_rect version, namely everything that has to do with the matrix K. To me it looks like K and its Cholesky decomposition are globally constant? (See the sketch after this list for one way to hoist that work out of the shards.)
  • In the map_rect version you do many (N) matrix-vector products instead of one matrix-matrix product.
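
If K and its Cholesky factor really are identical across realizations, that work can be hoisted out of map_rect entirely, so each shard does nothing but evaluate its own likelihood term. Here is an untested sketch of gp_map.stan rewritten that way: x_r then only carries the observed values, and theta carries the pre-computed latent function (the shard function below just reuses the gp_loglik name, it is not your original one).

functions {
  vector gp_loglik(vector phi, vector theta, real[] x_r, int[] x_i) {
    int P = num_elements(theta); // theta holds the latent f for one realization
    real sigma = phi[1];         // the only remaining shared parameter
    // x_r holds the P observed values for this realization
    return [normal_lpdf(x_r | theta, sigma)]';
  }
}

data {
  int<lower=1> N;   // Number of functions
  int<lower=1> P;   // Number of sampling points
  real t[P];        // Common grid of observation points
  real y[P, N];     // Realizations of the functions
}

transformed data {
  real x_r[N, P];   // one shard per realization, observed values only
  int x_i[N, 1];    // placeholder integer data (unused)
  for (i in 1:N) {
    x_r[i] = y[, i];
    x_i[i, 1] = 0;
  }
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  matrix[P, N] eta;
}

transformed parameters {
  vector[P] theta[N];
  {
    matrix[P, P] K = cov_exp_quad(t, alpha, rho);
    matrix[P, N] f;
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9; // add jitter for numerical stability
    }
    // one Cholesky decomposition and one matrix-matrix product per evaluation
    f = cholesky_decompose(K) * eta;
    for (i in 1:N) {
      theta[i] = f[, i];
    }
  }
}

model {
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  to_vector(eta) ~ std_normal();

  target += sum(map_rect(gp_loglik, [sigma]', theta, x_r, x_i));
}

Whether the remaining per-shard work is large enough to pay for the map_rect overhead is of course an open question, and once K depends on realization-specific parameters the decomposition has to move back into the shard.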

Thanks a lot for your comment. Yes, you are right that I do everything with the covariance matrix identically in each shard. This will not be the case in the final implementation, as K will then depend on individual parameters in the vector theta. I just did it this way for now to have an implementation identical to the vectorized one.
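
Just to illustrate what I mean (untested, and the individual length-scale here is only a placeholder for whatever ends up varying per realization): the individual parameters would be packed into theta, e.g. theta[i] = append_row([rho[i]]', eta[, i]) in transformed parameters, with x_r packed as before, and the shard function would unpack them like this.

  vector gp_loglik(vector phi, vector theta, real[] x_r, int[] x_i) {
    int P = num_elements(theta) - 1; // theta = (rho_i, eta_i[1:P])
    real alpha = phi[1];             // shared parameters
    real sigma = phi[2];
    real rho_i = theta[1];           // realization-specific length-scale
    vector[P] eta_i = theta[2:(P + 1)];
    real t[P] = x_r[1:P];
    real y[P] = x_r[(P + 1):(2 * P)];
    matrix[P, P] K = cov_exp_quad(t, alpha, rho_i);
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9;      // add jitter for numerical stability
    }
    return [normal_lpdf(y | cholesky_decompose(K) * eta_i, sigma)]';
  }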

Maybe the decrease in speed that I see is just a consequence of the increased amount of matrix algebra. I just thought that map_rect would scale better with so many cores (128), so maybe I was doing something wrong.

This will not be the case in the final implementation

Ah, okay. Then something like that will be unavoidable, but a better comparison would then be to have the vectorized version compute just as many Cholesky decompositions. The question still remains whether this would be the bottleneck.
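
For instance, something like this (untested) drop-in replacement for the transformed parameters block of gp_vectorized.stan would redo the decomposition once per realization, purely to make the serial baseline comparable:

transformed parameters {
  matrix[P, N] f;
  {
    matrix[P, P] K = cov_exp_quad(t, alpha, rho);
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9; // add jitter for numerical stability
    }
    // N decompositions and N matrix-vector products, mirroring the per-shard work
    for (i in 1:N) {
      f[, i] = cholesky_decompose(K) * eta[, i];
    }
  }
}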

How many observation points P and GPs N do you have in the end? I guess that if N is smaller than the number of cores available per chain, you should not be able to fully use all cores?

The question still remains whether this would be the bottleneck.

Yes, you are right. Thank you for your comments. I was also wondering whether anything else could be optimized, like the way I pass in the parameters, or the fact that I create a rather large N x 2*P integer array filled with zeros just to have something for x_i. I don’t know if any of this matters or could be done differently.
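
For example, maybe I could shrink x_i to a single integer per shard and use it to carry P, which would also get rid of the hard-coded value in gp_loglik, something like this (untested):

transformed data {
  real x_r[N, 2 * P];
  int x_i[N, 1];  // one integer per shard instead of 2*P unused zeros
  for (i in 1:N) {
    x_r[i, 1:P] = t;
    x_r[i, (P + 1):(2 * P)] = y[, i];
    x_i[i, 1] = P;  // lets the shard recover the number of observation points
  }
}

with the first line of gp_loglik changed to int P = x_i[1];.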

How many observation points P and GPs N do you have in the end? I guess that if N is smaller than the number of cores available per chain, you should not be able to fully use all cores?

Probably something like P = 50 and N = 100. I’m also not sure how map_rect distributes the calculations. I’ve just run an example with N = 10 and 4 chains on 128 cores with options(mc.cores = 128). It shows full CPU utilization even though 10 * 4 < 128. But I guess that’s a good thing.

I’m also not sure how map_rect distributes the calculations.

Yeah, me neither… I’ve also found the documentation (23.2 Map-rect | Stan User’s Guide) to be rather unhelpful…

It seems to suggest that across all iterations, the core->data mapping is constant, implying that the distribution of tasks across cores is known a priori.

Also, I have found no way to actually specify the number of threads to be used per chain, so Stan uses 12 threads per chain (my number of logical cores), which seems rather counterproductive. Indeed, the only time I have seen any speedup for your problem was when I used a single chain, which then used 12 threads…

What I find even weirder is that you get a full load with your configuration.

Maybe someone who actually knows map_rect should weigh in :D

That would indeed be very appreciated. Thanks a lot for your comments so far though.