Using map_rect with multiple Gaussian processes (sampling time)

Dear all,

I’m trying to model multiple realizations of Gaussian processes jointly in Stan. Right now I’m fitting them as independent processes with shared hyper-parameters, but later I would like to do some partial pooling, add a common mean, change the likelihood, and so on.

I’m comparing a simple vectorized version with a version implemented using map_rect, where I map the likelihood across the realizations. My thought was that the latter would be the way to go once I have both shared and individual parameters for the GPs, as it should scale well with the number of realizations.

My implementation based on map_rect is, however, not as fast as I expected. I’m running this on a machine with 128 logical cores, all of which show full utilization under the map_rect model, whereas the vectorized version only uses 4 cores (one per chain).

Is there any reason why my map_rect implementation is so much slower than the vectorized version even though it is utilizing all 128 cores, and can I speed it up somehow?

Here is some reproducible code.


Vectorized Stan program

data {
  int<lower=1> N;   // Number of functions
  int<lower=1> P;   // Number of sampling points
  real t[P];        // Common grid of observation points
  matrix[P, N] y;   // Realizations of the functions
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  matrix[P, N] eta;
}

transformed parameters {
  matrix[P, N] f;
  {
    matrix[P, P] K = cov_exp_quad(t, alpha, rho);
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9; // add jitter for numerical stability
    }
    f = cholesky_decompose(K) * eta;
  }
}

model {
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  to_vector(eta) ~ normal(0, 1);

  to_vector(y) ~ normal(to_vector(f), sigma);
}

Stan program based on map_rect, mapping the likelihood across realizations

functions {
  vector gp_loglik(vector phi, vector theta, real[] x_r, int[] x_i) {
    int P = 30;  // hard-coded for now; P is not passed to the shard
    vector[P] f; // latent Gaussian process
    
    real alpha = phi[1]; // shared parameters
    real rho = phi[2];
    real sigma = phi[3];
    
    real t[P] = x_r[1:P];         // first P elements are the observation times
    real y[P] = x_r[(P+1):(2*P)]; // second P elements are the observed values
    
    matrix[P, P] K = cov_exp_quad(t, alpha, rho);
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9; // add jitter for numerical stability
    }
    f = cholesky_decompose(K) * theta;
    
    return [normal_lpdf(y | f, sigma)]';
  }
}

data {
  int<lower=1> N;   // Number of functions
  int<lower=1> P;   // Number of sampling points
  real t[P];        // Common grid of observation points
  real y[P, N];     // Realizations of the functions
}

transformed data {
  /* Transform data into N shards of 2*P arrays of
     observation times and observed values */
  real x_r[N, 2*P];
  int x_i[N, 2*P];  // unused
  
  for(i in 1:N) {
    x_r[i, 1:P] = t;
    x_r[i, (P+1):(2*P)] = y[,i];
    
    x_i[i,] = rep_array(0, 2*P); // fill it with zeros
  }
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  matrix[P, N] eta;
}

transformed parameters {
  vector[3] phi;       // common parameters
  vector[P] theta[N];  // individual parameters
  
  phi = [alpha, rho, sigma]'; 
  for(i in 1:N) {
    theta[i] = eta[,i];
  }
}

model {
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  to_vector(eta) ~ std_normal();

  target += sum(map_rect(gp_loglik, phi, theta, x_r, x_i));
}

And here is some R code that fits the two Stan implementations to simulated data

rm(list=ls())
library(rstan)

options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

simulateProcess <- function(nObs = 50, sigma = 0.1) {
  library(mvtnorm)
  
  covSE <- function(s, t, alpha, rho) {
    alpha^2 * exp(-(s-t)^2 / (2*rho^2))
  }

  #Very close to continuous time approximation
  t <- seq(-0.5, 0.5, length.out = nObs)
  C <- outer(t, t, covSE, alpha=1, rho = 0.1)
  mu <- rep(0, length(t))
  f <- as.numeric(rmvnorm(1, mu, C))
  y <- f + rnorm(nObs, 0, sigma)
  
  list(t = t, f = f, y = y)
}

P <- 30  #number of observation points
N <- 25  #number of GPs

set.seed(12345)
d <- lapply(1:N, function(i) simulateProcess(nObs = P))
t <- d[[1]]$t
y <- sapply(d, function(q) q$y)
f <- sapply(d, function(q) q$f)
sDat <- list(N = N, P = P, t = t, y = y)

m <- stan_model("gp_vectorized.stan")
start_vectorized <- Sys.time()
samp1 <- sampling(m, data = sDat, iter = 10^4, seed=12345)
end_vectorized <- Sys.time()
timing_vectorized <- end_vectorized - start_vectorized

m2 <- stan_model("gp_map.stan")
start_map <- Sys.time()
samp2 <- sampling(m2, data = sDat, iter = 10^4, seed=12345)
end_map <- Sys.time()
timing_map <- end_map - start_map

Here are the timing summaries from running the two implementations on my machine

> timing_vectorized
Time difference of 4.224541 mins
> timing_map
Time difference of 16.89453 mins

Thanks a lot in advance for any advice on this.

To me it looks like there are (at least) two issues:

  • Some work is being done multiple times in the map_rect version, namely everything that has to do with the matrix K. To me it looks like K and its Cholesky decomposition are globally constant? (See the sketch after this list for one way to hoist that work out of the shards.)
  • In the map_rect version you do many (N) matrix-vector products instead of one matrix-matrix product.
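
If K and its Cholesky factor really are identical across realizations, that work can be hoisted out of map_rect entirely, so each shard does nothing but evaluate its own likelihood term. Here is an untested sketch of gp_map.stan rewritten that way: x_r then only carries the observed values, and theta carries the pre-computed latent function (the shard function below just reuses the gp_loglik name, it is not your original one).

functions {
  vector gp_loglik(vector phi, vector theta, real[] x_r, int[] x_i) {
    int P = num_elements(theta); // theta holds the latent f for one realization
    real sigma = phi[1];         // the only remaining shared parameter
    // x_r holds the P observed values for this realization
    return [normal_lpdf(x_r | theta, sigma)]';
  }
}

data {
  int<lower=1> N;   // Number of functions
  int<lower=1> P;   // Number of sampling points
  real t[P];        // Common grid of observation points
  real y[P, N];     // Realizations of the functions
}

transformed data {
  real x_r[N, P];   // one shard per realization, observed values only
  int x_i[N, 1];    // placeholder integer data (unused)
  for (i in 1:N) {
    x_r[i] = y[, i];
    x_i[i, 1] = 0;
  }
}

parameters {
  real<lower=0> rho;
  real<lower=0> alpha;
  real<lower=0> sigma;
  matrix[P, N] eta;
}

transformed parameters {
  vector[P] theta[N];
  {
    matrix[P, P] K = cov_exp_quad(t, alpha, rho);
    matrix[P, N] f;
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9; // add jitter for numerical stability
    }
    // one Cholesky decomposition and one matrix-matrix product per evaluation
    f = cholesky_decompose(K) * eta;
    for (i in 1:N) {
      theta[i] = f[, i];
    }
  }
}

model {
  rho ~ inv_gamma(5, 5);
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  to_vector(eta) ~ std_normal();

  target += sum(map_rect(gp_loglik, [sigma]', theta, x_r, x_i));
}

Whether the remaining per-shard work is large enough to pay for the map_rect overhead is of course an open question, and once K depends on realization-specific parameters the decomposition has to move back into the shard.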

Thanks a lot for your comment. Yes, you are right that I do everything with the covariance matrix identically in each shard. This will not be the case in the final implementation, as K will then depend on individual parameters in the vector theta. I just did it this way for now to have an implementation identical to the vectorized one.
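
Just to illustrate what I mean (untested, and the individual length-scale here is only a placeholder for whatever ends up varying per realization): the individual parameters would be packed into theta, e.g. theta[i] = append_row([rho[i]]', eta[, i]) in transformed parameters, with x_r packed as before, and the shard function would unpack them like this.

  vector gp_loglik(vector phi, vector theta, real[] x_r, int[] x_i) {
    int P = num_elements(theta) - 1; // theta = (rho_i, eta_i[1:P])
    real alpha = phi[1];             // shared parameters
    real sigma = phi[2];
    real rho_i = theta[1];           // realization-specific length-scale
    vector[P] eta_i = theta[2:(P + 1)];
    real t[P] = x_r[1:P];
    real y[P] = x_r[(P + 1):(2 * P)];
    matrix[P, P] K = cov_exp_quad(t, alpha, rho_i);
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9;      // add jitter for numerical stability
    }
    return [normal_lpdf(y | cholesky_decompose(K) * eta_i, sigma)]';
  }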

Maybe the decrease in speed that I see is just a consequence of the increased amount of matrix algebra. I just thought that map_rect would scale better with so many cores (128), so maybe I was doing something wrong.

This will not be the case in the final implementation

Ah, okay. Then something like that will be unavoidable, but a better comparison would then be to have the vectorized version compute just as many Cholesky decompositions. The question still remains whether this would be the bottleneck.
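
For instance, something like this (untested) drop-in replacement for the transformed parameters block of gp_vectorized.stan would redo the decomposition once per realization, purely to make the serial baseline comparable:

transformed parameters {
  matrix[P, N] f;
  {
    matrix[P, P] K = cov_exp_quad(t, alpha, rho);
    for (p in 1:P) {
      K[p, p] = K[p, p] + 1e-9; // add jitter for numerical stability
    }
    // N decompositions and N matrix-vector products, mirroring the per-shard work
    for (i in 1:N) {
      f[, i] = cholesky_decompose(K) * eta[, i];
    }
  }
}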

How many observation points P and GPs N do you have in the end? I guess that if N is smaller than the number of cores available per chain, you should not be able to fully use all cores?

The question still remains whether this would be the bottleneck.

Yes, you are right. Thank you for your comments. I was also wondering whether anything else could be optimized, like the way I pass in the parameters, or the fact that I create a rather large N x 2*P integer array filled with zeros just to have something for x_i. I don’t know if any of this matters or could be done differently.
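
For example, maybe I could shrink x_i to a single integer per shard and use it to carry P, which would also get rid of the hard-coded value in gp_loglik, something like this (untested):

transformed data {
  real x_r[N, 2 * P];
  int x_i[N, 1];  // one integer per shard instead of 2*P unused zeros
  for (i in 1:N) {
    x_r[i, 1:P] = t;
    x_r[i, (P + 1):(2 * P)] = y[, i];
    x_i[i, 1] = P;  // lets the shard recover the number of observation points
  }
}

with the first line of gp_loglik changed to int P = x_i[1];.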

How many observation points P and GPs N do you have in the end? I guess that if N is smaller than the number of cores available per chain, you should not be able to fully use all cores?

Probably something like P = 50 and N = 100. I’m also not sure how map_rect distributes the calculations. I’ve just run an example with N = 10 and 4 chains on 128 cores with options(mc.cores = 128). It shows full CPU utilization even though 10 * 4 < 128. But I guess that’s a good thing.

I’m also not sure how map_rect distributes the calculations.

Yeah, me neither… I’ve also found the documentation (23.2 Map-rect | Stan User’s Guide) to be rather unhelpful…

It seems to suggest that across all iterations, the core->data mapping is constant, implying that the distribution of tasks across cores is known a priori.

Also, I have found no way to actually specify the number of threads to be used per chain, so Stan uses 12 threads per chain (my number of logical cores), which seems rather counterproductive. Indeed, the only time I have seen any speedup for your problem was when I used a single chain, which then used 12 threads…

What I find even weirder is that you get a full load with your configuration.

Maybe someone who actually knows map_rect should weigh in :D

That would indeed be very appreciated. Thanks a lot for your comments so far though.