Dear all,

I’m trying to model multiple realizations of Gaussian processes jointly in stan. Right now I’m fitting them as independent processes but with the same shared hyper-parameters, but later I would like to do some partial pooling, add a common mean, change the likelihood and so on.

I’m comparing a simple vectorized version with a version implemented using map_rect where I map the likelihood across the realizations. My thought was that the latter would be they way to go when I’ll have both shared and individual parameters for the GPs as it would scale well with the number of realizations.

My implementation based on map_rect is, however, not as fast as I expected. I’m running this on a machine with 128 logical cores and they all show full utilization under the map_rect model, while only utilizing 4 cores (corresponding to the number of chains) in the vectorized version.

Is there any reason why my implementation using map_rect is so slower than the vectorized version even though it’s utilizing all my 128 cores, and can I speed it up some how?

Here is some reproducible code.

Vectorized stan program

```
data {
int<lower=1> N; // Number of functions
int<lower=1> P; // Number of sampling points
real t[P]; // Common grid of observation points
matrix[P, N] y; // Realizations of the functions
}
parameters {
real<lower=0> rho;
real<lower = 0> alpha;
real<lower=0> sigma;
matrix[P, N] eta;
}
transformed parameters {
matrix[P, N] f;
{
matrix[P, P] K = cov_exp_quad(t, alpha, rho);
for (p in 1:P) {
K[p, p] = K[p, p] + 1e-9;
}
f = cholesky_decompose(K) * eta;
}
}
model {
rho ~ inv_gamma(5, 5);
alpha ~ normal(0, 1);
sigma ~ normal(0, 1);
to_vector(eta) ~ normal(0, 1);
to_vector(y) ~ normal(to_vector(f), sigma);
}
```

stan program based on map_rect of the likelihood across realizations

```
functions {
vector gp_loglik(vector phi, vector theta, real[] x_r, int[] x_i) {
int P = 30; // bad hard-coded value for now
vector[P] f; // latent gaussian process
real alpha = phi[1]; //shared parameters
real rho = phi[2];
real sigma = phi[3];
real t[P] = x_r[1:P]; // first P elements are the observation times
real y[P] = x_r[(P+1):(2*P)]; // second P elements are the observed values
matrix[P, P] K = cov_exp_quad(t, alpha, rho);
for (p in 1:P) {
K[p, p] = K[p, p] + 1e-9;
}
f = cholesky_decompose(K) * theta;
return [normal_lpdf(y | f, sigma)]';
}
}
data {
int<lower=1> N; // Number of functions
int<lower=1> P; // Number of sampling points
real t[P]; // Common grid of observation points
real y[P, N]; // Realizations of the functions
}
transformed data {
/* Transform data into N shards of 2*P arrays of
observation times and observed values */
real x_r[N, 2*P];
int x_i[N, 2*P]; // unused
for(i in 1:N) {
x_r[i, 1:P] = t;
x_r[i, (P+1):(2*P)] = y[,i];
x_i[i,] = rep_array(0, 2*P); // fill it with zeros
}
}
parameters {
real<lower=0> rho;
real<lower = 0> alpha;
real<lower=0> sigma;
matrix[P, N] eta;
}
transformed parameters {
vector[3] phi; // common parameters
vector[P] theta[N]; // individual parameters
phi = [alpha, rho, sigma]';
for(i in 1:N) {
theta[i] = eta[,i];
}
}
model {
rho ~ inv_gamma(5, 5);
alpha ~ normal(0, 1);
sigma ~ normal(0, 1);
to_vector(eta) ~ std_normal();
target += sum(map_rect(gp_loglik, phi, theta, x_r, x_i));
}
```

And here is some R code that fits the two stan implementations based on simulated data

```
rm(list=ls())
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
simulateProcess <- function(nObs = 50, sigma = 0.1) {
library(mvtnorm)
covSE <- function(s, t, alpha, rho) {
alpha^2 * exp(-(s-t)^2 / (2*rho^2))
}
#Very close to continuous time approximation
t <- seq(-0.5, 0.5, length.out = nObs)
C <- outer(t, t, covSE, alpha=1, rho = 0.1)
mu <- rep(0, length(t))
f <- as.numeric(rmvnorm(1, mu, C))
y <- f + rnorm(nObs, 0, sigma)
list(t = t, f = f, y = y)
}
P <- 30 #number of observation points
N <- 25 #number of GPs
set.seed(12345)
d <- lapply(1:N, function(i) simulateProcess(nObs = P))
t <- d[[1]]$t
y <- sapply(d, function(q) q$y)
f <- sapply(d, function(q) q$f)
sDat <- list(N = N, P = P, t = t, y = y)
m <- stan_model("gp_vectorized.stan")
start_vectorized <- Sys.time()
samp1 <- sampling(m, data = sDat, iter = 10^4, seed=12345)
end_vectorized <- Sys.time()
timing_vectorized <- end_vectorized - start_vectorized
m2 <- stan_model("gp_map.stan")
start_map <- Sys.time()
samp2 <- sampling(m2, data = sDat, iter = 10^4, seed=12345)
end_map <- Sys.time()
timing_map <- end_map - start_map
```

Here are the timing summaries from running the two implementations on my machine

```
> timing_vectorized
Time difference of 4.224541 mins
> timing_map
Time difference of 16.89453 mins
>
```

Thanks a lot in advance for any advice on this.