Hello Stan Community!
In a previous post I asked for your help to efficiently implement within-chain parallelization via reduce_sum in a Bayesian PCA. I've made some progress, but I'm stuck again.
To reproduce my issue, I've implemented a Bayesian PCA on simulated data. The model assumes that
\mathbf{x}_{n} = \mathbf{W}\mathbf{z}_{n} + \boldsymbol{\epsilon}_{n},
where \mathbf{W}\in\mathbb{R}^{D \times M} with M<D, \mathbf{z}_{n} \sim \mathcal{N}_{M}(\mathbf{0},\mathbf{I}), and \boldsymbol{\epsilon}_{n} \sim \mathcal{N}_{D}(\mathbf{0},\sigma^{2}\mathbf{I}) for n=1,2,\ldots,N.
Furthermore, placing priors on the latent vectors and the hyperparameters, the Bayesian PCA model is as follows:
\mathbf{x}_{n} \mid \mathbf{W},\mathbf{z}_{n},\tau_{x} \sim \mathcal{N}_{D}(\mathbf{W}\mathbf{z}_{n},\tau_{x}^{-1}\mathbf{I}),\quad
\mathbf{z}_{n} \sim \mathcal{N}_{M}(\mathbf{0},\mathbf{I}),\quad
\mathbf{w}_{m} \mid \tau_{w_{m}} \sim \mathcal{N}_{D}(\mathbf{0},\tau_{w_{m}}^{-1}\mathbf{I}),\quad
\tau_{x} \sim \mathrm{Gamma}(1,1),\quad
\tau_{w_{m}} \sim \mathrm{Gamma}(0.001,0.001),
where \mathbf{w}_{m} is the m-th column of \mathbf{W}.
The following \mathsf{R} code simulates the data and fits both models; the two \mathsf{Stan} programs below (bpca_seq.stan and bpca_par.stan) implement Bayesian PCA sequentially and with within-chain parallelization, respectively.
library(cmdstanr)
library(posterior)
library(bayesplot)
### Simulated data
N<-200   # number of observations
D<-20    # observed dimension
M<-5     # true latent dimension
set.seed(1987)
Z<-matrix(data=rnorm(N*M),nrow=N,ncol=M)   # latent scores
sigma_w<-rep(1,M)
Wt<-matrix(NA,nrow=M,ncol=D)               # t(W): one row per latent dimension
for(m in 1:M){
  Wt[m,]=rnorm(D,0,sigma_w[m])
}
sigma_x<-1
X<-Z%*%Wt+matrix(data=rnorm(N*D,0,sigma_x),nrow=N,ncol=D)   # X = Z W^t + noise
### Inference
data <- list(N=N,D=D,M=10,X=X)   # fitted with M = 10 latent dimensions (data were simulated with M = 5)
## Sequential version
mod_seq <- cmdstan_model(stan_file="bpca_seq.stan",
cpp_options=list(stan_threads=FALSE))
fit_seq <- mod_seq$sample(data=data,seed=87,chains=1,iter_warmup=1000,iter_sampling=15000,parallel_chains=1)
## Parallelized version
mod_par <- cmdstan_model(stan_file="bpca_par.stan",
cpp_options=list(stan_threads=TRUE))
fit_par <- mod_par$sample(data=data,seed=87,chains=1,iter_warmup=1000,iter_sampling=15000,parallel_chains=1,threads_per_chain=16)
## Estimates
fit_seq$summary(variables=c("tau_x","tau_w"))
fit_par$summary(variables=c("tau_x","tau_w"))
## Timing
fit_seq$profiles()
fit_par$profiles()
// bpca_seq.stan: sequential version
data {
int<lower=1> N;
int<lower=1> D;
int<lower=1> M;
matrix[N,D] X;
}
parameters {
real<lower=0> tau_x;
vector<lower=0>[M] tau_w;
matrix[N,M] Z;
matrix[M,D] Wt;
}
transformed parameters {
matrix[N,D] mu = Z*Wt;
real<lower=0> sigma_x = inv(sqrt(tau_x));
vector<lower=0>[M] sigma_w = inv(sqrt(tau_w));
}
model {
profile("priors") {
tau_x ~ gamma(1,1);
tau_w ~ gamma(.001,.001);
for (m in 1:M){
Wt[m,] ~ normal(0,sigma_w[m]);
}
to_vector(Z) ~ normal(0,1);
}
profile("likelihood") {
to_vector(X) ~ normal(to_vector(mu),sigma_x);
}
}
// bpca_par.stan: parallelized version via reduce_sum
functions {
// alternative 1: slice a pre-flattened X (1-D array of length N*D)
real partial_sum1(array[] real X_slice,
int start, int end,
array[] real mu,
real sigma) {
return normal_lpdf(X_slice | mu[start:end],sigma);
}
// alternative 2: slice X by rows (2-D array) and flatten inside the partial sum
real partial_sum2(array[,] real X_slice,
int start, int end,
matrix Z,
matrix Wt,
real sigma) {
return normal_lpdf(to_array_1d(to_matrix(X_slice)) | to_array_1d(Z[start:end]*Wt),sigma);
}
}
data {
int<lower=1> N;
int<lower=1> D;
int<lower=1> M;
matrix[N,D] X;
}
transformed data {
//int grainsize = 4000;
int grainsize = 200;
}
parameters {
real<lower=0> tau_x;
vector<lower=0>[M] tau_w;
matrix[N,M] Z;
matrix[M,D] Wt;
}
transformed parameters {
//matrix[N,D] mu = Z*Wt;
real<lower=0> sigma_x = inv(sqrt(tau_x));
vector<lower=0>[M] sigma_w = inv(sqrt(tau_w));
}
model {
profile("priors") {
target += gamma_lpdf(tau_x | 1,1);
target += gamma_lpdf(tau_w | 0.001,0.001);
for (m in 1:M){
target += normal_lpdf(Wt[m,] | 0,sigma_w[m]);
}
target += normal_lpdf(to_vector(Z) | 0,1);
}
profile("likelihood") {
//target += reduce_sum(partial_sum1,to_array_1d(X),grainsize,to_array_1d(mu),sigma_x);
target += reduce_sum(partial_sum2,to_array_2d(X),grainsize,Z,Wt,sigma_x);
}
}
As I mentioned above, my goal is to efficiently implement within-chain parallelization via reduce_sum. To that end, I've implemented one sequential version (the second script, bpca_seq.stan) and two parallelized alternatives (the third script, bpca_par.stan, with the partial_sum1 call commented out and the partial_sum2 call active). Unfortunately, both parallelized alternatives are slower than the sequential version.
Interestingly, I've found a conflict between the structures that reduce_sum takes as inputs when operating in multivariate settings. On the one hand, reduce_sum needs a partial-sum function, such as partial_sum1 or partial_sum2, as its first argument; an array to slice over, such as to_array_2d(X), as its second argument; and so forth. On the other hand, the vectorized normal_lpdf expects one-dimensional containers as its first and second arguments. My best guess is therefore to pass the slices of \mathbf{X} as a 2-dimensional array and flatten them via to_array_1d(to_matrix(X_slice)), and to flatten the corresponding slice of \mathbf{Z}\mathbf{W}^{t} via to_array_1d(Z[start:end]*Wt).
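For reference, a row-wise variant I've been sketching looks like the snippet below (just a sketch, not benchmarked, and the name partial_sum_rows is mine): it accumulates normal_lpdf row by row over the sliced rows, so no to_matrix() call is needed inside the partial sum.
functions {
  // Sketch: slice X by rows and accumulate the log-density row by row.
  // X_slice[i] is a 1-D real array of length D, and Z[start + i - 1] * Wt
  // is the matching row_vector of means, so both stay in row order.
  real partial_sum_rows(array[,] real X_slice,
                        int start, int end,
                        matrix Z,
                        matrix Wt,
                        real sigma) {
    real lp = 0;
    for (i in 1:(end - start + 1)) {
      lp += normal_lpdf(X_slice[i] | Z[start + i - 1] * Wt, sigma);
    }
    return lp;
  }
}
It would be called exactly like partial_sum2, i.e. target += reduce_sum(partial_sum_rows, to_array_2d(X), grainsize, Z, Wt, sigma_x); whether this really removes the casting overhead is part of what I'm unsure about.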
The funny part is that if I write to_array_1d(X_slice) instead of to_array_1d(to_matrix(X_slice)), the computational speedup is around 4x; however, the estimates no longer match those obtained with the sequential version. This happens because to_array_1d() flattens 2-dimensional arrays in row-major order but matrices in column-major order, so the data and the means end up in different orders. By the way, could someone please enlighten me on how to properly use reduce_sum when dealing with matrices? I proceeded on the assumption that I can slice by rows, given the model specification.
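To make the ordering difference concrete, here is a toy snippet (values made up) showing the behaviour of to_array_1d() on a matrix versus a 2-D array:
transformed data {
  matrix[2, 3] A = [[1, 2, 3], [4, 5, 6]];
  array[2, 3] real B = {{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}};
  print(to_array_1d(A));   // column-major: 1, 4, 2, 5, 3, 6
  print(to_array_1d(B));   // row-major:    1, 2, 3, 4, 5, 6
}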
Thus, I was wondering whether it's possible to modify my code so as to obtain the same computational speedup as when to_matrix() is not used. I've tried different alternatives, yet all of them require casting functions that ultimately increase the computational burden.
All the best!
Roman.