Hello Stan Community!

In a previous post I was asking for your support to efficiently implement within-chain parallelization via `reduce_sum`

in a Bayesian PCA. I’ve made some progress, however, got stuck again.

Aiming at reproducing my issues, I’ve implemented a Bayesian PCA for simulated data. Assuming that

where \mathbf{W}\in\mathbb{R}^{D \times M} with M<D, \mathbf{z}_{n} \sim \mathcal{N}_{M}(\mathbf{0},\mathbf{I}), and \boldsymbol{\epsilon}_{n} \sim \mathcal{N}_{D}(\mathbf{0},\sigma^{2}\mathbf{I}) for n=1,2,\ldots,N.

Furthermore, proposing some priors for the latent vectors and hyperparameters, the Bayesian PCA model is as follows:

where \mathbf{w}_{m} is the m column of \mathbf{W}.

Using the following \mathsf{R} and \mathsf{Stan} code you can simulate data and implement Bayesian PCA in a sequential and parallelized manner, respectively.

```
library(cmdstanr)
library(posterior)
library(bayesplot)
### Simulated data
N<-200
D<-20
M<-5
set.seed(1987)
Z<-matrix(data=rnorm(N*M),nrow=N,ncol=M)
sigma_w<-rep(1,M)
Wt<-matrix(NA,nrow=M,ncol=D)
for(m in 1:M){
Wt[m,]=rnorm(D,0,sigma_w[m])
}
sigma_x<-1
X<-Z%*%Wt+matrix(data=rnorm(N*D,0,sigma_x),nrow=N,ncol=D)
### Inference
data <- list(N=N,D=D,M=10,X=X)
## Sequential version
mod_seq <- cmdstan_model(stan_file="bpca_seq.stan",
cpp_options=list(stan_threads=FALSE))
fit_seq <- mod_seq$sample(data=data,seed=87,chains=1,iter_warmup=1000,iter_sampling=15000,parallel_chains=1)
## Parallelized version
mod_par <- cmdstan_model(stan_file="bpca_par.stan",
cpp_options=list(stan_threads=TRUE))
fit_par <- mod_par$sample(data=data,seed=87,chains=1,iter_warmup=1000,iter_sampling=15000,parallel_chains=1,threads_per_chain=16)
## Estimates
fit_seq$summary(variables=c("tau_x","tau_w"))
fit_par$summary(variables=c("tau_x","tau_w"))
## Timing
fit_seq$profiles()
fit_par$profiles()
```

```
data {
int<lower=1> N;
int<lower=1> D;
int<lower=1> M;
matrix[N,D] X;
}
parameters {
real<lower=0> tau_x;
vector<lower=0>[M] tau_w;
matrix[N,M] Z;
matrix[M,D] Wt;
}
transformed parameters {
matrix[N,D] mu = Z*Wt;
real<lower=0> sigma_x = inv(sqrt(tau_x));
vector<lower=0>[M] sigma_w = inv(sqrt(tau_w));
}
model {
profile("priors") {
tau_x ~ gamma(1,1);
tau_w ~ gamma(.001,.001);
for (m in 1:M){
Wt[m,] ~ normal(0,sigma_w[m]);
}
to_vector(Z) ~ normal(0,1);
}
profile("likelihood") {
to_vector(X) ~ normal(to_vector(mu),sigma_x);
}
}
```

```
functions {
real partial_sum1(array[] real X_slice,
int start, int end,
array[] real mu,
real sigma) {
return normal_lpdf(X_slice | mu[start:end],sigma);
}
real partial_sum2(array[,] real X_slice,
int start, int end,
matrix Z,
matrix Wt,
real sigma) {
return normal_lpdf(to_array_1d(to_matrix(X_slice)) | to_array_1d(Z[start:end]*Wt),sigma);
}
}
data {
int<lower=1> N;
int<lower=1> D;
int<lower=1> M;
matrix[N,D] X;
}
transformed data {
//int grainsize = 4000;
int grainsize = 200;
}
parameters {
real<lower=0> tau_x;
vector<lower=0>[M] tau_w;
matrix[N,M] Z;
matrix[M,D] Wt;
}
transformed parameters {
//matrix[N,D] mu = Z*Wt;
real<lower=0> sigma_x = inv(sqrt(tau_x));
vector<lower=0>[M] sigma_w = inv(sqrt(tau_w));
}
model {
profile("priors") {
target += gamma_lpdf(tau_x | 1,1);
target += gamma_lpdf(tau_w | 0.001,0.001);
for (m in 1:M){
target += normal_lpdf(Wt[m,] | 0,sigma_w[m]);
}
target += normal_lpdf(to_vector(Z) | 0,1);
}
profile("likelihood") {
//target += reduce_sum(partial_sum1,to_array_1d(X),grainsize,to_array_1d(mu),sigma_x);
target += reduce_sum(partial_sum2,to_array_2d(X),grainsize,Z,Wt,sigma_x);
}
}
```

As I mentioned above, my goal is to efficiently implement within-chain parallelization via `reduce_sum`

. To do so, I’ve implemented one alternative for the sequential version and two alternatives for the parallelized version (by means of using the second and third script, respectively). Unfortunately, the latter ones are slower than the former one.

Interestingly, I’ve found a conflict with the structures that the `reduce_sum`

function uses as inputs when operating at multivariate settings. On the one hand, `reduce_sum`

needs a function, such as `partial_sum1`

or `partial_sum2`

, as first argument; an array, such as `to_array_2d(X)`

, as second argument; and so forth. On the other hand, `normal_lpdf`

needs 1-dimensional arrays as first and second argument. Then, my best guess is to provide slices of \mathbf{X} as a 2-dimensional array and cast it as a 1-dimensional array as `to_array_1d(to_matrix(X_slice))`

and the corresponding slice of \mathbf{Z}\mathbf{W}^{t} as `to_array_1d(Z[start:end]*Wt)`

.

The funny part is that if I type `to_array_1d(X_slice)`

instead of `to_array_1d(to_matrix(X_slice))`

the computational speedup is around 4 times, however, the estimates ‘don’t match’ those ones obtained by the sequential version. This is because the function `to_array_1d()`

operates in a different manner for arrays and matrices. By the way, could someone please enlighten me about how to properly use `reduce_sum`

when dealing with matrices? I proceeded by assuming that I can slice by rows given the model specification.

Thus, I was wondering if it’s possible to modify my code in order to get the computational speedup as when not using the function `to_matrix()`

. I’ve tried different alternatives, yet all of them demand applying casting functions which ultimately increase the computational burden.

All the best!

Roman.