# Issues with arrays, matrices, and reduce_sum for Bayesian PCA

Hello Stan Community!

In a previous post I was asking for your support to efficiently implement within-chain parallelization via reduce_sum in a Bayesian PCA. I’ve made some progress, however, got stuck again.

Aiming at reproducing my issues, I’ve implemented a Bayesian PCA for simulated data. Assuming that

\mathbf{x}_{n}=\mathbf{W}\mathbf{z}_{n}+\boldsymbol{\epsilon}_{n},

where \mathbf{W}\in\mathbb{R}^{D \times M} with M<D, \mathbf{z}_{n} \sim \mathcal{N}_{M}(\mathbf{0},\mathbf{I}), and \boldsymbol{\epsilon}_{n} \sim \mathcal{N}_{D}(\mathbf{0},\sigma^{2}\mathbf{I}) for n=1,2,\ldots,N.
Furthermore, proposing some priors for the latent vectors and hyperparameters, the Bayesian PCA model is as follows:

\mathbf{x}_{n}\mid\mathbf{z}_{n} \sim \mathcal{N}_{D}(\mathbf{W}\mathbf{z}_{n},\sigma^{2}\mathbf{I})\\ \mathbf{z}_{n} \sim \mathcal{N}_{M}(\mathbf{0},\mathbf{I})\\ \mathbf{w}_{m}\mid\alpha_{m} \sim \mathcal{N}_{D}(\mathbf{0},\alpha_{m}^{-2}\mathbf{I})\\ \alpha_{m}^{-2} \sim \mathrm{Gamma}(0.001,0.001)\\ \sigma^{-2} \sim \mathrm{Gamma}(1,1)

where \mathbf{w}_{m} is the m column of \mathbf{W}.

Using the following \mathsf{R} and \mathsf{Stan} code you can simulate data and implement Bayesian PCA in a sequential and parallelized manner, respectively.

library(cmdstanr)
library(posterior)
library(bayesplot)

### Simulated data
N<-200
D<-20
M<-5
set.seed(1987)
Z<-matrix(data=rnorm(N*M),nrow=N,ncol=M)
sigma_w<-rep(1,M)
Wt<-matrix(NA,nrow=M,ncol=D)
for(m in 1:M){
Wt[m,]=rnorm(D,0,sigma_w[m])
}
sigma_x<-1
X<-Z%*%Wt+matrix(data=rnorm(N*D,0,sigma_x),nrow=N,ncol=D)
### Inference
data <- list(N=N,D=D,M=10,X=X)
## Sequential version
mod_seq <- cmdstan_model(stan_file="bpca_seq.stan",
fit_seq <- mod_seq$sample(data=data,seed=87,chains=1,iter_warmup=1000,iter_sampling=15000,parallel_chains=1) ## Parallelized version mod_par <- cmdstan_model(stan_file="bpca_par.stan", cpp_options=list(stan_threads=TRUE)) fit_par <- mod_par$sample(data=data,seed=87,chains=1,iter_warmup=1000,iter_sampling=15000,parallel_chains=1,threads_per_chain=16)
## Estimates
fit_seq$summary(variables=c("tau_x","tau_w")) fit_par$summary(variables=c("tau_x","tau_w"))
## Timing
fit_seq$profiles() fit_par$profiles()

data {
int<lower=1> N;
int<lower=1> D;
int<lower=1> M;
matrix[N,D] X;
}
parameters {
real<lower=0> tau_x;
vector<lower=0>[M] tau_w;
matrix[N,M] Z;
matrix[M,D] Wt;
}
transformed parameters {
matrix[N,D] mu = Z*Wt;
real<lower=0> sigma_x = inv(sqrt(tau_x));
vector<lower=0>[M] sigma_w = inv(sqrt(tau_w));
}
model {
profile("priors") {
tau_x ~ gamma(1,1);
tau_w ~ gamma(.001,.001);
for (m in 1:M){
Wt[m,] ~ normal(0,sigma_w[m]);
}
to_vector(Z) ~ normal(0,1);
}
profile("likelihood") {
to_vector(X) ~ normal(to_vector(mu),sigma_x);
}
}

functions {
real partial_sum1(array[] real X_slice,
int start, int end,
array[] real mu,
real sigma) {
return normal_lpdf(X_slice | mu[start:end],sigma);
}
real partial_sum2(array[,] real X_slice,
int start, int end,
matrix Z,
matrix Wt,
real sigma) {
return normal_lpdf(to_array_1d(to_matrix(X_slice)) | to_array_1d(Z[start:end]*Wt),sigma);
}
}
data {
int<lower=1> N;
int<lower=1> D;
int<lower=1> M;
matrix[N,D] X;
}
transformed data {
//int grainsize = 4000;
int grainsize = 200;
}
parameters {
real<lower=0> tau_x;
vector<lower=0>[M] tau_w;
matrix[N,M] Z;
matrix[M,D] Wt;
}
transformed parameters {
//matrix[N,D] mu = Z*Wt;
real<lower=0> sigma_x = inv(sqrt(tau_x));
vector<lower=0>[M] sigma_w = inv(sqrt(tau_w));
}
model {
profile("priors") {
target += gamma_lpdf(tau_x | 1,1);
target += gamma_lpdf(tau_w | 0.001,0.001);
for (m in 1:M){
target += normal_lpdf(Wt[m,] | 0,sigma_w[m]);
}
target += normal_lpdf(to_vector(Z) | 0,1);
}
profile("likelihood") {
//target += reduce_sum(partial_sum1,to_array_1d(X),grainsize,to_array_1d(mu),sigma_x);
target += reduce_sum(partial_sum2,to_array_2d(X),grainsize,Z,Wt,sigma_x);
}
}


As I mentioned above, my goal is to efficiently implement within-chain parallelization via reduce_sum. To do so, I’ve implemented one alternative for the sequential version and two alternatives for the parallelized version (by means of using the second and third script, respectively). Unfortunately, the latter ones are slower than the former one.

Interestingly, I’ve found a conflict with the structures that the reduce_sum function uses as inputs when operating at multivariate settings. On the one hand, reduce_sum needs a function, such as partial_sum1 or partial_sum2, as first argument; an array, such as to_array_2d(X), as second argument; and so forth. On the other hand, normal_lpdf needs 1-dimensional arrays as first and second argument. Then, my best guess is to provide slices of \mathbf{X} as a 2-dimensional array and cast it as a 1-dimensional array as to_array_1d(to_matrix(X_slice)) and the corresponding slice of \mathbf{Z}\mathbf{W}^{t} as to_array_1d(Z[start:end]*Wt).

The funny part is that if I type to_array_1d(X_slice) instead of to_array_1d(to_matrix(X_slice)) the computational speedup is around 4 times, however, the estimates ‘don’t match’ those ones obtained by the sequential version. This is because the function to_array_1d() operates in a different manner for arrays and matrices. By the way, could someone please enlighten me about how to properly use reduce_sum when dealing with matrices? I proceeded by assuming that I can slice by rows given the model specification.

Thus, I was wondering if it’s possible to modify my code in order to get the computational speedup as when not using the function to_matrix(). I’ve tried different alternatives, yet all of them demand applying casting functions which ultimately increase the computational burden.

All the best!

Roman.

2 Likes

I think the general problem you’re going to run into here is that the work being done on the chains is very modest, just a normal log density function, which only involves quadratic functions and constants (you can make this even more efficient using normal_lupdf to drop the normalizing constants you shouldn’t need (at least I think you can call lupdf here—maybe not). Parallelizing really pays off when the communication overhead is a lot smaller than the number of operations (like with ODE solvers or even matrix multiply where you have quadratic data to communication and cubic work).

@andrewgelman wrote an entire paper about why the gamma(epsilon, epsilon) priors on precisions are problematic: http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf

With to_array_1d you have to be careful not to accidentally swap rows and columns if you have matrices and arrays. I think that’s what you were alluding to. We keep matrices in column major order and arrays in row major order. Sorry about that—I feel really bad in that I’m the one who introduced this problem into the language. I should’ve went with a uniform memory layout and a view-based system on top of that for free casting (like NumPy does or Eigen C++ does internally, for example).

Matrices are column-major, so it’s more efficient to pull out columns than rows. Either way, pulling one of these out usually does an allocation and a copy, so it’s not super expensive because there’s no autodiff overhead, but it does add memory pressure, which is problematic in parallel code.

Unless you need these in the output, I would suggest moving these to be local variables in the model block so as not to add overhead to the output, which can also be a bottleneck in cluster-based applications where disk is often slow.

This is going to be inefficient because it’s pulling rows out of Wt. Rearranging Wt to be an array of vectors rather than a matrix would allow you avoid having to do that and I don’t see anything else where Wt is used as a matrix.

Rather than converting to a 2D matrix, why not just store X as a 2D matrix in the data? It’s never used as a vector that I can see.

1 Like

Thanks a lot @Bob_Carpenter for your reply; I really appreciate your comments and suggestions.

The model I’m implementing is a Multinomial Logistic Regression in which the explanatory variables are the coordinates of the latent variables \mathbf{z}_{n}=(z_{n1},z_{n2},\ldots,z_{nM})^{t}, however, the computational bottleneck is happening when computing the likelihood of \mathbf{x}_{n}. That’s why I wanted to first address the efficiency of the implementation of a Bayesian PCA.

Thanks for this suggestion; it helped a bit.

Valuable insight about reduce_sum.

I also appreciate this advice about priors on precision parameters.

Useful information about the nuances of to_array_1d.

I did try an implementation of the aforementioned Bayesian PCA model in which \mathbf{X}^{t}=\mathbf{Z}^{t}\mathbf{W} and the slicing was column-based (normal_lupdf(to_array_1d(to_matrix(Xt_slice[:,start:end])) | to_array_1d(W*Zt[:,start:end]),sigma);) but didn’t work (incorrect estimates) as it seems that reduce_sum operates over the first dimension, that is, the slicing is row-based in this case (normal_lupdf(to_array_1d(X_slice) | to_array_1d(W*Zt[:,start:end]),sigma); produced correct estimates). Anyway, I couldn’t gain speed in a consistent manner with respect with the row-based version as it depends on the values of N, D and M.

This also worked.

Wt is used as matrix in normal_lpdf(to_array_1d(to_matrix(X_slice)) | to_array_1d(Z[start:end]*Wt),sigma); and I couldn’t figure out another way for specifying \mathbf{w}_{m} \mid \alpha_{m} \sim \mathcal{N}_{D}(\mathbf{0},\alpha_{m}^{-2}\mathbf{I}), where \mathbf{w}_{m} is the m column of \mathbf{W}.

You are right! That also helped.

Overall, I managed to reduce the computational burden of the sequential version, yet the global speedup was around 2x. More importantly, reduce_sum seems to operates (slices) over the first dimension of the array, and thus, we must need to be careful about not swapping rows and columns as you mentioned above when applying casting functions.

You can always just slice a dummy vector that is unused in the likelihood and then index into all of the important arguments to partial_sum using start:end along the appropriate dimension. I think there was a time when this would have incurred a potential efficiency penalty, but that is mostly a thing of the past. The only exception is if you have a really large (like tens of thousands of elements or more) parameter vector, in which case it is important to slice that directly. The reason for this is because parameters (unlike data) need to be updated at each leapfrog step, and passing the full updated parameter vector out to the workers gets expensive when the parameter vector is huge. The (transformed) data are immutable, so they just get passed out once no matter how big they are.

1 Like

Thanks a lot @jsocolar for your reply.

Being honest, I didn’t understand this point. Do you mind illustrating it with a simple example or providing me with pointers to material in which this is implemented?

I think I understand the logic of this point, though.

1 Like

Something like

  real partial_sum1(vector slice_dummy,
int start, int end,
array[] real X,
array[] real mu,
real sigma) {
return normal_lpdf(X[start:end] | mu[start:end],sigma);
}


where slice_dummy is a vector passed as data of the correct length (i.e. the number of elements of X or mu. Now, if X is a matrix with the appropriate number of columns, you can do X[,start:end].

1 Like

I know understand what you meant. Thanks a lot @jsocolar for the advice; I’ll definitely try it out.