Hi all
I am currently working on implementing some tensor/matrix completion framework but by factorizing the (square) matrix slices from the tensor into lower rank matrices and implementing some correlating structure between the matrix slices. That is let the tensor of partially observed entries be denoted by \boldsymbol{\mathcal{A}} \in \mathbb{R}^{T \times M \times N \times N}. We can factorize the N \times N matrix slices of \boldsymbol{\mathcal{A}} into a collection of lower rank matrices, \boldsymbol{\mathcal{U}} \in \mathbb{R}^{T \times M \times D \times N}, \boldsymbol{\mathcal{V}} \in \mathbb{R}^{T \times M \times D \times N} , arranged in a tensor form, such that:
We can look at the dimensions t, m as being accompanied with some input data, lets denote this by \boldsymbol{x}_{a,T}, \boldsymbol{x}_{a,M}. We want to introduce dependencies along the entries at positions (row and column) ij and ji across all the matrices by using some GP model. That is we denote the vector of entries extracted across all t, m at positions ij, ji as \boldsymbol{a}_{ij} and \boldsymbol{a}_{ji}. We combine these into a single vector such that \boldsymbol{a}^{(ij)}=[\boldsymbol{a}_{ij}, \boldsymbol{a}_{ji}].
Now for the modelling part. Suppose we have some priors on \boldsymbol{\mathcal{U}} and \boldsymbol{\mathcal{V}}, P(\boldsymbol{\mathcal{U}}), P(\boldsymbol{\mathcal{V}}). And we derived some prior for structure which incorporates the belief that the entries in \boldsymbol{\mathcal{A}} is from a GP and from the matrix factorization, that is prior densities specified as P(\boldsymbol{a}^{(ij)}\boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}}). The probability density of all the reconstructed entries can be given as the product of the individual ij's, that is:
We can now relate these reconstructed entries to the data we have (likelihood function). Firstly, the vector \boldsymbol{a}^{(ij)} as mentioned has some associated input data. We have some experimental data for ij which is denoted by \boldsymbol{y}^{(ij)} which has input data \boldsymbol{x}_{y,T}^{(ij)}, \boldsymbol{x}_{y,M}^{(ij)}. That is we can look at \boldsymbol{a}^{(ij)} as being interpolated data from the experimental data \boldsymbol{y}^{(ij)} which we use to learn the lower rank matrices \boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}} but in a single framework. The likelihood can thus be described by the predictive equations of a GP, P(\boldsymbol{y}^{(ij)}\boldsymbol{a}^{(ij)}). Or across all the known dataset:
Where: I_{ij} is an indicator function equals to 1 if dataset ij is known, zero otherwise. The full bayesian model is thus:
I have the following stan for this implementation for this code
functions {
// kernel for composition
matrix Kx(vector x1, vector x2, int order) {
int N = rows(x1);
int M = rows(x2);
matrix[N, order+1] X1;
matrix[M, order+1] X2;
for (i in 1:order) {
X1[:,i] = x1 .^(order+2i)  x1;
X2[:,i] = x2 .^(order+2i)  x2;
}
X1[:,order+1] = 1e1 * x1 .* sqrt(1x1) .* exp(x1);
X2[:,order+1] = 1e1 * x2 .* sqrt(1x2) .* exp(x2);
return X1 * X2';
}
// kernel for Temperature
matrix KT(vector T1, vector T2) {
int N = rows(T1);
int M = rows(T2);
matrix[N, 4] TT1 = append_col(append_col(append_col(rep_vector(1.0, N), T1), 1e2* T1.^2), 1e4* T1.^3);
matrix[M, 4] TT2 = append_col(append_col(append_col(rep_vector(1.0, M), T2), 1e2* T2.^2), 1e4* T2.^3);
return TT1 * TT2';
}
// Combined kernel
matrix K(vector x1, vector x2, vector T1, vector T2, int order) {
return Kx(x1, x2, order) .* KT(T1, T2);
}
// Reduce sum for parallel computing of log densities
// Scaled Feature matrices priors
real ps_F_raw(array[] int M_slice, int start, int end, array[] matrix F_raw) {
real F_raw_target = 0;
int M = num_elements(M_slice); // number of a_MC in the parallel space
for (m in 1:M) {
F_raw_target += std_normal_lpdf(to_vector(F_raw[2*M_slice[m]1,:,:])); // Prior for scaled U matrices
F_raw_target += std_normal_lpdf(to_vector(F_raw[2*M_slice[m],:,:])); // Prior for scaled V matrices
}
return F_raw_target;
}
// Likelihood and smoothed excess enthalpy entries
real ps_a_MC_y(array[] int N_slice, int start, int end, matrix y_MC, array[,] matrix F_raw, vector v_ARD,
int N_known, int N_unknown, int N_MC, int N_T, int M, array[,] int Idx_all,
array[] int N_points, vector y1, matrix K_y, vector v_D, matrix K_MC, vector x1,
vector T1, vector x2, vector T2, real error, int order, matrix L_MC) {
int N = num_elements(N_slice); // number of a_MC in the parallel space
matrix[N, N_MC] a_MC;
real all_target = 0;
for (i in 1:N) {
int counter = 1;
for (t in 1:N_T) {
// composition up until just before 0.5
for (m in 1:M1) {
a_MC[i, counter] = F_raw[t,m*21,:,Idx_all[N_slice[i],1]]' * diag_matrix(v_ARD) * F_raw[t,m*2,:,Idx_all[N_slice[i],2]];
counter += 1;
}
// composition of 0.5
a_MC[i, counter] = F_raw[t,2*M1,:,Idx_all[N_slice[i],1]]' * diag_matrix(v_ARD) * F_raw[t,2*M1,:,Idx_all[N_slice[i],2]];
counter += 1;
// composition from just after 0.5
for (m in 1:M1) {
a_MC[i, counter] = F_raw[t,(Mm)*21,:,Idx_all[N_slice[i],2]]' * diag_matrix(v_ARD) * F_raw[t,(Mm)*2,:,Idx_all[N_slice[i],1]];
counter += 1;
}
}
all_target += multi_normal_cholesky_lpdf( y_MC[N_slice[i], :]'  a_MC[i,:]', L_MC);
if (N_slice[i]<=N_known) {
matrix[N_points[N_slice[i]], N_MC] K_y_MC = K(x1[sum(N_points[:N_slice[i]1])+1:sum(N_points[:N_slice[i]])], x2, T1[sum(N_points[:N_slice[i]1])+1:sum(N_points[:N_slice[i]])], T2, order); // GP kernel for the cross covaraince
matrix[N_points[N_slice[i]], N_MC] stable_cov_y = mdivide_right(K_y_MC, (cholesky_decompose(K_MC))') ; // stable decomposition the K_(yMC)*K_(MC)^1*K_(yMC)^T
matrix[N_points[N_slice[i]], N_points[N_slice[i]]] L_y = cholesky_decompose(K_y[sum(N_points[:N_slice[i]1])+1:sum(N_points[:N_slice[i]]), sum(N_points[:N_slice[i]1])+1:sum(N_points[:N_slice[i]])] + diag_matrix(error*abs(y1[sum(N_points[:N_slice[i]1])+1:sum(N_points[:N_slice[i]])])+v_D[N_slice[i]])  stable_cov_y*stable_cov_y'); //covariance for the likelihood function
vector[N_points[N_slice[i]]] mean_y = mdivide_right_spd(K_y_MC, K_MC) * y_MC[N_slice[i], :]'; //mean for the likleihood function
all_target += multi_normal_cholesky_lpdf(y1[sum(N_points[:N_slice[i]1])+1:sum(N_points[:N_slice[i]])]  mean_y, L_y);
}
}
return all_target;
}
}
data {
int N_known; // number of known mixtures
int N_unknown; // number of unknown mixtures
array[N_known] int N_points; // number of experimental points per known dataset
int order; // order of the compositional polynomial
vector[sum(N_points)] x1; // experimental composition
vector[sum(N_points)] T1; // experimental temperatures
vector[sum(N_points)] y1; // experimental excess enthalpy
int N_T; // number of interpolated temperatures
int N_C; // number of interpolated compositions
vector[N_T] T2_int; // unique temperatures to interpolate
vector[N_C] x2_int; // unique compositions to interpolate
int N; // number of components
int D; // rank of feature matrices
array[N_known, 2] int Idx_known; // indices (row, column) of known datasets
array[N_unknown, 2] int Idx_unknown; // indices (row, column) of unknown datasets
vector<lower=0>[N_known+N_unknown] v_D; // datamodel mistamatch error
real<lower=0> v_MC; // error between smoothed y_MC values and reconstructed entries
real<lower=0> jitter; // jitter to stabalize inverse calculations
}
transformed data {
real error=0.01;
int M = (N_C + 1) %/% 2; // interger division to get the number of U matrices
int N_MC = N_C*N_T; // overall number of interpolated datapoints per dataset
vector[N_MC] x2; // concatnated vector of x2_int
vector[N_MC] T2; // concatenated vector of T2_int
matrix[N_MC, N_MC] K_MC; // kernel for the interpolated data
matrix[sum(N_points), sum(N_points)] K_y; // kernel for the experimental data
array[N_known+N_unknown] int N_slice;
array[M1] int M_slice;
matrix[N_MC, N_MC] K_MC_inv;
matrix[N_MC, N_MC] cov_y_MC;
matrix[N_MC, N_MC] L_MC; // cholesky decomposition of the covariance of y_MC
for (i in 1:N_known+N_unknown) {
N_slice[i] = i;
}
for (i in 1:M1) {
M_slice[i] = i;
}
// Assign MC input vectors for temperature and composition
for (i in 1:N_T) {
x2[(i1)*N_C+1:i*N_C] = x2_int;
T2[(i1)*N_C+1:i*N_C] = rep_vector(T2_int[i],N_C);
}
// Assign MC kernel
K_MC = K(x2, x2, T2, T2, order) + diag_matrix(rep_vector(jitter, N_MC));
// Assign experimental data kernel
for (i in 1:N_known) {
K_y[sum(N_points[:i1])+1:sum(N_points[:i]), sum(N_points[:i1])+1:sum(N_points[:i])] = K(x1[sum(N_points[:i1])+1:sum(N_points[:i])], x1[sum(N_points[:i1])+1:sum(N_points[:i])], T1[sum(N_points[:i1])+1:sum(N_points[:i])], T1[sum(N_points[:i1])+1:sum(N_points[:i])], order);
}
// Compute cholesky decomposition of the cov of y_MC
K_MC_inv = inverse_spd(K_MC);
K_MC_inv = (K_MC_inv+K_MC_inv')/2; //ensure symmetricness
cov_y_MC = inverse_spd(K_MC_inv+ diag_matrix(rep_vector(v_MC^1,N_MC)));
L_MC = cholesky_decompose(cov_y_MC);
}
parameters {
// MC parameters
array[N_T, M*21] matrix[D,N] F_raw; // scaled feature matrices; M U matrices, and M1 V matrices
vector<lower=0>[D] v_ARD; // variance of the ARD framework
real<lower=0> scale; // scale parameter dictating strenght of ARD effect
// Smoothed GP parameters
matrix[N_known+N_unknown, N_MC] y_MC; // smoothed excess enthalpy values
// Data parameters
//vector<lower=0>[N_known+N_unknown] v_D; // variance for the datamodel mismatch.
}
model {
int grainsize_F_raw = 1;
int grainsize_y_MC = 1;
// prior on scale
scale ~ gamma(2,1);
// prior on v_ARD
v_ARD ~ exponential(scale);
// Prior on the datamodel mismatch
//v_D ~ exponential(1);
// ARD priors for feature matrices
for (t in 1:N_T) {
target += reduce_sum(ps_F_raw, M_slice, grainsize_F_raw, F_raw[t,:2*M1]);
to_vector(F_raw[t,2*M1,:,:]) ~ std_normal(); // Prior for scaled U matrix at x=0.5
}
target += reduce_sum(ps_a_MC_y, N_slice, grainsize_y_MC, y_MC, F_raw, v_ARD,
N_known, N_unknown, N_MC, N_T, M, append_array(Idx_known, Idx_unknown),
N_points, y1, K_y, v_D, K_MC, x1,
T1, x2, T2, error, order, L_MC);
}

I wanted to know if there is any way I can optimize this code further?

I have used the
reduce_sum
functions to specify the priors and likelihood on the model as in the code provided. I have tested this code and found that for the normal code (using just for loops and notreduce_sum
) optimization took 30min  1 hour, which has been reduced to about 14 min per 1000 optimization iterations. I have played around by changing the number of stan threads used from 4 up until 7 but this does not seem to have any influence on the run time (I have an 8 core machine but will run this code on a cluster with significantly more resources, but I do not want to use more cores if it will not help with faster runtimes). Does this occur due to the overhead costs being significantly more when we use more cores? I have also tried to change the grainsize of the 2reduce_sum
functions but these also did not seem to have a significant effect on the runtime. Any suggestions would be helpful. 
Also with the normal code I ran (no
reduce_sum
) the code took about 2.5 GB of ram but with the new model it takes about 600 MB. This is something I did not expect to happen and the code is being ran in parallel thus it should’ve used more resources? But then again in my original code I specified a matrix in the model blockmatrix[N_known+N_unknown, N_MC] a_MC;
which probably lead to the reduction in memory usage as I now specify this matrix in thereduce_sum
function, as anmatrix[num_elements(N_slice), N_MC] a_MC
; which could lead to the reduction in memory usage?
For reference for the current dataset int N_MC = 57
, int N_known = 102
, int N_unknown = 58
, int N=17
, int D=10
, int N_T = 3
, int N_C=19
, N_points
(number of experimental datapoints per dataset ij) ranges from 7100. These values will becomes larger when I consider more data.
Also I am using cmdstanpy 1.2.1 and cmdstan 2.34.0