Optimizing Stan code runtime for matrix factorization

Hi all

I am currently working on a tensor/matrix completion framework that factorizes the (square) matrix slices of the tensor into lower-rank matrices and imposes a correlating structure between the slices. That is, let the tensor of partially observed entries be denoted by \boldsymbol{\mathcal{A}} \in \mathbb{R}^{T \times M \times N \times N}. We can factorize the N \times N matrix slices of \boldsymbol{\mathcal{A}} into a collection of lower-rank matrices, \boldsymbol{\mathcal{U}} \in \mathbb{R}^{T \times M \times D \times N} and \boldsymbol{\mathcal{V}} \in \mathbb{R}^{T \times M \times D \times N}, arranged in tensor form, such that:

\boldsymbol{A}_{t,m,:,:}= \boldsymbol{U}_{t,m,:,:}^{T} \boldsymbol{V}_{t,m,:,:} + \varepsilon

The dimensions t, m can be thought of as being accompanied by some input data; let us denote this by \boldsymbol{x}_{a,T}, \boldsymbol{x}_{a,M}. We want to introduce dependencies between the entries at positions (row, column) ij and ji across all the matrices by using a GP model. That is, we denote the vector of entries extracted across all t, m at positions ij and ji as \boldsymbol{a}_{ij} and \boldsymbol{a}_{ji}, and combine these into a single vector such that \boldsymbol{a}^{(ij)}=[\boldsymbol{a}_{ij}, \boldsymbol{a}_{ji}].

Now for the modelling part. Suppose we have some priors on \boldsymbol{\mathcal{U}} and \boldsymbol{\mathcal{V}}, P(\boldsymbol{\mathcal{U}}) and P(\boldsymbol{\mathcal{V}}). We also derive a structured prior that incorporates the belief that the entries in \boldsymbol{\mathcal{A}} come both from a GP and from the matrix factorization, that is, prior densities specified as P(\boldsymbol{a}^{(ij)}|\boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}}). The probability density of all the reconstructed entries is then the product over the individual ij's:

P(\boldsymbol{\mathcal{A}}|\boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}}) = \prod_{i=1}^{N} {\prod_{j=i+1}^{N} {P(\boldsymbol{a}^{(ij)}|\boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}})} }

We can now relate these reconstructed entries to the data we have (the likelihood function). Firstly, the vector \boldsymbol{a}^{(ij)}, as mentioned, has some associated input data. We also have experimental data for ij, denoted by \boldsymbol{y}^{(ij)}, with input data \boldsymbol{x}_{y,T}^{(ij)}, \boldsymbol{x}_{y,M}^{(ij)}. That is, we can view \boldsymbol{a}^{(ij)} as data interpolated from the experimental data \boldsymbol{y}^{(ij)}, which we use to learn the lower-rank matrices \boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}} within a single framework. The likelihood can thus be described by the predictive equations of a GP, P(\boldsymbol{y}^{(ij)}|\boldsymbol{a}^{(ij)}), or across all the known datasets:

P(\boldsymbol{y}|\boldsymbol{\mathcal{A}}) = \prod_{i=1}^{N} \prod_{j=i+1}^{N} P(\boldsymbol{y}^{(ij)}|\boldsymbol{a}^{(ij)})^{I_{ij}}

where I_{ij} is an indicator function equal to 1 if dataset ij is known and zero otherwise. The full Bayesian model is thus:

P(\boldsymbol{\mathcal{A}}, \boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}}|\boldsymbol{y}) \propto P(\boldsymbol{y}|\boldsymbol{\mathcal{A}}) P(\boldsymbol{\mathcal{A}}|\boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}}) P(\boldsymbol{\mathcal{U}}) P(\boldsymbol{\mathcal{V}})
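
For reference, each likelihood term P(\boldsymbol{y}^{(ij)}|\boldsymbol{a}^{(ij)}) is the standard GP conditional (this is what the mean_y and L_y computations in the Stan code below implement), roughly of the form:

P(\boldsymbol{y}^{(ij)}|\boldsymbol{a}^{(ij)}) = \mathcal{N}\left(\boldsymbol{y}^{(ij)} \,\middle|\, \boldsymbol{K}_{y,MC}\boldsymbol{K}_{MC}^{-1}\boldsymbol{a}^{(ij)},\; \boldsymbol{K}_{y} - \boldsymbol{K}_{y,MC}\boldsymbol{K}_{MC}^{-1}\boldsymbol{K}_{y,MC}^{T} + \boldsymbol{\Sigma}^{(ij)}\right)

where the notation just mirrors the variable names in the code: \boldsymbol{K}_{MC} is the kernel over the interpolated inputs, \boldsymbol{K}_{y,MC} is the cross-covariance between the experimental and interpolated inputs, and \boldsymbol{\Sigma}^{(ij)} is a diagonal noise term.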

I have the following Stan code for this implementation:

functions {
    // kernel for composition
    matrix Kx(vector x1, vector x2, int order) {
        int N = rows(x1);
        int M = rows(x2);
        matrix[N, order+1] X1;
        matrix[M, order+1] X2;
        for (i in 1:order) {
            X1[:,i] = x1 .^(order+2-i) - x1;
            X2[:,i] = x2 .^(order+2-i) - x2;
        }
        X1[:,order+1] = 1e-1 * x1 .* sqrt(1-x1) .* exp(x1);
        X2[:,order+1] = 1e-1 * x2 .* sqrt(1-x2) .* exp(x2);
        return X1 * X2';
    }

    // kernel for Temperature
    matrix KT(vector T1, vector T2) {
        int N = rows(T1);
        int M = rows(T2);
        matrix[N, 4] TT1 = append_col(append_col(append_col(rep_vector(1.0, N), T1), 1e-2* T1.^2), 1e-4* T1.^3);
        matrix[M, 4] TT2 = append_col(append_col(append_col(rep_vector(1.0, M), T2), 1e-2* T2.^2), 1e-4* T2.^3);

        return TT1 * TT2';
    }

    // Combined kernel
    matrix K(vector x1, vector x2, vector T1, vector T2, int order) {
        return Kx(x1, x2, order) .* KT(T1, T2); 
    }

    // Reduce sum for parallel computing of log densities
    // Scaled Feature matrices priors
    real ps_F_raw(array[] int M_slice, int start, int end, array[] matrix F_raw) {
        real  F_raw_target = 0;
        int M = num_elements(M_slice); // number of feature-matrix indices in this slice
        for (m in 1:M) {
            F_raw_target += std_normal_lpdf(to_vector(F_raw[2*M_slice[m]-1,:,:])); // Prior for scaled U matrices
            F_raw_target += std_normal_lpdf(to_vector(F_raw[2*M_slice[m],:,:])); // Prior for scaled V matrices
        }
        return F_raw_target;
    }

    // Likelihood and smoothed excess enthalpy entries
    real ps_a_MC_y(array[] int N_slice, int start, int end, matrix y_MC, array[,] matrix F_raw, vector v_ARD, 
        int N_known, int N_unknown, int N_MC, int N_T, int M, array[,] int Idx_all, 
        array[] int N_points, vector y1, matrix K_y, vector v_D, matrix K_MC, vector x1, 
        vector T1, vector x2, vector T2, real error, int order, matrix L_MC) {
            int N = num_elements(N_slice); // number of datasets (a_MC rows) in this slice
            matrix[N, N_MC] a_MC;
            real all_target = 0;

            for (i in 1:N) {
                int counter = 1;
                for (t in 1:N_T) {
                    // composition up until just before 0.5
                    for (m in 1:M-1) {
                        a_MC[i, counter] = F_raw[t,m*2-1,:,Idx_all[N_slice[i],1]]' * diag_matrix(v_ARD) * F_raw[t,m*2,:,Idx_all[N_slice[i],2]];
                        counter += 1;
                    }
                    // composition of 0.5
                    a_MC[i, counter] = F_raw[t,2*M-1,:,Idx_all[N_slice[i],1]]' * diag_matrix(v_ARD) * F_raw[t,2*M-1,:,Idx_all[N_slice[i],2]];
                    counter += 1;
                    // composition from just after 0.5
                    for (m in 1:M-1) {
                        a_MC[i, counter] = F_raw[t,(M-m)*2-1,:,Idx_all[N_slice[i],2]]' * diag_matrix(v_ARD) * F_raw[t,(M-m)*2,:,Idx_all[N_slice[i],1]];
                        counter += 1;
                    }
                }

                all_target += multi_normal_cholesky_lpdf( y_MC[N_slice[i], :]' | a_MC[i,:]', L_MC);

                if (N_slice[i]<=N_known) {
                    matrix[N_points[N_slice[i]], N_MC] K_y_MC = K(x1[sum(N_points[:N_slice[i]-1])+1:sum(N_points[:N_slice[i]])], x2, T1[sum(N_points[:N_slice[i]-1])+1:sum(N_points[:N_slice[i]])], T2, order); // GP kernel for the cross-covariance
                    matrix[N_points[N_slice[i]], N_MC] stable_cov_y = mdivide_right(K_y_MC, (cholesky_decompose(K_MC))'); // stable computation of K_(y-MC)*K_(MC)^-1*K_(y-MC)^T
                    matrix[N_points[N_slice[i]], N_points[N_slice[i]]] L_y = cholesky_decompose(K_y[sum(N_points[:N_slice[i]-1])+1:sum(N_points[:N_slice[i]]), sum(N_points[:N_slice[i]-1])+1:sum(N_points[:N_slice[i]])] + diag_matrix(error*abs(y1[sum(N_points[:N_slice[i]-1])+1:sum(N_points[:N_slice[i]])])+v_D[N_slice[i]]) - stable_cov_y*stable_cov_y'); //covariance for the likelihood function
                    vector[N_points[N_slice[i]]] mean_y = mdivide_right_spd(K_y_MC, K_MC) * y_MC[N_slice[i], :]'; // mean for the likelihood function
                    all_target += multi_normal_cholesky_lpdf(y1[sum(N_points[:N_slice[i]-1])+1:sum(N_points[:N_slice[i]])] | mean_y, L_y);
                }
            }
            return all_target;
    }
    
}

data {
    int N_known; // number of known mixtures
    int N_unknown; // number of unknown mixtures
    array[N_known] int N_points; // number of experimental points per known dataset
    int order; // order of the compositional polynomial
    vector[sum(N_points)] x1; // experimental composition
    vector[sum(N_points)] T1; // experimental temperatures
    vector[sum(N_points)] y1; // experimental excess enthalpy
    int N_T; // number of interpolated temperatures
    int N_C; // number of interpolated compositions
    vector[N_T] T2_int; // unique temperatures to interpolate
    vector[N_C] x2_int; // unique compositions to interpolate

    int N; // number of components
    int D; // rank of feature matrices
    array[N_known, 2] int Idx_known; // indices (row, column) of known datasets
    array[N_unknown, 2] int Idx_unknown; // indices (row, column) of unknown datasets 
    vector<lower=0>[N_known+N_unknown] v_D; // data-model mismatch error
    real<lower=0> v_MC; // error between smoothed y_MC values and reconstructed entries

    real<lower=0> jitter; // jitter to stabilize inverse calculations
}

transformed data {
    real error=0.01;
    int M = (N_C + 1) %/% 2; // integer division to get the number of U matrices
    int N_MC = N_C*N_T; // overall number of interpolated datapoints per dataset
    vector[N_MC] x2; // concatenated vector of x2_int
    vector[N_MC] T2; // concatenated vector of T2_int
    matrix[N_MC, N_MC] K_MC; // kernel for the interpolated data
    matrix[sum(N_points), sum(N_points)] K_y; // kernel for the experimental data
    array[N_known+N_unknown] int N_slice;
    array[M-1] int M_slice;
    matrix[N_MC, N_MC] K_MC_inv;
    matrix[N_MC, N_MC] cov_y_MC;
    matrix[N_MC, N_MC] L_MC; // cholesky decomposition of the covariance of y_MC

    for (i in 1:N_known+N_unknown) {
        N_slice[i] = i;
    }

    for (i in 1:M-1) {
        M_slice[i] = i;
    }

    // Assign MC input vectors for temperature and composition
    for (i in 1:N_T) {
        x2[(i-1)*N_C+1:i*N_C] = x2_int;
        T2[(i-1)*N_C+1:i*N_C] = rep_vector(T2_int[i],N_C);
    } 

    // Assign MC kernel
    K_MC = K(x2, x2, T2, T2, order) + diag_matrix(rep_vector(jitter, N_MC));

    // Assign experimental data kernel
    for (i in 1:N_known) {
        K_y[sum(N_points[:i-1])+1:sum(N_points[:i]), sum(N_points[:i-1])+1:sum(N_points[:i])] = K(x1[sum(N_points[:i-1])+1:sum(N_points[:i])], x1[sum(N_points[:i-1])+1:sum(N_points[:i])], T1[sum(N_points[:i-1])+1:sum(N_points[:i])], T1[sum(N_points[:i-1])+1:sum(N_points[:i])], order);
    }


    // Compute cholesky decomposition of the cov of y_MC
    K_MC_inv = inverse_spd(K_MC);
    K_MC_inv = (K_MC_inv+K_MC_inv')/2; // ensure symmetry
    cov_y_MC = inverse_spd(K_MC_inv+ diag_matrix(rep_vector(v_MC^-1,N_MC)));
    L_MC = cholesky_decompose(cov_y_MC);
}

parameters {
    // MC parameters
    array[N_T, M*2-1] matrix[D,N] F_raw; // scaled feature matrices; M U matrices, and M-1 V matrices
    vector<lower=0>[D] v_ARD; // variance of the ARD framework
    real<lower=0> scale; // scale parameter dictating the strength of the ARD effect

    // Smoothed GP parameters
    matrix[N_known+N_unknown, N_MC] y_MC; // smoothed excess enthalpy values

    // Data parameters
    //vector<lower=0>[N_known+N_unknown] v_D; // variance for the data-model mismatch.
}

model {
    int grainsize_F_raw = 1;
    int grainsize_y_MC = 1;

    // prior on scale
    scale ~ gamma(2,1);

    // prior on v_ARD
    v_ARD ~ exponential(scale);

    // Prior on the data-model mismatch
    //v_D ~ exponential(1);

    // ARD priors for feature matrices
    for (t in 1:N_T) {
        target += reduce_sum(ps_F_raw, M_slice, grainsize_F_raw, F_raw[t,:2*M-1]);
        to_vector(F_raw[t,2*M-1,:,:]) ~ std_normal(); // Prior for scaled U matrix at x=0.5
    }

    target +=  reduce_sum(ps_a_MC_y, N_slice, grainsize_y_MC, y_MC, F_raw, v_ARD, 
        N_known, N_unknown, N_MC, N_T, M, append_array(Idx_known, Idx_unknown), 
        N_points, y1, K_y, v_D, K_MC, x1, 
        T1, x2, T2, error, order, L_MC);
}
  1. Is there any way I can optimize this code further?

  2. I have used the reduce_sum function to specify the priors and the likelihood of the model, as in the code provided. I have tested this and found that the plain code (using just for loops and no reduce_sum) took 30 min to 1 hour for optimization, which has been reduced to about 14 min per 1000 optimization iterations. I have played around with the number of Stan threads, from 4 up to 7, but this does not seem to have any influence on the runtime (I have an 8-core machine but will run this code on a cluster with significantly more resources; I do not want to request more cores if it will not reduce the runtime). Does this happen because the overhead costs grow when more cores are used? I have also tried changing the grainsize of the two reduce_sum calls, but that also did not seem to have a significant effect on the runtime. Any suggestions would be helpful.

  3. Also, with the plain code (no reduce_sum) the run took about 2.5 GB of RAM, but with the new model it takes about 600 MB. I did not expect this: the code is being run in parallel, so I assumed it would use more resources. Then again, in my original code I declared a matrix[N_known+N_unknown, N_MC] a_MC; in the model block, whereas I now declare it inside the reduce_sum partial-sum function as matrix[num_elements(N_slice), N_MC] a_MC;, which could explain the reduction in memory usage?

For reference, for the current dataset: N_MC = 57, N_known = 102, N_unknown = 58, N = 17, D = 10, N_T = 3, N_C = 19, and N_points (the number of experimental data points per dataset ij) ranges from 7 to 100. These values will become larger when I consider more data.

Also, I am using CmdStanPy 1.2.1 and CmdStan 2.34.0.


Hi,
I only have partial answers as I don’t fully understand the model, but since nobody else answered this, I’ll give it a go.

Just to state the possibly obvious: Stan performance is heavily influenced by the geometry of the posterior. Improving the geometry can thus lead to substantial speedups (because you reduce the treedepth necessary to reach a U-turn, so you compute fewer steps per iteration). So checking some pairs plots etc. to detect signs of possibly problematic geometry is likely a good time investment.

I also can’t imagine debugging such a huge model, so I would definitely try decomposing it into a set of smaller components that can be tested and optimized separately, and combining them only once the components work well in isolation (e.g. with SBC, following the Small model implementation workflow • SBC).

One possible optimization would be to use a Hilbert-space basis approximation for the Gaussian process - see Gaussian process demonstration with Stan - but note that you’ll have to derive the basis specifically for your kernel of interest.

There might be tiny bits of improvement in using compound operation functions like add_diag or dot_self instead of raw matrix operations whenever possible. I’d say that constructs like X' * diag_matrix(v_ARD) * Y should be equivalent to X' * Y * v_ARD - which would be a bit faster if I am right.
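
For example, an untested sketch based on two lines from your transformed data and partial-sum blocks:

// instead of building a dense diagonal matrix:
K_MC = K(x2, x2, T2, T2, order) + diag_matrix(rep_vector(jitter, N_MC));
// the compound function adds the jitter directly to the diagonal:
K_MC = add_diag(K(x2, x2, T2, T2, order), jitter);

// similarly, a matrix times its own transpose has a dedicated function:
// stable_cov_y * stable_cov_y'  can become  tcrossprod(stable_cov_y)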

You are also repeating the computation of cholesky_decompose(K_MC) within a loop in ps_a_MC_y.
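
Since K_MC is built in transformed data, you could factor it once there and pass the factor to ps_a_MC_y as an extra data argument, e.g. (untested sketch; L_K_MC is just an illustrative name):

transformed data {
    // ... existing declarations and assignments as before ...
    matrix[N_MC, N_MC] L_K_MC = cholesky_decompose(K_MC); // factor the data-only matrix once
}

// then, inside ps_a_MC_y, with L_K_MC passed in:
// matrix[N_points[N_slice[i]], N_MC] stable_cov_y = mdivide_right(K_y_MC, L_K_MC');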

I think (not 100% sure) that if you are doing cholesky_decompose yourself on the covariance matrix, you may as well just run multi_normal directly with the full matrix and this might be a bit faster (at the least, it is not going to save the decomposition on the autodiff stack).

Within-chain parallelization tends to have substantial overhead in moving values between threads. So yes, increasing the number of threads available may quickly reach diminishing returns. In that sense your situation appears typical.

This is somewhat expected - the difference is that with reduce_sum, the backward pass of the autodiff is computed separately for each slice and then only the final value and its gradients are passed back to the main thread. Without reduce_sum, Stan will build the autodiff stack for the whole model and then do a single reverse pass once the model block completes. For this reason, people sometimes report reduced memory usage (and even performance gains due to e.g. better memory locality) when using reduce_sum even when just a single core is used.

Hope that helps you move forward!


Thanks for the response @martinmodrak it is much appreciated!

I have tested the model and it works only when we provide initial values for \boldsymbol{\mathcal{U}} and \boldsymbol{\mathcal{V}} such that \boldsymbol{a}^{(ij)}\approx (\boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}})_{(ij)}, where (\boldsymbol{\mathcal{U}},\boldsymbol{\mathcal{V}})_{(ij)} is the product of the matrix factorization from which we only extract the entries at ij and ji. When these are not provided, the model tends to learn \boldsymbol{\mathcal{U}} and \boldsymbol{\mathcal{V}} to be zeros, so that \boldsymbol{a}^{(ij)} is only the interpolated data from the experimental values. This would indicate to me that the posterior is most likely multimodal, and the fact that the matrix factorization is unidentifiable contributes further to the modelling difficulties. I will check the post you linked and see whether it is useful for my model.

I’ll have a look at the documentation for this again, thanks.

I do not think this would be valid, since X and Y are both D-dimensional vectors (or, if we consider the full matrices, N\times D matrices) and diag_matrix(v_ARD) is a D\times D diagonal matrix. That is, the latent dimensions of X and Y are scaled by v_ARD, similar to an SVD. A similar construct would probably be to multiply element-wise and then take the dot product, i.e. dot_product(X .* v_ARD, Y). I’ll test this out and see if it makes a difference or not.
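
Applied to one of the a_MC lines, the rewrite would look something like this (untested):

// current version:
a_MC[i, counter] = F_raw[t,m*2-1,:,Idx_all[N_slice[i],1]]' * diag_matrix(v_ARD) * F_raw[t,m*2,:,Idx_all[N_slice[i],2]];
// element-wise scaling followed by a dot product, avoiding the D x D diag_matrix:
a_MC[i, counter] = dot_product(F_raw[t,m*2-1,:,Idx_all[N_slice[i],1]] .* v_ARD, F_raw[t,m*2,:,Idx_all[N_slice[i],2]]);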

Thanks, I made some adjustments to the code where previously I incorporated noise into the kernel matrix.

I followed the user guide on this, which stated that using multi_normal_cholesky is faster than multi_normal. That may have been an outdated version of the documentation, as I do not see this statement in the current documentation. I’ll see if this makes much of a difference.

I see, thanks. I’ll just stick to 4 cores for this

Thanks again for your response!


You are correct; I did think that v_ARD was a scalar, but that was a misunderstanding.

Yes, multimodality/non-identifiability can hurt your performance substantially.

Good luck going forward.


Not an optimization as such, but since you’re using CmdStanPy, have you tried running Pathfinder first to improve the inits?

example here: Variational Inference using Pathfinder — CmdStanPy 1.2.1 documentation

I did not know something like this existed; I’ll have a look at it. I have tried a bunch of different initializations, such as running MAP and VI from random points and using the VI means or the MAP-optimized parameters as inits. I do not know whether this is similar to what Pathfinder does?

Coming back to this, I have made an intermediate reparameterization: changing the multivariate normal on the smoothed values y_MC to a standard normal using the Cholesky decomposition.
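
Roughly, the change looks like this (a simplified sketch; names like y_MC_raw are just illustrative, and in the actual code the original-scale values are reconstructed inside the partial-sum function rather than in transformed parameters):

parameters {
    // non-centered (raw) version of the smoothed excess enthalpy values
    matrix[N_known+N_unknown, N_MC] y_MC_raw;
}
model {
    // standard normal prior on the raw parameter replaces the explicit
    // multi_normal_cholesky_lpdf(y_MC[i,:]' | a_MC[i,:]', L_MC) term
    to_vector(y_MC_raw) ~ std_normal();
    // inside the partial-sum function the original-scale values are recovered as
    // row_vector[N_MC] y_MC_i = a_MC[i,:] + (L_MC * y_MC_raw[N_slice[i],:]')';
}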

The speed-up I am experiencing is really significant. This initially took about 2 hours to draw 200 samples (100 warmup + 100 sampling). After the reparameterization it takes about 1 hour to draw 2000 samples (1000 warmup + 1000 sampling). I am not specifying the original parameter in the transformed parameters block, but rather using the raw parameter in the partial-sum likelihood function.

Q: Why is this speedup so significant when moving to the non-centered case?