Possibly improving the performance of Multivariate Normal Distribution

I run the following Stan program as a parameter-estimation task for a three-factor model of market prices.

	data {
		int N_days;
		int N_markets;
		int N_factors;
		int minT;
		int N_tenors;
		vector[N_markets*N_tenors] y[N_days];
		real mu;
		real nu;
	}
	transformed data {
		vector[N_markets*N_tenors] mu_vec;
		for (i in 1:(N_markets*N_tenors)) mu_vec[i] = mu;
	}
	parameters {
		real<lower=0, upper=10> a[N_markets];
		real<lower=0, upper=10> b[N_markets];
		real<lower=0, upper=10> c[N_markets];
		real<lower=0.05, upper=0.2> d[N_markets];
		// independent noise
		vector<lower=0>[N_markets] sigma;
		cholesky_factor_corr[N_factors*N_markets] L;
	}
	transformed parameters {
		vector[N_markets] sigma2;
		vector[N_markets*N_tenors] sigma2_vec;
		sigma2 = sigma .* sigma;
		for (i in 1:N_tenors)
			for (u in 1:N_markets)
				sigma2_vec[(i-1)*N_markets+u] = sigma2[u];
	}
	model {
		matrix[N_markets*N_tenors, N_markets*N_factors] Z;
		matrix[N_markets*N_tenors, N_markets*N_tenors] LL;
		real z; // factor-loading value; its computation is omitted here

		Z = rep_matrix(0, N_markets*N_tenors, N_markets*N_factors);
		for (i in 1:N_tenors) {
			for (u in 1:N_markets) {
				// this is adapted to the 3-factor model.
				Z[(i-1)*N_markets+u, (u-1)*N_factors+1] = z;
				Z[(i-1)*N_markets+u, (u-1)*N_factors+2] = sqrt(z)*c[u];
				Z[(i-1)*N_markets+u, (u-1)*N_factors+3] = d[u];
			}
		}

		LL = cholesky_decompose(add_diag(Z*multiply_lower_tri_self_transpose(L)*Z', sigma2_vec)/252.0);

		L ~ lkj_corr_cholesky(nu);

		y ~ multi_normal_cholesky(mu_vec, LL);
	}
	generated quantities {
		matrix[N_factors*N_markets, N_factors*N_markets] Cov;
		Cov = multiply_lower_tri_self_transpose(L); // L*L'
	}

For small dimensions, the performance is reasonably good, with a run time of around 10 minutes. By small dimensions, I mean N_days=625, N_factors=3, N_markets=2 and N_tenors=48, resulting in a Cholesky matrix L\in \mathbb{R}^{n\times n} with n=6.

However, the program scales really badly. For N_markets=13 and a resulting Cholesky matrix with n=39, the program takes over a week to finish, or rather, does not finish at all.

I was wondering whether there is a way to restructure the code to improve the performance for large dimensions, or whether this scaling behavior is to be expected and there is not much I can do?

Unfortunately, I cannot provide the actual dataset, but if need be, I should be able to patch together a dummy dataset.


Hi Ueliwechsler,

The multi_normal_cholesky distribution is actually pretty heavily optimised, so there’s not much more to be done there.

Some generic advice for large-dimensional models, especially those involving matrix multiplications and Cholesky decompositions, is to enable GPU support and use map_rect for parallelisation. Enabling GPU support is a little easier and just requires some setup (easy-to-follow guide here). Using map_rect can be a bit more complex, and requires modifying the actual Stan code (more information in the User's Guide).
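As a rough sketch of what map_rect could look like here (the shard layout, and the packing of the data into x_r/x_i, are made up for illustration; phi packs mu followed by the entries of the Cholesky factor, column-major):

	functions {
		// shard function: returns the summed log-likelihood of the days
		// assigned to this shard
		vector shard_ll(vector phi, vector theta, data real[] x_r, data int[] x_i) {
			int n = x_i[1];  // dimension of one observation vector
			int d = x_i[2];  // number of days in this shard
			vector[n] mu = phi[1:n];
			matrix[n, n] LL = to_matrix(phi[(n+1):(n+n*n)], n, n);
			real lp = 0;
			for (t in 1:d)
				lp += multi_normal_cholesky_lpdf(to_vector(x_r[((t-1)*n+1):(t*n)]) | mu, LL);
			return [lp]';
		}
	}

and then in the model block (with theta, x_r and x_i prepared in transformed data, not shown):

	target += sum(map_rect(shard_ll, append_row(mu_vec, to_vector(LL)), theta, x_r, x_i));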

There are some things you can try with your model parameterisation that could improve the speed as well. At the moment, all of your parameters (aside from L) are using uniform prior distributions. Depending on the geometry of the resulting posteriors, these could be difficult for Stan to explore. I’d suggest trying a prior with a non-centered parameterisation (more info here).
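For example, a hypothetical non-centered version of a (the hyperparameters mu_a and tau_a and their priors are made up here, and the original [0, 10] bounds are dropped for the sketch):

	parameters {
		real mu_a;               // hypothetical population location
		real<lower=0> tau_a;     // hypothetical population scale
		vector[N_markets] a_raw; // standardised offsets
	}
	transformed parameters {
		vector[N_markets] a = mu_a + tau_a * a_raw; // implies a ~ normal(mu_a, tau_a)
	}
	model {
		mu_a ~ normal(0, 5);
		tau_a ~ normal(0, 2.5);
		a_raw ~ std_normal();
	}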

Using a prior for sigma could be especially helpful, as the current prior assumes all values from 0 to infinity are equally likely.
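For instance, a half-normal prior (the scale of 1 is just a placeholder; the existing lower=0 bound on sigma makes it half-normal):

	model {
		sigma ~ normal(0, 1); // half-normal, given vector<lower=0>[N_markets] sigma;
	}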

For this multiplication:

	Z*multiply_lower_tri_self_transpose(L)*Z'

Stan has optimised functions for some matrix products; this could be specified as:

	quad_form_sym(multiply_lower_tri_self_transpose(L), Z')

Note the transpose operator ' on Z, since quad_form_sym(A, B) computes B'*A*B.

I’d also suggest switching the order of these statements:

	LL = cholesky_decompose(add_diag(Z*multiply_lower_tri_self_transpose(L)*Z', sigma2_vec)/252.0);

	L ~ lkj_corr_cholesky(nu);

I’m not 100% sure it will make a difference, but the current parameterisation implies that you use the parameter L before you specify a prior for it.
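That is, something along the lines of:

	L ~ lkj_corr_cholesky(nu);

	LL = cholesky_decompose(add_diag(Z*multiply_lower_tri_self_transpose(L)*Z', sigma2_vec)/252.0);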

Hope some of that helps!


Thank you very much for your help and the suggestions.

We tried running the code on a GPU and reformulating the expressions according to your proposal.
Unfortunately, this did not improve the performance of the Stan model significantly.

We ended up restructuring the entire model to avoid computations with one huge Cholesky matrix. This sequential definition of the model did the trick for us: the computation now takes around one day instead of weeks.