Speedups for Wishart model

Hello,

I have a spatial population genetic clustering method in which samples are assigned to, or admixed between, a user-specified number of discrete groups within which relatedness decays with distance. Each group is modeled as a spatial Gaussian process, so the full model is a mixture of spatial Gaussian processes. The data can consist of anywhere from tens of thousands to millions of independent observations, so, for efficiency, I summarize them with the observed covariance between samples across observations, and model that covariance as a draw from a Wishart with a parametric scale matrix (detailed in the model block below). Implemented in Stan, the model works great for a smallish number of samples (fewer than ~100), but it's quite slow for larger datasets.

I have two questions:

  1. Are there any low-hanging-fruit speedups that you can see, either in reworking the syntax or in reparameterizations? I know the Stan crew is generally anti-Wishart as a prior, but, because there are so many observations (potentially into the millions), I think the Wishart is more efficient than the multivariate Gaussian for calculating the likelihood.

  2. Are there ways to implement rank-one updates to the inverse of the Wishart's parametric scale matrix in Stan? For some parameters, if I were to update only that parameter, I could use the Sherman-Morrison-Woodbury trick to update a single row/column of the inverse matrix (the identity is written out below). I know the point of Stan's sampler is to move in many directions at once in parameter space, but I'm curious whether there's any way to work in this kind of efficiency (or whether it's already happening under the hood).
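
     For reference, the rank-one identity in question is the Sherman-Morrison formula

     (A + u v')^{-1} = A^{-1} - (A^{-1} u v' A^{-1}) / (1 + v' A^{-1} u),

     which turns an O(N^3) refactorization into an O(N^2) update when only a single row/column of parCov changes.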

Thanks, I appreciate the feedback!

functions {
	matrix spCov(int N, real aD, matrix D, real mu){
		matrix[N,N] cov;
		for(i in 1:N){
			for(j in i:N){
				cov[i,j] = exp( -(aD* D[i,j])) + mu;
				cov[j,i] = cov[i,j];
			}
		}
		return cov;
	}
	matrix admixed_covariance(int N, int K, vector alphaD, matrix geoDist, matrix w_mat, vector nugget, vector mu, real gamma) {
		matrix[N,N] parCov;
		matrix[N,N] Nug_mat;
		parCov = rep_matrix(0,N,N);
		Nug_mat = diag_matrix(nugget);
		for(k in 1:K){
			parCov = parCov + tcrossprod(to_matrix(w_mat[,k])) .* spCov(N,alphaD[k],geoDist,mu[k]);
		}
		parCov = gamma + parCov + Nug_mat;
		return parCov;	
	}
	matrix make_w_matrix(int N, int K, vector[] w){
		matrix[N,K] w_mat;
		for(i in 1:N){
			w_mat[i] = to_row_vector(w[i]);
		}
		return w_mat;
	}	
}
data {
	int<lower=1> K;		  				// number of layers
	int<lower=2> N; 	  				        // number of samples
	int<lower=N+1> L;	    			                // number of loci
	matrix[N,N] obsCov; 				        // observed sample covariance
	matrix[N, N] geoDist; 				        // matrix of pairwise geographic distance
	real varMeanFreqs;					        // variance in mean allele frequencies across loci
	real<lower=0,upper=1> temp;			// temperature parameter for estimating marginal likelihood	
}
parameters {
	vector<lower=0>[K] alphaD;			// effect of geographic distance in the parametric covariance in layer k
	positive_ordered[K] mu;				// shared drift effect in layer k
  	vector<lower=0>[N] nugget; 			// sample-specific variance (allele sampling error + sample-specific drift)
	simplex[K]    w[N];    				// every sample (N in total) has a K simplex (i.e. K layers)
	real<lower=0> gamma;				// baseline covariance between all layers
}
transformed parameters {
	matrix[N,N] parCov;					// this specifies the parametric, admixed covariance matrix
	vector[K] dirConPar;				        // concentration parameters of dirichlet prior on w
	matrix[N,K] w_mat;					        // converting w simplex to matrix for indexing
	dirConPar = rep_vector(0.1,K);
	w_mat = make_w_matrix(N,K,w);
	parCov = admixed_covariance(N, K, alphaD, geoDist, w_mat, nugget, mu, gamma);
}
model {
	alphaD ~ normal(0,1);							// prior on alphaD
	nugget ~ normal(0,1);							// prior on nugget
	mu ~ normal(0,1);								// prior on mu
	gamma ~ normal(varMeanFreqs,0.5);				// prior on gamma
	for(i in 1:N) w[i] ~ dirichlet(dirConPar);				// prior on admixture proportions
	(L*obsCov) ~ wishart(L,parCov);					// likelihood function	
}

data {
  int<lower=1> K;
  int<lower=K+2> nu;
  matrix[K, K] B; // scale matrix
}
transformed data {
  matrix[K, K] L;
  matrix[K, K] eye;
  matrix[K, K] L_inv;
  
  L = cholesky_decompose(B);  
  for (j in 1:K) {
    for (i in 1:K) {
      eye[i, j] = 0.0;
    }
    eye[j, j] = 1.0;
  }
  L_inv = mdivide_left_tri_low(L, eye);
}

parameters {
  vector<lower=0>[K] c;
  vector[K * (K - 1) / 2] z;
}

transformed parameters {
  matrix[K, K] A;
  matrix[K, K] A_inv_L_inv;
  {
    int count;
    count = 1;
    for (j in 1:(K - 1)) {
      for (i in (j + 1):K) {
        A[i, j] = z[count];        // below-diagonal entries are standard normal
        count = count + 1;
      }
      for (i in 1:(j - 1)) {
        A[i, j] = 0.0;             // above-diagonal entries are zero
      }
      A[j, j] = sqrt(c[j]);        // diagonal entries are sqrt of chi-square draws
    }
    for (i in 1:(K - 1))
      A[i, K] = 0;
    A[K, K] = sqrt(c[K]);
    A_inv_L_inv = mdivide_left_tri_low(A, L_inv);
  }
}

model {
  for (i in 1:K)
    c[i] ~ chi_square(nu - i + 1);
  z ~ normal(0, 1); 
  // implies: crossprod(A_inv_L_inv) ~
  // inv_wishart(nu, L_inv' * L_inv)
}

Try the Bartlett decomposition of the Wishart.

Thanks @Andre_Pfeuffer! Because we're treating the scale matrix as a parameter to be estimated, I think the parameterization you suggest won't offer any speedups, because we won't be able to make use of the precomputation.

Right now, I have to take the inverse of parCov every iteration. As I understand it, if I were to use the Bartlett decomposition, every iteration I'd have to calculate the matrix A as:

A = cholesky_decompose( quad_form( obsCov, inverse(cholesky_decompose(parCov)') ) )

and then I could model the diagonal of A as chi-square and the off-diagonals as standard normals. So I'd still have to do a lot of costly computations every iteration.

The inverse of a Cholesky factor via back-substitution is cheap, and Stan supports this. Take a look at the Stan function
mdivide_left_tri_low
and also at: https://math.stackexchange.com/questions/2174316/quadratic-forms-and-cholesky-factorization
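
As a minimal sketch (x, L_par, and M are illustrative names, not from the model above): because inverse(parCov) = inverse(L_par)' * inverse(L_par) when parCov = L_par * L_par', a quadratic form against inverse(parCov) needs only one triangular solve, never a full inverse:

	matrix[N, N] L_par = cholesky_decompose(parCov);
	matrix[N, N] M = mdivide_left_tri_low(L_par, x);	// solves L_par * M = x
	// then quad_form(inverse(parCov), x) == crossprod(M)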

Stan doesn't need the conjugate, so you could try the Wishart distribution. If you really want the inverse Wishart, there is also a trick around the inverse.

There's also a way to map it directly to normal(0, 1), but this requires a Jacobian adjustment, which is simple for a diagonal matrix.
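
To illustrate just the Jacobian mechanics (a sketch, not necessarily the exact mapping meant above): if the prior is placed via d = sqrt(c), the diagonal of A in the Bartlett program, the change of variables c = d^2 is diagonal, so the adjustment is a single sum:

parameters {
  vector<lower=0>[K] d;                 // diagonal of A, so c = square(d)
}
model {
  for (i in 1:K)
    square(d[i]) ~ chi_square(nu - i + 1);
  target += sum(log(2 * d));            // log |dc/dd| summed over the diagonal
}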

The best thing to do would be to exploit conjugacy directly and avoid evaluating the normal altogether.

You should put dirConPar in transformed data because it's a constant. I'd also recommend using compound declare-define, e.g., matrix[N, K] w_mat = make_w_matrix(N, K, w);.

It'd be even more efficient if you could work at the level of a Cholesky factor. The Jacobian looks daunting for that, but it'd be the way to go for raw speed.

Is there a way to define the Wishart density directly on the Cholesky factor? I keep forgetting to ask @bgoodri about this.

https://en.wikipedia.org/wiki/Wishart_distribution

X = L * A * A^T * L^T = L * A * (L * A)^T

And from:
https://en.wikipedia.org/wiki/Cholesky_decomposition


"The Cholesky decomposition is unique when A is positive definite; there is only one lower triangular matrix L with strictly positive diagonal entries such that A = LL*. However, the decomposition need not be unique when A is positive semidefinite."

So L * A (first citation) gives us the Cholesky decomposition of X: L * A is lower triangular (a product of lower-triangular matrices), and the factorization is unique when X is symmetric positive definite.

The uniqueness of the Cholesky factor (with positive diagonals) is why we can use it as the unconstrained representation of a positive-definite matrix (type cov_matrix in Stan).

My question is whether we can implement wishart_cholesky(eta, L_V), where L_V is the Cholesky factor of a positive-definite matrix, with efficient derivatives. Ideally, the outcome would itself be a Cholesky factor, but it could also be a whole positive-definite matrix. So that'd be

wishart_cholesky(cov_matrix X, real eta, cholesky_cov L_V)

wishart_cholesky(cholesky_cov L_X, real eta, cholesky_cov L_V)

To do that efficiently, we need to be able to express the Wishart in terms of Cholesky factors and to compute the derivatives efficiently.
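
For what it's worth, the pieces do reduce to Cholesky terms. With X = L_X * L_X' and V = L_V * L_V',

log det(X) = 2 * sum(log(diagonal(L_X)))
trace(inverse(V) * X) = sum of squared entries of mdivide_left_tri_low(L_V, L_X)

so the kernel ((nu - K - 1) / 2) * log det(X) - (1 / 2) * trace(inverse(V) * X) needs only triangular solves, with no explicit inverses.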

Inverse Wishart would also be useful.

V is your scale matrix, and L is the lower Cholesky factor of V. The scale matrix is usually constant; it is the a priori assumption about the "covariance"/DF factor. The eta, in LKJ terms, would be a scale on the diagonal elements of V in the Wishart. In Stan, eta is mostly also chosen a priori.

The inverse Wishart requires an inverse of the scale matrix. https://en.wikipedia.org/wiki/Inverse-Wishart_distribution
The problem was that the Bartlett decomposition left me with an upper triangular matrix. I solved this case
by applying a permutation matrix to the scale matrix beforehand and then doing a rearrangement in Stan instead of applying the permutation matrix again; see also: https://math.stackexchange.com/questions/1372166/a-ul-factorization
I did this a year and a half back and still have the code, but it needs some checking. Before somebody starts to laugh at the approach, I had better not post it yet.

I agree that the typical use case has eta and L_V as data (double values) and X as a parameter (stan::math::var, the autodiff type).

We can parameterize however is convenient (scale matrix or inverse scale matrix) analogously to how we parameterize the multivariate normal.

Also, if we have this compound:

data {
  cov_matrix[K] V;
  real<lower = 0> nu;
  ...
parameters {
  cov_matrix[K] Sigma;
model {
  Sigma ~ inv_wishart(nu, V);  // or V_inv if that's better
  y ~ multi_normal(mu, Sigma);

isn't the posterior for Sigma also inverse Wishart? If so, can we just code an inverse Wishart for the posterior directly and avoid having to compute the inverse Wishart prior and multivariate normal likelihood?

Bob,

https://en.wikipedia.org/wiki/Inverse-Wishart_distribution
see the subsection on the conjugate distribution. I hope that is what you mean.
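
For a known mean mu, that subsection gives: if y_1, ..., y_n ~ multi_normal(mu, Sigma) and Sigma ~ inv_wishart(nu, V), then

Sigma | y ~ inv_wishart(nu + n, V + sum_i (y_i - mu) * (y_i - mu)')

so the posterior is indeed inverse Wishart and could be coded as that single term.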

There are some nice variants of IW here:
https://arxiv.org/pdf/1408.4050.pdf

Yes.

Rather than a scaled inverse Wishart, we tend to scale an LKJ prior on the correlation matrix. In practice, that's done with a Cholesky factor for the correlation matrix, which is more efficient and numerically stable. There's no conjugacy to exploit there.
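
The usual pattern looks something like this (a minimal sketch with illustrative names; the LKJ shape 2 and the cauchy scale are arbitrary choices):

parameters {
  cholesky_factor_corr[K] L_Omega;   // Cholesky factor of the correlation matrix
  vector<lower=0>[K] tau;            // per-dimension scales
}
model {
  L_Omega ~ lkj_corr_cholesky(2);
  tau ~ cauchy(0, 2.5);
  y ~ multi_normal_cholesky(mu, diag_pre_multiply(tau, L_Omega));
}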

What I'd like to be able to do is have everything done with Cholesky factors for the Wishart cases, if possible. Something like

data {
  cov_matrix[K] Psi;
  ...
parameters {
  cholesky_factor_cov[K] L;

model {
  mu ~ multi_normal(...);
  L ~ cholesky_inv_wishart(nu, Psi);
  y ~ multi_normal_cholesky(mu, L);

Then, given that everything's conjugate, it'd be nice to have an even more efficient expression for L and mu in the posterior that just involves a single Wishart term for L exploiting conjugacy and a single term for mu, also exploiting conjugacy. I think that'd be possible. I'm not sure how to write the function down, since I'm not that good with these distributions. But the point is that you don't need to write the joint log density; you can write anything proportional to the posterior log density.

"is more efficient and numerically stable" holds only for small dimensions.
I experienced log(0) errors in Stan using dimensions around 50 for LKJ. See also Ignacio Alvarez,
Bayesian Inference for a Covariance Matrix:
"In high dimensions (d > 10), the LKJ prior concentrates mass near zero correlations and therefore we do not use it here." (Section 2.4 of the link above)

The Wishart/inverse Wishart, and especially the Bartlett decomposition of the Wishart, is really fast to sample.
I'm not against LKJ; it's great for low dimensions and smaller models.

"What I'd like to be able to do is have everything done with Cholesky factors for the Wishart cases, if possible. Something like"

I have to think about this for some time and read some papers.

The normal-inverse-Wishart distribution is defined here:

https://math.stackexchange.com/questions/2469310/computing-the-normalisation-of-posterior-normal-inverse-wishart

details here:

http://www.kamperh.com/notes/kamper_bayesgmm13.pdf

For Stan, using the Cholesky factor is always more efficient and stable. If you declare a variable as a covariance matrix, as in

cov_matrix[K] Sigma;

then it is represented under the hood as a Cholesky factor (with log diagonals) in the unconstrained parameter space. It is then multiplied by its transpose to get Sigma, and the log Jacobian of the product and the exponentiation is added to the target. Then it gets factored, and the derivative of that is calculated.
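
If I remember the reference manual right, the log absolute Jacobian determinant of that transform (Sigma = L * L' with unconstrained log-diagonals x_1, ..., x_K) works out to

K * log(2) + sum over k of (K - k + 2) * x_k

which is cheap next to the factorizations themselves.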

For the reparameterization of the inverse Wishart distribution, how is the Cholesky factor of the inverse Wishart distribution with scale V passed to multi_normal_cholesky? Assuming I use the reparameterization suggested by the Stan reference,

Would it be:

multi_normal_cholesky(mu, A_inv_L_inv);

Thank you.

I don't think we have built-ins for any of that. Our multi-normal on the Cholesky scale takes a covariance matrix, not a precision matrix.

There may be a clever way to go from the Cholesky factor of the covariance to the Cholesky factor of the precision. It'd be better if we just had a multi-normal Cholesky precision distribution. You could write that one yourself, but there's still a log determinant term that I also think we have not specialized to Cholesky factors.
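
A minimal sketch of writing it yourself (illustrative, not a built-in; L_prec is the lower Cholesky factor of the precision matrix, Omega = L_prec * L_prec', which is what makes the log determinant term trivial):

functions {
  real multi_normal_cholesky_prec_lpdf(vector y, vector mu, matrix L_prec) {
    int K = num_elements(y);
    vector[K] z = L_prec' * (y - mu);     // dot_self(z) = (y - mu)' * Omega * (y - mu)
    return -0.5 * K * log(2 * pi())
           + sum(log(diagonal(L_prec)))   // = 0.5 * log det(Omega)
           - 0.5 * dot_self(z);
  }
}

It would be used as y ~ multi_normal_cholesky_prec(mu, L_prec), with no matrix inverse or extra factorization needed.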

@bgoodri and @anon75146577 and @rtrangucci know much more about this than me (so do a lot of other people, but the three listed above are actively working on building out our general matrix libs efficiently).

It got merged

although you said that you were going to generalize the testing framework in light of it. In any event, it is not exposed to the Stan language in any released version of Stan yet, but it is just algebra and index fiddling.

That still needs to be done, but in the interim, we can add the function. I assigned myself: