Speedups for Wishart model

bburd · October 18, 2017, 3:36pm

Hello,

I have a spatial population genetic clustering method in which samples are assigned to, or admixed between, a user-specified number of discrete groups within which relatedness decays with distance. Each group is modeled as a spatial Gaussian process, so the full model is a mixture of spatial Gaussian processes. The data can consist of between tens of thousands to millions of independent observations, so, for efficiency, I summarize with the observed covariance between samples across observations, and model that covariance as a draw from a Wishart with a parametric scale matrix (detailed in the model block below). Implemented in STAN, the model works great for a smallish number of samples (fewer than ~100), but it’s quite slow for larger datasets.

I have two questions:

Are there any low-hanging fruit speedups that you can see, either in reworking the syntax or reparameterizations? I know the STAN crew is generally anti-Wishart as a prior, but, because there are so many observations (potentially into the millions), I’m think that the Wishart is more efficient than the multivariate Gaussian for calculating the likelihood.
Are there ways to implement rank-one updates to the inverse of the Wishart’s parametric scale matrix in STAN? For some parameters, if I were to update only that parameter, I could use the Sherman-Morrison-Woodbury trick to update a single row/column of the inverse matrix. I know the point of STAN’s sampler is to move in many directions at once in parameter space, but I’m curious if there’s any way to work in this kind of efficiency (or if it’s already happening under the hood).

Thanks, I appreciate the feedback!

functions {
	matrix spCov(int N, real aD, matrix D, real mu){
		matrix[N,N] cov;
		for(i in 1:N){
			for(j in i:N){
				cov[i,j] = exp( -(aD* D[i,j])) + mu;
				cov[j,i] = cov[i,j];
			}
		}
		return cov;
	}
	matrix admixed_covariance(int N, int K, vector alphaD, matrix geoDist, matrix w_mat, vector nugget, vector mu, real gamma) {
		matrix[N,N] parCov;
		matrix[N,N] Nug_mat;
		parCov = rep_matrix(0,N,N);
		Nug_mat = diag_matrix(nugget);
		for(k in 1:K){
			parCov = parCov + tcrossprod(to_matrix(w_mat[,k])) .* spCov(N,alphaD[k],,geoDist,mu[k]);
		}
		parCov = gamma + parCov + Nug_mat;
		return parCov;	
	}
	matrix make_w_matrix(int N, int K, vector[] w){
		matrix[N,K] w_mat;
		for(i in 1:N){
			w_mat[i] = to_row_vector(w[i]);
		}
		return w_mat;
	}	
}
data {
	int<lower=1> K;		  				// number of layers
	int<lower=2> N; 	  				        // number of samples
	int<lower=N+1> L;	    			                // number of loci
	matrix[N,N] obsCov; 				        // observed sample covariance
	matrix[N, N] geoDist; 				        // matrix of pairwise geographic distance
	real varMeanFreqs;					        // variance in mean allele frequencies across loci
	real<lower=0,upper=1> temp;			// temperature parameter for estimating marginal likelihood	
}
parameters {
	vector<lower=0>[K] alphaD;			// effect of geographic distance in the parametric covariance in layer k
	positive_ordered[K] mu;				// shared drift effect in layer k
  	vector<lower=0>[N] nugget; 			// sample-specific variance (allele sampling error + sample-specific drift)
	simplex[K]    w[N];    				// every sample (N in total) has a K simplex (i.e. K layers)
	real<lower=0> gamma;				// baseline covariance between all layers
}
transformed parameters {
	matrix[N,N] parCov;					// this specifies the parametric, admixed covariance matrix
	vector[K] dirConPar;				        // concentration parameters of dirichlet prior on w
	matrix[N,K] w_mat;					        // converting w simplex to matrix for indexing
	dirConPar = rep_vector(0.1,K);
	w_mat = make_w_matrix(N,K,w);
	parCov = admixed_covariance(N, K, alphaD, geoDist, w_mat, nugget, mu, gamma);
}
model {
	alphaD ~ normal(0,1);							// prior on alphaD
	nugget ~ normal(0,1);							// prior on nugget
	mu ~ normal(0,1);								// prior on nugget
	gamma ~ normal(varMeanFreqs,0.5);				// prior on nugget
	for(i in 1:N) w[i] ~ dirichlet(dirConPar);				// prior on admixture proportions
	(L*obsCov) ~ wishart(L,parCov);					// likelihood function	
}

Andre_Pfeuffer · October 21, 2017, 3:14pm

data {
  int<lower=1> K;
  int<lower=K+2> nu;
  matrix[K, K] B; // scale matrix
}
transformed data {
  matrix[K, K] L;
  matrix[K, K] eye;
  matrix[K, K] L_inv;
  
  L = cholesky_decompose(B);  
  for (j in 1:K) {
for (i in 1:K) {
  eye[i, j] = 0.0;
}
eye[j, j] = 1.0;
  }
  L_inv = mdivide_left_tri_low(L, eye);
}

parameters {
  vector<lower=0>[K] c;
  vector[K * (K - 1) / 2] z;
}

transformed parameters {
  matrix[K, K] A;
  matrix[K, K] A_inv_L_inv;
  {
int count;
count = 1;
for(j in 1:(K-1)) {
  for(i in (j+1):K) {
    A[i, j] = z[count];
    count = count + 1;
  }
  for(i in 1:(j - 1)) {
    A[i, j] = 0.0;
  }
  A[j, j] = sqrt(c[j]);
}
for(i in 1:(K-1))
  A[i, K] = 0;
A[K, K] = sqrt(c[K]);
A_inv_L_inv = mdivide_left_tri_low(A, L_inv);
  }
}

model {
  for(i in 1:K)
c[i] ~ chi_square(nu - i + 1);
  z ~ normal(0, 1); 
  // implies: crossprod(A_inv_L_inv) ~
  // inv_wishart(nu, L_inv' * L_inv)
}

Try the Barlett Wishart decomposition.

bburd · October 23, 2017, 6:06pm

Thanks @Andre_Pfeuffer! Because we’re treating the scale matrix as a parameter to be estimated, I think the parameterization you suggest won’t offer any speedups, because we won’t be able to make use of the precomputation.

Right now, I have to take the inverse of parCov every iteration. As I understand it, if I were to use the Bartlett decomposition, every iteration I’d have to calculate the matrix A as:

A = cholesky_decompose( quad_form( obsCov, inverse(cholesky_decompose(parCov)’ ) ) )

and then I could model the diagonal of A as chisq and the off-diagonals as standard normals. So, I’d still have to do a lot of costly computations every iteration.

Andre_Pfeuffer · October 24, 2017, 3:48am

The inverse of a cholesky with backward solving is cheap. Stan supports this. Take a look at Stan function
mdivide_left_tri_low
Further take a look at: https://math.stackexchange.com/questions/2174316/quadratic-forms-and-cholesky-factorization

Stan doesn’t need the conjugate, so you could try the wishart distribution. If you really want the inverse-wishart, there is also a trick around the inverse.

There’s also a way to directly map it to normal(0, 1), but this requires the jacobian adjustment, which is simple for a diagonal matrix.

Bob_Carpenter · November 2, 2017, 9:26pm

The best thing to do would be to exploit conjugacy directly and avoid evaluating the normal altogether.

You should put this in transformed data becaue it’s a constant. I’d also recommend using compound declare-define, e.g…, matrix[N, K] w_mat = make_w_matrix(N, K, w);.

It’d be even more efficient if you could work at the level of a Cholesky factor. The Jacobian looks daunting for that, but it’d be the way to go for underlying speed.

Bob_Carpenter · November 2, 2017, 9:28pm

Is there a way that can be used to define the Wishart density directly on the Cholesky factor? I keep forgetting to ask @bgoodri about this.

Andre_Pfeuffer · November 3, 2017, 3:11am

https://en.wikipedia.org/wiki/Wishart_distribution

X = L * A * A^T * L^T = L * A * (L * A)^T

And from:
https://en.wikipedia.org/wiki/Cholesky_decomposition

“The Cholesky decomposition is unique when A is positive definite; there is only one lower triangular matrix L with strictly positive diagonal entries such that A = LL*. However, the decomposition need not be unique when A is positive semidefinite.”

So, L * A (1st citation) gives us the Cholesky decomposition of X. (LA is triangular, and uniqueness, if sym.psd)

Bob_Carpenter · November 3, 2017, 1:52pm

Uniqueness of Cholesky (with positive diagonals) is why we can use it for the unconstrained representation of positive-definite matrix (type cov_matrix in Stan).

My question is whether we can implement wishart_cholesky(eta, L_V) where L_V is Cholesky factor of a positive-definite matrix with efficient derivatives. Ideally, the outcome would itself be a Cholesky factor, but it could also be a whole positive-definite matrix. So that’d be

wishart_cholesky(cov_matrix X, real eta, cholesky_cov L_V)

wishart_cholesky(cholesky_cov L_X, real eta, cholesky_cov L_V)

To do that efficiently, we need to be able to express the Wishart in terms of Cholesky factors and efficiently compute derivatives.

Inverse Wishart would also be useful.

Andre_Pfeuffer · November 3, 2017, 3:02pm

V is your scale matrix, and L is the lower cholesky decomposition matrix of V. The scale matrix is usually constant. It is the a-prior assumption to the “covariance”/DF factor. The eta in terms of LKJ would be a scale to the diagonal elements of V in wishart. eta in Stan is mostly also chosen a-priori.

The inverse-Wishart requires an inverse of the scale matrix. https://en.wikipedia.org/wiki/Inverse-Wishart_distribution
The problem was the barlett-decomposition and I ended up with an upper triangular matrix. I solved this case
by applying a permutation matrix to the scale matrix beforehand and then a rearrangement in Stan instead of applying the permutation matrix again, see also: https://math.stackexchange.com/questions/1372166/a-ul-factorization
I did this 1 1/2 years back, still have the code though, but this needs some checking. Before somebody start to laugh about this approach, I better not post.

Bob_Carpenter · November 3, 2017, 3:21pm

I agree that the typical use case has eta and L_V as data (double values) and X as a parameter (stan::math::var—the autodiff type).

We can parameterize however is convenient (scale matrix or inverse scale matrix) analogously to how we parameterize the multivariate normal.

Also, if we have this compound:

data {
  cov_matrix[K, K] V;
  real<lower = 0> nu;
  ...
parameters {
  cov_matrix[K] Sigma;
model {
  Sigma ~ inv_wishart(nu, V);  // or V_inv if that's better
  y ~ multi_normal(mu, Sigma);

isn’t the posterior for Sigma also inverse Wishart? If so, can we just code an inverse Wishart for the posterior directly and avoid having to compute the inverse Wishart prior and multivariate normal likelihood?

Andre_Pfeuffer · November 3, 2017, 4:19pm

Bob,

https://en.wikipedia.org/wiki/Inverse-Wishart_distribution
see subsection conjungate distribution. I hope you mean that.

There are some nice variants of IW here:
https://arxiv.org/pdf/1408.4050.pdf

Bob_Carpenter · November 5, 2017, 11:23pm

Yes.

Bob_Carpenter · November 5, 2017, 11:30pm

Rather than scaled inverse wishart, we tend to scale an LKJ prior on the correlation matrix. In practice, that’s done with a Cholesky factor for the correlation matrix, which is more efficient and numerically stable. There’s no conjugacy to exploit there.

What I’d like to be able to do is have everything done with Cholesky factors for the Wishart cases if possible. Something like

data {
  cov_matrix[K] Psi;
  ...
parameters {
  cholesky_factor_cov[K] L;

model {
  mu ~ multi_normal(...);
  L ~ cholesky_inv_wishart(nu, Psi);
  y ~ multi_normal_cholesky(mu, L);

Then, given that everything’s conjugate, it’d be nice to have an even more efficient expression for L and mu in the posterior that just involves a single Wishart term for L exploiting conjugacy and a single term for mu also exploiting conjugacy. I think that’d be possible. Not sure how to write the function down since I’m not that good with these distributions. But the point is that you don’t need to write the joint log density, you can write anything proportional to the posterior log density.

Andre_Pfeuffer · November 6, 2017, 2:33am

“is more efficient and numerically stable”, only for small dimensions.
I experienced a log(0) errors of Stan, using dimension around 50 for LKJ. See also: Ignacio Alvarez,
Bayesian Inference for a Covariance Matrix:
“In high dimensions (d > 10), the LKJ prior concentrates mass near zero correlations and therefore we do not use it here.” (2.4) link above

The Wishart/InvWishart, especially the Barlett decomp. of Wishart is really fast to sample.
I’m not against LKJ, its great for low dimensions and smaller models.

“What I’d like to be able to do is have everything done with Cholesky factors for the Wishart cases if possible. Something like”

I have to think about some time and read some papers.

Andre_Pfeuffer · November 7, 2017, 2:45am

The Normal Inverse Wishart Distribution is defined here:

https://math.stackexchange.com/questions/2469310/computing-the-normalisation-of-posterior-normal-inverse-wishart

details here:

http://www.kamperh.com/notes/kamper_bayesgmm13.pdf

Bob_Carpenter · November 7, 2017, 7:16pm

For Stan, using the Cholesky factor is always more efficient and stable. If you declare a variable as a covariance matrix, as in

cov_matrix[K] Sigma;

then it is represented as a Cholesky factor (with log diagonals) under the hood in unconstrained parameter space. Then it’s multiplied by its transpose to get Sigma and the log Jacobian of the product and exponentiation is added to the target. Then it gets factored and the derivative of that is calculated.

goodmattg · December 23, 2017, 4:37am

For the reparameterization of the inverse Wishart distribution, how is the cholesky factor of the inverse Wishart distribution with scale V passed to multi_normal_cholesky? Assuming I use the reparameterization suggested by the Stan reference,

Would it be:

multi_normal_cholesky(mu, A_inv_L_inv);

Thank you.

Bob_Carpenter · December 26, 2017, 4:42pm

I don’t think we have built-ins for any of that. Our multi-normal on the Cholesky scale takes a covariance matrix, not a precision matrix.

There may be a clever way to go from Cholesky factor of covariance to Cholesky factor of precision. It’d be better if we just had a multi-normal Cholesky precision distribution. You could write that one yourself, but there’s still a log determinant term that I also think we have not specialized to Cholesky factors.

@bgoodri and @anon75146577 and @rtrangucci know much more about this than me (so do a lot of other people, but the three listed above are actively working on building out our general matrix libs efficiently).

bgoodri · December 26, 2017, 7:07pm

It got merged

github.com/stan-dev/math

Feature/chol2inv

stan-dev:develop ← stan-dev:feature/chol2inv

opened 07:55PM - 18 Oct 17 UTC

bgoodri

+107 -0

#### Submission Checklist - [X] Run unit tests: `./runTests.py test/unit` - …[X] Run cpplint: `make cpplint` - [X] Declare copyright holder and open-source license: see below #### Summary: Introduce a function that inputs a matrix's Cholesky factor and outputs the matrix's inverse. #### Intended Effect: Fixes #648 #### How to Verify: Run the tests #### Side Effects: None #### Documentation: Will have to be added in a pull request for Stan #### Copyright and Licensing Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Trustees of Columbia University By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses: - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause) - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

although you said that you were going to generalize the testing framework in light of it. In any event, it is not exposed to the Stan language in any released version of Stan yet, but it is just algebra and index fiddling.

Bob_Carpenter · December 26, 2017, 9:50pm

That still needs to be done, but in the interim, we can add the function. I assigned myself:

Topic		Replies	Views
Mixture model running slow Modeling	2	991	September 19, 2018
Speeding up multivariate normal model Modeling techniques	7	2980	September 9, 2017
Simple Hierarchical Spatial Inference Model Modeling hierarchical-model	7	686	January 20, 2021
[Case-study preview] Speeding up Stan by reducing redundant computation Publicity performance	8	1927	June 6, 2020
Very slow MCMC sampling in stan_jm() due to large number of individuals Modeling fitting-issues , performance , mixed-model , rstanarm , paralellization	6	988	April 30, 2021

Speedups for Wishart model

Related Topics