Problem using cov_exp_quad with large matrices solving Log-Gaussian Cox Processes

Hi, I am dealing with estimating the intensity map of the brain using Log-Gaussian Cox Processes.
I get bad_alloc error while calculating covariance matrices, with 64GB of RAM.
I guess it is from intensive calculation using cov_exp_quad for all the voxels.

model {

    { 
    matrix[V, V] cov[2];
    matrix[V, V] L_cov[2];
    matrix[V, K] betas;
 
    // covariance matrix
    // for beta0
    cov[1] =   cov_exp_quad(brain_mask, sigma[1], rho[1])
                         + diag_matrix(rep_vector(square(noise[1]), V));
    L_cov[1] = cholesky_decompose(cov[1]);
    // for beta1
    cov[2] =   cov_exp_quad(brain_mask, sigma[1], rho[1])
                         + diag_matrix(rep_vector(square(noise[2]), V));                    
    L_cov[2] = cholesky_decompose(cov[2]);
    
    // betas are later used for calculaing lambda
    betas[,1] = mu[1] + L_cov[1]*eta[1];
    betas[,2] = mu[2] + L_cov[2]*eta[2];
    }
  
    ...

  }

The total number of voxels are now 48846, from 4mm isotropic brain mask.
The number has already been reduced from 352328.

Is there any workaround to get things done?
Would it be a better idea to start using GPUs?

Thank you in advance for all the comments.

Hi and welcome to the Stan Discourse,

GPUs will not help you here as you still need to allocate everything in the CPU memory first.

If I understand correct, V=48846. Which then means that a [V, V] matrix is 48k x 48k. Such a matrix that would store values and adjoints (8+8 bytes per matrix element) would be roughly 34GB in size. And you have at least 4 such matrices here. So you will not be able to fit those in 64GB.

Just out of curiousity, are the matrices sparse?

Hi, @rok_cesnovar. Thanks for the comment.

Yes, the size of the matrix is 48k x 48k.

brain mask in cov_exp_quad is an array of voxel locations in the voxel space. It’s already given and distances between voxels are calculated. So I don’t think the matrix is sparse.

If memory matters, would 256 gigs of RAM work?
I hope there’s another way for more efficient processing that reduces RAM usage.

Memory is definitely the first limiting factor here. Its not the intensive calculation of cov_exp_quad, its the fact that the output of cov_exp_quad is the large matrix and that is also where a lot of the resources are allocated on the autodiff stack.

If we look at what is allocated quickly.
A single cov_exp_quad (see rev implementation here) with V=48846 allocates 17.7GB for the dist_ array (array of doubles), allocates 35GB of varis for cov_lower_ and also stores 8.8GB of pointers to the varis. There is also some other extra stuff but that is negligible. That would sum up to 61,5GB of allocated resources for a single cov_exp_quad.

The cholesky decompose for V=48846 then allocates 35GB of extra varis and roughly 17GB of pointers. So that is an extra 52GB. You have 2 of each so that sums up to almost 230GB just for those. So I am not sure 256GB is going to be enough here. And you have to know that the Stan Math library already does a lot of optimizations to save space everywhere it can, so you wont get any further with a general autodiff approach.

If you have the option of using cloud instances, I would advise you to try an AWS High Memory cloud instance with TBs of memory to at least see what is the maximum required memory here. You can rent those for a few $ an hour and I cant imagine you would need more than an hour or two to get it to start sampling. It probably wont be feasible for the entire sampling as I suspect this is going to take quite some time with these sizes. Just to get a feel of what are the required memory restrictions.

Other options require extensive C++ knowledge and knowledge of the autodiff Stan Math library and require an additional effort. You can potentially calculate the gradient manually for your specific model and implement it in C++. You would save memory space for some of the intermediate calculations there. The other option is to rework the autodiff to use floats instead of doubles. You save half the space for varis and doubles there, but you have a lot of potential issues there with precision with cholesky decompose and even the sampler itself.

2 Likes

Thanks for the valuable comment.

With those restrictions, even if I figure out how much memories does the model need,
it might not be feasible to do the sampling as you mentioned.

I will try to think about changing the whole model structure.

Thank you again!

1 Like

Hi, @rok_cesnovar. I just have a quick question in mind.

Would it be possible to reduce RAM usage, for example,
write a custom c++ function

  1. that grabs mu, eta, brain_mask, sigma, rho as inputs and
  2. compute internally other method than cov_exp_quad to calculate the covariance function (such as discrete fft)
  3. return beta

I have no knowledge about autodiff algorithm so I do not know whether I need to provide a gradient.

Thanks in advance.
Minho

Hi @mshin,

yes, creating a custom C++ function is probably the way to go to reduce the memory consumption.

One way is to create a C++ function that uses other ways (more memory friendly) to calculate the covariance function. If all the functions you use already have their gradients written in Stan Math, then you dont have to actually do much more than that. If not all have that, you will need to provide the gradient calculations for those missing gradients.

The most memory effective way is to create a custom function and also provide a gradient for that function without relying on autodiff.

Writing gradients requires a bit of advanced knowledge of C++ but not really all that much. In essence you need to be able to calculate the function value and the gradient in C++. For starters juts create the function value function foo() and the gradient function foo_grad() in basic C++ without thinking about autodiff. Once you have that most of your work is done.

You then need to add that in Stan Math so that the autodiff “knows” how to connect foo and foo_grad together. There are instruction on how to do that on the wiki: https://github.com/stan-dev/math/wiki/Adding-a-new-function-with-known-gradients
But also feel free to also ask questions here. Happy to help.

2 Likes

Hi, @rok_cesnovar . Thanks for the quick reply.

I will try to implement the function myself and will update the progress :)

1 Like

I noticed this only now. For this size of GP, I would recommend to use GP specific software like GPy, GPFlow or GPyTorch. Even if the memory would not be a problem each Cholesky would take so much time that you don’t want to sample hyperparameters. GP specific frameworks allow faster approximation of GP, non-MCMC approximate integration over the latent values, and are designed specifically to save also memory in GP computations.

If you are using a GP as part of the model which is difficult to implement in these frameworks, but would be easy to implement in Stan, you could consider using Gaussian Markov random field model instead, and then in a few months when we have sparse matrix support this might be feasible in Stan.

2 Likes

Hi, @avehtari. Thanks for the comment.

Thank you for the recommendation. I will sure do try it.

Do you mean that if I wait for sparse matrix support, then I can fit the GP in Stan? and until then, I can try GMRF?

Once again, thank you for the helpful comment!
Minho

Sparse matrix support is not useful for GPs represented with covariance matrices. Sparse matrix support makes GMRFs much faster, and for the problem size you have you need much faster. You could use GMRF as a spatial prior instead of GP. As you have Cox process, you want to also faster integration over the latent values and you could eventually use Laplace method, which is also coming in the future.

Thank you for the comment,
then I can try GPflow stuff to get approximate answers and then wait for future release of Stan.

Actually, what I was trying while having computing problems with LGCPs was Sparse Latent Factor Analysis, to reduce the amount of computation.

Rather than ~50k voxels, I set ~500 3D Gaussian Kernels as bases functions and estimate factors as linear combination of bases functions. To reduce the number of factors, I tried to implement sparse-inducing priors for factor-loadings matrix such as as horseshoe priors, but I eventually ended up implementing Dirichlet-Laplace(DL) priors like applied in here.
And I am still having difficulties, may be derived from identifiability I guess…?

Is there any advice on implementing sparse-inducing prior for factor-loadings matrix?
I already opened such questions, but still not answered yet, like this and this.

Thank you for the valuable comments!
You already answered many of my questions that I had struggling with for a long time.
Minho

Sorry, I don’t have advice on sparse-inducing prior for factor-loadings matrix. I can just assume that this is a challenging topic.

Hi, @avehtari. Thank you again for the comment.

I already really appreciate your help for the problem!
I could have at least one way out to solve this problem (with GPstuff), thanks to you.
With factor analysis, I will try to follow others’ works on factor-loadings .

Hope you have a great day!
Minho

1 Like