Hi, I am dealing with estimating the intensity map of the brain using Log-Gaussian Cox Processes.
I get bad_alloc error while calculating covariance matrices, with 64GB of RAM.
I guess it is from intensive calculation using cov_exp_quad for all the voxels.

model {
{
matrix[V, V] cov[2];
matrix[V, V] L_cov[2];
matrix[V, K] betas;
// covariance matrix
// for beta0
cov[1] = cov_exp_quad(brain_mask, sigma[1], rho[1])
+ diag_matrix(rep_vector(square(noise[1]), V));
L_cov[1] = cholesky_decompose(cov[1]);
// for beta1
cov[2] = cov_exp_quad(brain_mask, sigma[1], rho[1])
+ diag_matrix(rep_vector(square(noise[2]), V));
L_cov[2] = cholesky_decompose(cov[2]);
// betas are later used for calculaing lambda
betas[,1] = mu[1] + L_cov[1]*eta[1];
betas[,2] = mu[2] + L_cov[2]*eta[2];
}
...
}

The total number of voxels are now 48846, from 4mm isotropic brain mask.
The number has already been reduced from 352328.

Is there any workaround to get things done?
Would it be a better idea to start using GPUs?

GPUs will not help you here as you still need to allocate everything in the CPU memory first.

If I understand correct, V=48846. Which then means that a [V, V] matrix is 48k x 48k. Such a matrix that would store values and adjoints (8+8 bytes per matrix element) would be roughly 34GB in size. And you have at least 4 such matrices here. So you will not be able to fit those in 64GB.

brain mask in cov_exp_quad is an array of voxel locations in the voxel space. Itâ€™s already given and distances between voxels are calculated. So I donâ€™t think the matrix is sparse.

If memory matters, would 256 gigs of RAM work?
I hope thereâ€™s another way for more efficient processing that reduces RAM usage.

Memory is definitely the first limiting factor here. Its not the intensive calculation of cov_exp_quad, its the fact that the output of cov_exp_quad is the large matrix and that is also where a lot of the resources are allocated on the autodiff stack.

If we look at what is allocated quickly.
A single cov_exp_quad (see rev implementation here) with V=48846 allocates 17.7GB for the dist_ array (array of doubles), allocates 35GB of varis for cov_lower_ and also stores 8.8GB of pointers to the varis. There is also some other extra stuff but that is negligible. That would sum up to 61,5GB of allocated resources for a single cov_exp_quad.

The cholesky decompose for V=48846 then allocates 35GB of extra varis and roughly 17GB of pointers. So that is an extra 52GB. You have 2 of each so that sums up to almost 230GB just for those. So I am not sure 256GB is going to be enough here. And you have to know that the Stan Math library already does a lot of optimizations to save space everywhere it can, so you wont get any further with a general autodiff approach.

If you have the option of using cloud instances, I would advise you to try an AWS High Memory cloud instance with TBs of memory to at least see what is the maximum required memory here. You can rent those for a few $ an hour and I cant imagine you would need more than an hour or two to get it to start sampling. It probably wont be feasible for the entire sampling as I suspect this is going to take quite some time with these sizes. Just to get a feel of what are the required memory restrictions.

Other options require extensive C++ knowledge and knowledge of the autodiff Stan Math library and require an additional effort. You can potentially calculate the gradient manually for your specific model and implement it in C++. You would save memory space for some of the intermediate calculations there. The other option is to rework the autodiff to use floats instead of doubles. You save half the space for varis and doubles there, but you have a lot of potential issues there with precision with cholesky decompose and even the sampler itself.

yes, creating a custom C++ function is probably the way to go to reduce the memory consumption.

One way is to create a C++ function that uses other ways (more memory friendly) to calculate the covariance function. If all the functions you use already have their gradients written in Stan Math, then you dont have to actually do much more than that. If not all have that, you will need to provide the gradient calculations for those missing gradients.

The most memory effective way is to create a custom function and also provide a gradient for that function without relying on autodiff.

Writing gradients requires a bit of advanced knowledge of C++ but not really all that much. In essence you need to be able to calculate the function value and the gradient in C++. For starters juts create the function value function foo() and the gradient function foo_grad() in basic C++ without thinking about autodiff. Once you have that most of your work is done.

I noticed this only now. For this size of GP, I would recommend to use GP specific software like GPy, GPFlow or GPyTorch. Even if the memory would not be a problem each Cholesky would take so much time that you donâ€™t want to sample hyperparameters. GP specific frameworks allow faster approximation of GP, non-MCMC approximate integration over the latent values, and are designed specifically to save also memory in GP computations.

If you are using a GP as part of the model which is difficult to implement in these frameworks, but would be easy to implement in Stan, you could consider using Gaussian Markov random field model instead, and then in a few months when we have sparse matrix support this might be feasible in Stan.

Sparse matrix support is not useful for GPs represented with covariance matrices. Sparse matrix support makes GMRFs much faster, and for the problem size you have you need much faster. You could use GMRF as a spatial prior instead of GP. As you have Cox process, you want to also faster integration over the latent values and you could eventually use Laplace method, which is also coming in the future.

Thank you for the comment,
then I can try GPflow stuff to get approximate answers and then wait for future release of Stan.

Actually, what I was trying while having computing problems with LGCPs was Sparse Latent Factor Analysis, to reduce the amount of computation.

Rather than ~50k voxels, I set ~500 3D Gaussian Kernels as bases functions and estimate factors as linear combination of bases functions. To reduce the number of factors, I tried to implement sparse-inducing priors for factor-loadings matrix such as as horseshoe priors, but I eventually ended up implementing Dirichlet-Laplace(DL) priors like applied in here.
And I am still having difficulties, may be derived from identifiability I guessâ€¦?

Is there any advice on implementing sparse-inducing prior for factor-loadings matrix?
I already opened such questions, but still not answered yet, like this and this.

Thank you for the valuable comments!
You already answered many of my questions that I had struggling with for a long time.
Minho

I already really appreciate your help for the problem!
I could have at least one way out to solve this problem (with GPstuff), thanks to you.
With factor analysis, I will try to follow othersâ€™ works on factor-loadings .