Distance matrices for very large datasets

centenoalvaradodiego · May 3, 2024, 8:04am

I would like to use your brms package to analyse geographical data, with a Euclidean distance matrix as a covariate. The problem is that my dataset has more than 3 million rows, and I can’t find a function that is able to compute the distance matrix. Have you come across a package or function that can handle EXTREMELY large datasets?

mhollanders · May 3, 2024, 11:02am

Depending on what you’re using the distance matrix for, you could potentially just use the coordinates as two covariates and then use the Hilbert space GPs in brms.

centenoalvaradodiego · May 3, 2024, 12:07pm

I have a model with species occurrence records and I need to account for spatial autocorrelation. What would the code look like to use the coordinates as two covariates with the Hilbert space GPs?

mhollanders · May 6, 2024, 2:24am

Check this out. I’m not super familiar with brms, but I think you just use the formula syntax to fit a GP on your lon/lat (or probably better, UTM coordinates?) and then specify k=10 or something for the number of basis functions.

martinmodrak · May 6, 2024, 4:42am

As another possibility, the INLA software was designed specifically with large spatial datasets in mind, so it might provide a trick or two to work with your data. I’d agree that modelling the spatial variability directly might be preferable to considering pairwise differences.

With that said, if you really want to work with pairs (e.g. as in "BetaBayes: A Bayesian Approach for Comparing Ecological Communities", published in Diversity), you’ll almost certainly need more specialized big data tools and/or a computing cluster. With 3 million rows, you have roughly 4.5x10^12 pairwise distances. With an 8 byte double to store each distance, you’ll need ~36 000 GB (about 36 TB) just to store the distances.
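
The figures above can be checked with a few lines of arithmetic. This is a minimal sketch in Python, used here only as a calculator and not as part of a brms workflow. It counts unordered pairs and assumes decimal gigabytes (1 GB = 10^9 bytes).

```python
# Back-of-envelope check of the pair count and storage estimate.
n_rows = 3_000_000
bytes_per_double = 8

# Number of unordered pairs (i < j): n * (n - 1) / 2
n_pairs = n_rows * (n_rows - 1) // 2

# Storage for one double per pair, in decimal gigabytes (assumption: 1 GB = 10**9 bytes)
storage_gb = n_pairs * bytes_per_double / 1e9

print(f"pairs:   {n_pairs:.3e}")         # pairs:   4.500e+12
print(f"storage: {storage_gb:,.0f} GB")  # storage: 36,000 GB
```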

Presumably reducing the pairs to some K-nearest neighbour structure would reduce the footprint while staying reasonably accurate.
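
To illustrate the idea, here is a rough Python sketch of what a K-nearest-neighbour structure stores and how much memory it might need. The function names, K = 10, and the 4-byte neighbour index are assumptions made for illustration and do not come from this thread. The brute-force search is only suitable for a toy example; a dataset of this size would need a spatial index such as a k-d tree.

```python
import math

def k_nearest(points, k):
    """Brute-force K nearest neighbours; fine for a toy example only."""
    neighbours = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, smallest first
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        # Keep only the k closest (distance, index) pairs
        neighbours.append(dists[:k])
    return neighbours

def knn_storage_gb(n_rows, k, bytes_per_distance=8, bytes_per_index=4):
    """Decimal GB needed to store k (distance, index) pairs per row.

    The byte sizes are assumptions: an 8-byte double per distance and
    a 4-byte integer per neighbour index.
    """
    return n_rows * k * (bytes_per_distance + bytes_per_index) / 1e9

# Hypothetical toy coordinates, not real occurrence data
toy = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
nn = k_nearest(toy, k=2)
print(nn[0])  # [(1.0, 1), (2.0, 2)]

# Footprint for 3 million rows with an assumed K of 10
print(knn_storage_gb(3_000_000, k=10))
```

Under these assumptions the neighbour structure for 3 million rows takes about 0.36 GB, compared with roughly 36 000 GB for the full set of pairs.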
