Distance matrices for very large datasets

I am seeking to use your brms package to evaluate geographical data. I am planning to use a Euclidean distance matrix as a covariate. The big problem is that my dataset has more than 3 million rows, and I can’t find a function that can compute the distance matrix at that scale. Have you come across a package and a function that can handle EXTREMELY large datasets?

Depending on what you’re using the distance matrix for, you could potentially just use the coordinates as two covariates and then use the Hilbert space GPs in brms.
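For concreteness, here is a minimal sketch of what that could look like in brms. The data frame `dat` and the variable names (`occurrence`, `easting`, `northing`) are hypothetical stand-ins; setting `k` in `gp()` is what switches on the approximate (Hilbert space) GP, and `c = 5/4` is the default boundary factor:

```r
library(brms)

# 'dat' is a hypothetical data frame with a binary occurrence column
# and site coordinates. Projected coordinates (e.g. UTM) are
# preferable to raw lon/lat for a Euclidean covariance.
fit <- brm(
  occurrence ~ gp(easting, northing, k = 10, c = 5/4),
  data   = dat,
  family = bernoulli()
)
```

The computational cost then scales with the number of basis functions `k` rather than with the full n-by-n covariance matrix, which is what makes this feasible for millions of rows.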


I have a model with species occurrence records and I need to account for spatial autocorrelation. What would the code look like to use the coordinates as two covariates with the Hilbert space GPs?

Check this out. I’m not super familiar with brms, but I think you just use the formula syntax to set up a GP on your lon/lat (or, probably better, UTM coordinates) and then specify something like k = 10 for the number of basis functions.


As another possibility, the INLA software was designed specifically with large spatial datasets in mind, so it might provide a trick or two to work with your data. I’d agree that modelling the spatial variability directly might be preferable to considering pairwise differences.

With that said, if you really want to work with pairs (e.g. as in BetaBayes—A Bayesian Approach for Comparing Ecological Communities, published in Diversity), you’ll almost definitely need some more specialized big-data tools and/or a computing cluster: with 3 million rows you have ~4.5×10^12 distances (pairs), and with an 8-byte double to store each distance, you’d need ~36,000 GB (36 TB) just to store them.
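A quick back-of-the-envelope check of that storage estimate, as plain R arithmetic:

```r
n     <- 3e6                  # rows in the dataset
pairs <- n * (n - 1) / 2      # unique pairwise distances: ~4.5e12
bytes <- pairs * 8            # one 8-byte double per distance
bytes / 1e9                   # ~36,000 GB, i.e. ~36 TB
```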

Presumably reducing the pairs to some K-nearest neighbour structure would reduce the footprint while staying reasonably accurate.
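As a rough sketch of that idea, using the FNN package’s `get.knn()` (the simulated coordinates here are just a stand-in for the real data, at a smaller size for illustration):

```r
library(FNN)

set.seed(1)
# Stand-in coordinate matrix; the real data would have ~3e6 rows.
coords <- cbind(x = runif(1e5), y = runif(1e5))

# Keep only each point's 10 nearest neighbours: storage drops from
# O(n^2) distances to O(n * k).
nn <- get.knn(coords, k = 10)
dim(nn$nn.index)  # n x 10 matrix of neighbour indices
dim(nn$nn.dist)   # n x 10 matrix of Euclidean distances
```

This kind of sparse neighbour structure is also the building block behind nearest-neighbour GP approximations, so it pairs naturally with the modelling suggestions above.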