Large multilevel dataset with 48 million rows: How to build data subsets for use in brm

bzimmer · June 28, 2022, 1:06pm

I have a large dataset with about 48 million rows, which I want to use for the prediction of land surface temperature (LST) based on a number of environmental variables using the R package brms. LST was derived from satellite data of 39 multi-year Landsat scenes in my study area of about 2000 km². The environmental predictors include land use class, canopy closure and others, some of which are static while others vary with time. Without considering interactions, I have 12 predictors. For model fitting, I use the index-variable approach. That is, each level of the categorical predictor land use class receives its own intercept.
The data are grouped both in space and time. The grouping in space is reflected in a considerable spatial structure of LST. Therefore, I created a categorical variable “group” that reflects land use clusters such as “field A”, “field B”… or “forest stand A”, “forest stand B” etc. The reasoning behind this is that the groups carry characteristics that are not described by the predictors but likely influence LST (for instance, the farmer of field A plows regularly while the farmer of field B practices no-till agriculture). The grouping in time is due to the repeated measurements of LST (up to 39 Landsat scenes). I therefore assigned an ID to each LST pixel, so each ID occurs up to 39 times in the dataset. However, due to clouds, many LST pixels have fewer repeated observations. In general, the data are very unbalanced due to the observational nature of the study.
In order to pass the hierarchical data structure to the brm function, each LST pixel (ID) and each group is given its own intercept. The general formula in brm is:

LST ~ 0 + predictor_1 (land use class) + predictor_2 + … + predictor_n + (0 + pred_1 || group/ID)

I use group/ID because the IDs are nested within the groups.
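For readers who want to see the formula spelled out, here is a minimal sketch in R. The column names (`land_use`, `canopy`, `group`, `ID`) and the data object `lst_subset` are hypothetical stand-ins, and only two of the 12 predictors are shown. In brms, the `group/ID` shorthand expands to `group + group:ID`, and `||` drops the correlations between the group-level effects.

```r
# Hypothetical column names; the real model has 12 predictors.
f <- LST ~ 0 + land_use + canopy + (0 + land_use || group/ID)

# The formula is a plain R object, so it can be inspected without brms.
vars_in_formula <- all.vars(f)
print(vars_in_formula)

# With brms installed, the fit would look roughly like this (not run here;
# `lst_subset` is an assumed data frame holding one data subset):
# library(brms)
# fit <- brm(f, data = lst_subset, family = gaussian())
```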

Because the entire dataset is too big for brm, I have to build data subsets. So far I have selected a subset with one ID per group, which already gives reasonable results. Other options would be simple random sampling from the whole dataset or stratified random sampling with the groups as strata. In each case, I could create many subsets and use brm_multiple to combine the results.
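A rough sketch of the "one ID per group" subsetting, repeated to produce several subsets, could look like the following. The toy data, the column names and the helper `draw_subset()` are assumptions made for illustration only; the real data are unbalanced, whereas this toy example is balanced.

```r
set.seed(1)

# Toy stand-in for the real data: 3 groups, 4 pixels per group,
# 5 scenes per pixel. All names are hypothetical.
dat <- expand.grid(scene = 1:5, pixel = 1:4,
                   group = c("field_A", "field_B", "forest_A"),
                   stringsAsFactors = FALSE)
dat$ID  <- paste(dat$group, dat$pixel, sep = "_")
dat$LST <- rnorm(nrow(dat), mean = 20, sd = 3)

# Draw one pixel ID per group and keep all repeated observations of it.
draw_subset <- function(d) {
  ids_by_group <- split(d$ID, d$group)
  keep <- vapply(ids_by_group,
                 function(ids) sample(unique(ids), 1),
                 character(1))
  d[d$ID %in% keep, , drop = FALSE]
}

# Ten independent subsets, stored as a list of data frames.
subsets <- replicate(10, draw_subset(dat), simplify = FALSE)

# brm_multiple() accepts a list of data frames and by default pools the
# posterior draws of the sub-models (not run; `my_formula` is a placeholder):
# fits <- brms::brm_multiple(my_formula, data = subsets)
```

Note that pooling draws across subsets is not the same as fitting the full dataset, so the spread between sub-models is worth checking before relying on the combined result.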

Does anybody have experience with data sampling prior to the use of brm (sampling design, sample size and the like) and can give me some tips or references? Is there an option for data sampling within the model fitting procedure that I have overlooked so far?

Operating System: Ubuntu
brms Version: 2.16.1

Ara_Winter · August 9, 2022, 7:22pm

I am not sure if this is exactly what you are looking for, but you might be able to use some post-stratification methods.

bzimmer · September 13, 2022, 8:39am

Thank you for your suggestion. I applied post-stratification with the combination of land-use class and the variable “group” as strata. I randomly selected one or two IDs per stratum (each LST pixel was observed several times and received a unique ID). This step reduced both the dataset size and the spatial autocorrelation of LST.
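The selection step described above could be sketched roughly as follows. This is only an illustration under assumed names: the toy data, the columns `land_use`, `group` and `ID`, and the setting `ids_per_stratum` are hypothetical, not taken from the actual analysis.

```r
set.seed(42)

# Toy data: each group belongs to one land-use class (names hypothetical).
groups <- data.frame(group    = c("field_A", "field_B", "forest_A"),
                     land_use = c("crop", "crop", "forest"),
                     stringsAsFactors = FALSE)
obs <- expand.grid(scene = 1:3, pixel = 1:5, group = groups$group,
                   stringsAsFactors = FALSE)
dat <- merge(obs, groups, by = "group")
dat$ID <- paste(dat$group, dat$pixel, sep = "_")

# Strata are the observed combinations of land-use class and group.
dat$stratum <- interaction(dat$land_use, dat$group, drop = TRUE)

# Randomly keep up to `ids_per_stratum` pixel IDs in each stratum,
# together with all of their repeated observations.
ids_per_stratum <- 2
keep <- unlist(lapply(split(dat$ID, dat$stratum), function(ids) {
  u <- unique(ids)
  sample(u, min(ids_per_stratum, length(u)))
}))
sub <- dat[dat$ID %in% keep, , drop = FALSE]
```

Using `min(ids_per_stratum, length(u))` keeps the sketch from failing on strata that contain fewer pixels than requested, which is likely in unbalanced data.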
