Large multilevel dataset with 48 million rows: How to build data subsets for use in brm

I have a large dataset with about 48 million rows, which I want to use for the prediction of land surface temperature (LST) based on a number of environmental variables using the R package brms. LST was derived from satellite data of 39 multi-year Landsat scenes in my study area of about 2000 km². The environmental predictors include land use class, canopy closure and others, some of which are static while others vary with time. Without considering interactions, I have 12 predictors. For model fitting, I use the index-variable approach. That is, each level of the categorical predictor land use class receives its own intercept.
The data are grouped both in space and time. The grouping in space is reflected in a considerable spatial structure of LST. Therefore, I created a categorical variable “group” that reflects land use clusters such as “field A”, “field B”… or “forest stand A”, “forest stand B” etc. The reasoning behind this is that the groups have characteristics that are not described by the predictors but likely influence LST (for instance, the farmer of field A plows regularly while the farmer of field B practices no-till agriculture). The grouping in time is due to the repeated measurements of LST (up to 39 Landsat scenes). I therefore assigned an ID to each LST pixel, each of which occurs up to 39 times in the dataset. However, due to clouds, many LST pixels have fewer repeated observations in time. In general, the data are very unbalanced due to the observational nature of the study.
In order to pass the hierarchical data structure to the brm function, each LST pixel (ID) and each group is given its own intercept. The general formula in brm is:

LST ~ 0 + predictor_1 (land use class) + predictor_2 + … + predictor_n + (0 + predictor_1 || group/ID)

I use group/ID because the IDs are nested within the groups.
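
For readers less familiar with the nesting syntax, here is a minimal sketch of how the formula above could be written out in R. The column names `land_use` and `canopy_closure` are hypothetical stand-ins for the actual predictors, and the `brm()` call is left commented out because it needs the real data. In lme4-style syntax, `group/ID` is shorthand for a term for `group` plus a term for `group:ID`.

```r
# Hypothetical column names standing in for the real predictors.
f <- LST ~ 0 + land_use + canopy_closure + (0 + land_use || group/ID)

# The nesting shorthand written out explicitly:
# one term for the groups and one for the pixel IDs within groups.
f_explicit <- LST ~ 0 + land_use + canopy_closure +
  (0 + land_use || group) + (0 + land_use || group:ID)

# Both versions refer to the same set of variables.
vars_short    <- all.vars(f)
vars_explicit <- all.vars(f_explicit)

# Not run here (requires brms and the real data frame, here called lst_data):
# fit <- brms::brm(f, data = lst_data)
```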

Because the entire dataset is too big for brm, I have to build data subsets. So far I have selected a subset with 1 ID per group, which already gives reasonable results. Other options would be simple random sampling of the whole dataset or stratified random sampling with the groups as strata. In each case, I could create many subsets and use brm_multiple to combine the results.
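
To make the "1 ID per group, many subsets" idea concrete, here is a rough sketch in base R. The toy data frame, its column names and the helper `draw_subset()` are made up for illustration, and the sketch assumes that pixel IDs are unique across groups. `brm_multiple()` accepts a list of data frames, so the list built here could be passed to it directly; that call is commented out because it requires brms and a real model formula.

```r
set.seed(1)

# Toy stand-in for the full table (hypothetical column names and values).
lst_data <- data.frame(
  group = rep(c("field_A", "field_B", "forest_A"), times = c(6, 4, 5)),
  ID    = c(1, 1, 1, 2, 2, 3,  4, 4, 5, 5,  6, 6, 7, 8, 8),
  LST   = rnorm(15, mean = 25, sd = 3)
)

# Draw one pixel ID per group and keep all repeated observations of it.
# Assumes IDs are unique across groups.
draw_subset <- function(data) {
  ids_by_group <- split(data$ID, data$group)
  chosen <- vapply(ids_by_group, function(ids) {
    u <- unique(ids)
    u[sample.int(length(u), 1)]  # index-based, safe if a group has one ID
  }, numeric(1))
  data[data$ID %in% chosen, , drop = FALSE]
}

# Several independent subsets, stored as a list of data frames.
subsets <- lapply(1:5, function(i) draw_subset(lst_data))

# Not run here (requires brms; the formula is only a placeholder):
# fits <- brms::brm_multiple(LST ~ 1 + (1 | group), data = subsets)
```

Note that pooling fits from different subsets is not the same as fitting the full dataset, so the spread between subsets is worth inspecting before combining them.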

Does anybody have experience with data sampling prior to the use of brm (sampling design, sample size and the like) and can give me some tips or references? Is there an option for data sampling within the model fitting procedure that I have overlooked so far?

  • Operating System: Ubuntu
  • brms Version: 2.16.1

I am not sure if this is exactly what you are looking for, but you might be able to use some post-stratification methods.

Thank you for your suggestion. I applied post-stratification with the combination of land-use class and the variable “group” as strata. I randomly selected one or two IDs per stratum (each LST pixel was observed several times and received a unique ID). This step reduced both the dataset size and the spatial autocorrelation of LST.
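
In case it helps others, the selection step described above can be sketched in base R roughly as follows. The toy data, the column names and the variable `ids_per_stratum` are illustrative assumptions, not the actual dataset, and IDs are again assumed to be unique across strata.

```r
set.seed(42)

# Toy stand-in for the full table (hypothetical column names and values).
lst_data <- data.frame(
  land_use = rep(c("cropland", "cropland", "forest"), times = c(7, 3, 6)),
  group    = rep(c("field_A", "field_B", "forest_A"), times = c(7, 3, 6)),
  ID       = c(1, 1, 2, 2, 3, 3, 3,  4, 4, 4,  5, 5, 6, 7, 7, 7),
  LST      = rnorm(16, mean = 25, sd = 3)
)

ids_per_stratum <- 2  # upper limit of pixel IDs kept per stratum

# Strata are the observed combinations of land-use class and group.
stratum <- interaction(lst_data$land_use, lst_data$group, drop = TRUE)

# Randomly pick up to ids_per_stratum distinct IDs within each stratum.
chosen <- unlist(lapply(split(lst_data$ID, stratum), function(ids) {
  u <- unique(ids)
  u[sample.int(length(u), min(ids_per_stratum, length(u)))]
}), use.names = FALSE)

# Keep every repeated observation of the selected pixels.
lst_subset <- lst_data[lst_data$ID %in% chosen, ]
```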