Is it possible to run Bayesian hierarchical model with 10million observations?

Sorry if the question looks dumb. But I’ve been struggling with this for a while: our dataset has 12 million observations, and we are trying Bayesian hierarchical models (and longitudinal), using brms. We have access to high-performance computing, but after a few tests I realize this may still take a ton of time if it runs at all. Should I give up? Any suggestion is appreciated!

It is certainly possible, but it can take quite a while. A few comments:

  • There are various tricks when writing Stan code to speed things up. Many are implemented in brms, but depending on the specifics of your data and desired model it is possible that fiddling around in the Stan code itself could provide substantial additional speedup. Likewise, some tasks are especially amenable to computing on the GPU, if you have one (or several) available. If you can post your call to brms and some example code, you might get some help.
  • If your models are relatively simple, then with so much data it is likely that the parameters are identified with very high precision. In this regime, the benefits of exact (up to MCMC error) Bayesian sampling with Stan might be small compared to approximate algorithms that, depending on the details of the implementation, can run much faster (e.g. variational inference in Stan, integrated nested Laplace approximation, or even gasp maximum-likelihood methods).
  • If your models are very complex, such that you actually need millions of data to recover informative parameter estimates, then Stan is likely to be among the fastest general-purpose software to fit your model. Just to help inform your expectations, I’ve recently fit a model with 250K parameters and a non-standard, non-vectorizable likelihood to 2M data points. Doing so on a cluster with 5 cores per chain takes this model about a week.

Thanks! This is very helpful, but apparently, I need to learn a lot more to fully understand. Below is the call in brms for our full model. The DV is a binary variable, 1/0, stress1 and stress2 are time variant predictors, also binary. Year, month, day are categorical. Nothing is centered.

full_model ← brm(data = data,
family = bernoulli(link = “logit”),
formula = success~ 1 + (1+ stress1 + stress2 | person_id) + (1 + stress1 + stress2 | firm_id) + stress1 + stress2 + year + month + day,
prior = prior,
iter = 2000, warmup = 1000, chains = 4)

Does this qualify as a very complicated model? We have access to high performance computing (including GPU), but I need to have some understanding about the resource need first (e.g. how many GPU? How many processors? Memory?) and if any possibility to speed it up from coding aspect. Would centering/vectorization help? Sorry if I’m asking the wrong question!

BTW, when I fit this to a 1% random subset of the data, it took 6.7 hours, with one core per chain.

Any advice is highly appreciated!

I think brms is using both vectorization and centering, you can check the former typing stancode(fit) and checking whether a for loop is involved for the target += part. For that matter, fit can by the way be a model with one iter = 1.

An idea is to change to non-centered mode if you have a lot of data for each group (i.e. stress - person_id combination), which is actually recommended (Diagnosing Biased Inference with Divergences) - this could be done within brms with bf(formula, family, center = FALSE)

@jsocolar and @gkreil are right. I would also add that you may save yourself a lot of trouble by first testing the model against a smaller subset of the data (e.g. just a couple years and firms, or even only a single year/month and eliminate the year/month predictors altogether). This will let you catch bugs/misfit to data much faster. Also, using simple random effects to model time trends might be problematic and autoregressive/random walk/splines/Gaussian processes might be more suitable (although you have a lot of data, so this is less of a worry)

Best of luck with your model!

1 Like

There might be another option depending on the structure of your data. If you have a small number of unique rows in your data (relative to 10 million) then you can speed up the evaluation of the likelihood function. The year + month + day might make this impossible though. Do you know how many unique rows you have?

How big are the person_id and firm_id random effects vectors?

Thanks! I’ll try the non-centered mode to see if any speed gain. And definitely need to get some Stan code training…

Thanks! Indeed, I’ve been testing the model with like, 1%, or 2% random sample of individuals, which still get me 100,000 or 200,000 observations. We actually only have two years, slightly over 1000 firms, but >20,000 people and then daily observation…that’s how the data just blow up…I also tried to build the model from the most parsimonious (intercept only) to this full model. With 3000 iterations, convergence and effective size and estimates all seem good. I’d love to know more about autoregressive/random walk/splines/Gaussian processes, can you point me to more resources? Thanks!

Almost 4 million unique rows. Does this sound small? Thanks!

If I understand you correctly, we have over 20,000 person_id, and slightly over 1000 firm_id, so that’s the size of random effect vectors. I mean each id is associated with one random effect estimate, right? Sorry I’m new to these models!

Unfortunately that’s far too many to get any meaningful speedup with what I had in mind. The idea was that the sum

\sum_{i=1}^N \log\bigg( \frac{1}{1 + e^{-(X\beta)_i}}\bigg)

shows up in the evaluation of the log likelihood, or something close to it. If N is really large, but there are relatively few unique terms in the sum, then you can speed up the evaluation of that sum.

For Gaussian processes, stuff by Aki is great: Bayesian workflow book - Birthdays has work specifically on time trends on several scales while Gaussian process demonstration with Stan is a nice intro.

For others I don’t have a great recommendation and you’ll have to Google yourself :-)

brms has some support for splines (s), gaussian processes (gp) and auto-regressive models (ar, arma).

Best of luck!

How many data points do you have for a typical person_id. Obviously the mean is about 10M/20K, but what’s the median, max, and min?

Thanks. Sorry I was caught up in another project.
So data points for person_id:
min = 12, max = 1250, median = 440
and yes mean is about 550.

An update is, I’m running it on a cluster computer, but it seems no matter how many cores or memories I request, they are not very helpful with the computing speed.

The parallelization built into brms does not scale well to models with large numbers of parameters (you’ve got over 20K parameters). If you’re up to the challenge, it is possible to delve into the raw Stan code to rewrite the parallelization in reduce_sum to slice over the parameter vector rather than over the data (@paul.buerkner correct me if I’m wrong and brms already does this), which might enable better scaling to large numbers of cores. But even this won’t necessarily bring you a big speedup, and it might be a lot of work to implement.

Also, it is very probable that you’re best off using centered parameterizations for random effect levels with thousands of observations. On the other hand, it might be preferable to use non-centered parameterizations for levels with tens of observations. Again, if you want to dive into the Stan code, you could use the centered parameterization for some of the levels and the non-centered parameterization for others. But again, this might represent a big investment of time/work with no guarantee that it will yield meaningful improvement.

Thanks! This is what I was expecting. Unfortunately I don’t have that much time to spend on this, though I’d love. Right now I’m testing with small randomly sampled subsets of the data (100k observations each), and hopefully with a number of them we’ll be able to draw some conclusions. The annoying thing is although linear mixed model or generalized linear mixed model could work reasonably fast with the data, they don’t converge well. But I guess with this amount of data, the results should be not too much different?

Sounds good. Do think carefully about whether random samples should be constructed by leaving out rows or leaving out entire levels of the random effects. Also, it might be useful to see what you can achieve with variational inference on the full model using brms. If you are able to subset the dataset, do exact inference, and recover posterior estimates that are very similar to what you get using variational inference on the full model, then I think you’d have a solid argument for using the variational posterior directly.

Thank you! This sounds promising, though I need to read into variational inference. Not sure about leaving out entire levels of the random effects–we kind of want to compare which level is more important.

1 Like

I don’t think it’s possible currently.

  1. 10M observations is a big deal even for a non-bayesian regression. One has to carefully design the algorithm to utilize parallelization.
  2. Don’t forget that Bayesian estimation use samples to do inference. If there are too many parameters, dimension of parameter space will be very high and sample efficiency may be bad.
  3. I doubt the necessity of Bayesian regressions if you have so many observations. Due to the asymptotic consistency, you should get similar results for Bayesian and ML estimations.