brms in Monte Carlo Simulations?

I’m looking to compare brms and lme4 in a Monte Carlo simulation with ~400 different combinations of conditions and 10,000 replications each, so I’m wondering what the best way to do this would be, and whether there are examples. I don’t know if it’s just me, but I’m not having much success with ‘brms Monte Carlo Simulation’ on Google Scholar. I’m also only familiar with Bayesian statistics from afar, not in practice.

My first question: the default brms settings use 4 chains of 2,000 iterations each and discard the first half of each chain as warm-up, which gives 4,000 posterior draws, if I understand correctly. If I get any convergence warnings, do you think I should set control statements and re-run the model from scratch with an increased number of iterations? (Is there a way to continue from where a fit left off?) Should I set a limit on the maximum number of iterations?

  • I got the tip that convergence failure is itself a result, although that still leaves the question of what the default number of iterations should be; otherwise there’d be no difference between running 2 iterations and 2,000.
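To make my first question concrete, here’s roughly the pattern I’m imagining (the formula and data are placeholders, and I’m only guessing that adapt_delta is the control setting to raise):

```r
library(brms)

# The defaults written out explicitly: 4 chains x 2,000 iterations,
# the first 1,000 of each discarded as warm-up, giving 4,000 draws.
fit <- brm(y ~ x + (1 | cluster), data = d,
           chains = 4, iter = 2000, warmup = 1000)

# On a convergence warning: refit with more iterations (and maybe a
# stricter adapt_delta)? This is the part I'm unsure about.
if (any(rhat(fit) > 1.01, na.rm = TRUE)) {
  fit <- update(fit, iter = 4000, warmup = 2000,
                control = list(adapt_delta = 0.95))
}
```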

My second question: I’m pretty sure an informative prior should make all brms estimates superior to lme4’s, but, if I understand correctly, with the default flat prior there shouldn’t be much, if any, difference between lme4 and brms. I’ve also heard that a default flat prior is highly undesirable. Should I use a weakly informative prior, then? An uninformative shrinkage prior? There just seem to be a lot of choices, and given the context, I don’t know if it’s feasible to try all of them.

Thanks,
Michael

I’m not in a good position to answer your first question, but I will confirm that your basic understanding of the brm() defaults for chains, warm-up, and iterations is fine.

As to your second question, I think the answer depends on what you’re trying to do with your simulation. Bayesian models require priors, and any simulation including Bayesian models has to grapple with which kinds of priors you’re interested in exploring. If you’re not sure yet, then I recommend against a simulation study of any kind. Spend some time learning more about Bayesian modeling, instead. Any contemporary introductory Bayesian textbook is fine for this.

Also, it looks like this is your first post. Welcome to the Stan forums, @emstruong!


Thanks for the welcome, Solomon!

I thought about the responses I got from you and others, and I think my initial questions might be mis-framed. It’s probably infeasible to compare frequentist and Bayesian MLM as a whole through lme4 and brms. That’s probably why I can’t find a paper really comparing the two!

However, it’s probably fair to ask, “Given various default settings, how do lme4 and brms compare? How wary should we be of each package’s default settings?”

So, I think, given the data-generating-process, I could vary the number of iterations and what kind of non-informative priors I use on my parameters.

Then the simulation could tell people whose problems are analogous to my data-generating-process context which of the ‘defaults’ I compare might be a good bet for a not-deeply-considered analysis.

Do you think this is fair enough?

That makes sense. With this framing, I think that if and when you get convergence warnings from brm(), you’d want to tally the messages but not refit the models. I’d be very curious what experienced methodologists would have to say on that, though.
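Something like this tallying pattern is what I have in mind (just a sketch; fit_one() is a placeholder for whatever simulates one dataset and fits the brm() model):

```r
library(brms)

# Record diagnostics per replication instead of refitting.
# fit_one() is a placeholder returning a brmsfit for replication rep_id.
tally_one <- function(rep_id) {
  fit <- fit_one(rep_id)
  np  <- nuts_params(fit)   # per-iteration sampler diagnostics
  data.frame(
    rep         = rep_id,
    max_rhat    = max(rhat(fit), na.rm = TRUE),
    divergences = sum(subset(np, Parameter == "divergent__")$Value)
  )
}

results <- do.call(rbind, lapply(1:100, tally_one))
mean(results$max_rhat > 1.01)  # share of replications with R-hat trouble
```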


For future viewers: it probably would’ve been better to search for something along the lines of ‘Stan Monte Carlo Simulation’. Here’s one example that’s analogous to my needs: OSF. They use 2 chains with 1,000 post-warmup posterior samples each.

They do vary which prior is used, since that’s their research question, but there doesn’t seem to be much, if any, tweaking of the other estimation settings.

It depends on what you’re trying to answer: will the model work at all, or will the model work in some allotted time?

You can fix low ESS and high R-hat with more iterations, but more iterations won’t fix things like divergence warnings; those require reparameterization. The basic principle of Monte Carlo methods is that the error in estimates goes down as \mathcal{O}(1 / \sqrt{N}) in the number of iterations N. For MCMC there’s an additional constant factor that depends on the integrated autocorrelation time (IAT).
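Concretely, if \sigma is the posterior standard deviation of the quantity you’re estimating and \tau its IAT, then the Monte Carlo standard error after N draws is roughly

\displaystyle \textrm{MCSE} \approx \frac{\sigma}{\sqrt{N_{\textrm{eff}}}}, \qquad N_{\textrm{eff}} = \frac{N}{\tau},

so a chain with higher autocorrelation needs proportionally more iterations for the same precision.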

What do you mean by “superior” here? If you simulate data according to the model, the Bayesian posterior will recover the parameters with appropriate coverage by construction—that is, for true models, Bayesian posteriors are well calibrated.

lme4 will systematically underestimate uncertainty because it takes a point estimate for the hierarchical prior. So if you estimate posterior interval coverage from simulated data, Bayes will be better. How much better will depend on how well identified the hierarchical parameters are (i.e., how small their posterior standard deviations are). If \alpha is the set of hierarchical parameters, \beta the set of low-level coefficients, and p(\alpha, \beta \mid y) our joint posterior given observed data y, then what lme4 does is roughly to marginalize out the low-level parameters and take a maximum marginal likelihood estimate for the hierarchical parameters,

\alpha^* = \textrm{arg max}_\alpha \ p(\alpha \mid y),

where

\displaystyle p(\alpha \mid y) = \int p(\alpha, \beta \mid y) \, \textrm{d}\beta.

This is sometimes called “empirical Bayes”. Then I believe lme4 and similar systems just plug that in to estimate the lower-level parameters,

\beta^* = \textrm{arg max}_\beta \ p(\alpha^*, \beta \mid y),

whereas the Bayesian model will take estimates

\displaystyle \widehat{\alpha}, \widehat{\beta} = \mathbb{E}[\alpha, \beta \mid y] = \int (\alpha, \beta) \cdot p(\alpha, \beta \mid y) \, \textrm{d}(\alpha, \beta).

This accounts for the uncertainty in estimating both \alpha and \beta jointly; that’s the posterior term p(\alpha, \beta \mid y), which the integral averages over.
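As a toy illustration of the plug-in versus full-Bayes difference (a sketch of my own with made-up settings, not something from your simulation):

```r
library(lme4)
library(brms)

set.seed(1)
J <- 20; n <- 5                        # 20 clusters, 5 observations each
cluster <- rep(seq_len(J), each = n)
u <- rnorm(J, 0, 0.5)                  # cluster effects, sd = 0.5
y <- 1 + u[cluster] + rnorm(J * n)     # intercept 1, residual sd 1
d <- data.frame(y, cluster)

f_lmer <- lmer(y ~ 1 + (1 | cluster), data = d)
f_brm  <- brm(y ~ 1 + (1 | cluster), data = d, refresh = 0)

# lme4 plugs in point estimates of the variance components; brms
# averages over their posterior, so its intervals tend to be wider.
VarCorr(f_lmer)   # plug-in cluster sd
summary(f_brm)    # posterior summary, including sd(Intercept)
```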

That’s reasonable. What you want to search for is the calibration of max marginal likelihood estimation; I don’t know how much has been written on this. The answer is going to depend on how well identified the hierarchical parameters are and on whether you only care about estimation error. lme4 only gives you point estimates and confidence intervals, not Bayesian posteriors; they’re not really comparable because their meanings are different.

This is becoming relevant for us as we’re going to add some approximate marginalization software to Stan soon.

That’s a really good way of putting it; I hadn’t thought of that. But given that it’s a Monte Carlo simulation, and that the general research question is to investigate the best (thoughtless) ‘default’ behavior an applied researcher could adopt in the data-generating-process (DGP) context I’m making up, I think it should be the ‘allotted time’ option.

I think I’d use the general ways of assessing how biased parameter estimates are in a Monte Carlo simulation: relative bias, RMSE from the population value, and so on.
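In code, I’m imagining something as simple as this (with made-up numbers just to show the metrics):

```r
# Standard Monte Carlo summaries of estimation error.
relative_bias <- function(est, truth) (mean(est) - truth) / truth
rmse          <- function(est, truth) sqrt(mean((est - truth)^2))

est   <- rnorm(10000, mean = 0.52, sd = 0.10)  # fake estimates across reps
truth <- 0.50                                  # population value
relative_bias(est, truth)  # ~ 0.04
rmse(est, truth)           # ~ 0.10
```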

Yes, I also think that the Bayesian models should recover the parameters if I provide the correct model specification, but in my particular DGP, I was curious about whether lme4 could also do it, even with correct model specification. I also wasn’t sure if there might be some back-end computational tricks in either lme4 or brms that make assumptions that might be violated by my particular DGP.

For what it’s worth, and hoping that nobody snipes me: I’m interested in what happens when causal effects are confounded by cluster size, i.e., the number of level-1 samples in the cluster. I wasn’t sure what would happen, and it’s unclear to me what terms to search to find a paper that may have done something similar. So I’m treating brms and lme4 as black boxes in my investigation and just seeing whether they’ll be able to control for what is, effectively, the confounding effect of sample size.

I guess this is my dollar store pre-registration.

Just to clarify, do you mean that maximum marginal likelihood estimation is something that lme4 does but brms doesn’t, and that a math-stat comparison of the two would be informed by knowing more about maximum marginal likelihood estimation?

I know they’re different, but does the difference still (really) matter when you’re using a flat prior? (I know the variance parameters are not set to use a flat prior in brms, but other than that?)

Oh, do you mean to say that Stan is going to add ways to make computation go faster by trading off accuracy?

In general, parameter estimates won’t be biased with MCMC if the model follows the data-generating process. There will be residual error due to variance, but the estimates will be unbiased in the sense that they have the right expectation across runs. This does depend on your using a Monte Carlo method that’s correct for the problem.

lme4 computes something different, so the things they’re trying to recover are different. lme4 is a maximum (marginal) likelihood estimator. Aside from not letting you use priors (other than ones you can simulate by adding fake data), the MLE and posterior mean will be different if the marginal distribution for a parameter is not symmetric.
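Here’s a quick toy illustration of the asymmetry point (nothing to do with lme4 internals): take a right-skewed distribution, like the marginal posterior of a variance parameter often is, and compare its maximum with its mean.

```r
# Gamma(2, 1) as a stand-in for a skewed marginal posterior.
shape <- 2; rate <- 1
(shape - 1) / rate   # mode = 1: what a max-based estimate targets
shape / rate         # mean = 2: what the posterior mean targets
```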

We try to be friendly and constructive here, so hopefully no sniping :-).

Right. That’s what I was trying to indicate.

It would be, but I don’t think a deep dive into math stats is necessary. I was just providing a hint for searching for how much is known about the bias introduced by max marginal likelihood estimation. The frequentists care about whether their p-values are calibrated. There is no maximum likelihood estimate for hierarchical models because the likelihood is unbounded (it grows without bound as the hierarchical variance shrinks toward zero around the fitted group effects), so max marginal likelihood, as in lme4, is what’s left.

Yes. lme4 is attempting to estimate confidence intervals, whereas brms is attempting to estimate posterior intervals. A confidence interval tells you how much the estimator varies under different data sets, whereas a Bayesian posterior interval tells you how much the parameter varies in the posterior distribution. There’s a section in Gelman et al.’s Bayesian Data Analysis, as in a lot of Bayesian writing, that goes over situations where you get similar results and where you get different results. Because of the marginalization in lme4, I’m pretty sure lme4 is going to underestimate the width of confidence intervals for the low-level parameters.

I’m not promising anything as no one person is in charge of Stan. But what I meant is that it’s possible to introduce some bias and perhaps lower variance so that you get better expected accuracy per flop/wall-time.

For example, applying shrinkage in frequentist estimation (e.g., lasso or ridge regression) introduces bias relative to the non-shrinkage model, but it improves accuracy in expectation. The same goes for adding priors in Bayes. So there are good reasons for practitioners who want to minimize error to trade a little bias for a lot of variance reduction.
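Here’s a quick simulation of that trade-off (a toy setup of my own; the dimensions and penalty are arbitrary):

```r
set.seed(1)
p <- 20; n <- 50
beta   <- rnorm(p, 0, 0.3)   # smallish true coefficients
lambda <- 5                  # ridge penalty

sq_err <- replicate(500, {
  X <- matrix(rnorm(n * p), n, p)
  y <- X %*% beta + rnorm(n)
  b_ols   <- solve(crossprod(X), crossprod(X, y))
  b_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))
  c(ols = sum((b_ols - beta)^2), ridge = sum((b_ridge - beta)^2))
})
rowMeans(sq_err)  # ridge should come out ahead on expected squared error
```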
