So I’m having an issue with stan_lmer. I’m trying to run a model with about 200k observations and around 435 random intercepts. The exact model is of the form
stan_lmer(y ~ X + (1|id))
The model shows no progress even after hours. I’ve tried running the model on subsets of the data: samples of 1,000, 5,000, or 10,000 run pretty quickly, but once I get above 10,000 it chugs.
For something that simple yet with such large data, you probably want to use cmdstanr and write the model directly:
```stan
data {
  int<lower=1> nXY ;                    // number of observations in X/Y
  vector[nXY] Y ;                       // Y (outcome) observations
  vector[nXY] X ;                       // X (covariate) observations
  int<lower=1> nZ ;                     // number of levels in Z
  array[nXY] int<lower=1,upper=nZ> Z ;  // Z-label for each outcome in Y
  // centered: binary toggle for centered/non-centered intercept parameterization
  int<lower=0,upper=1> centered ;
}
parameters {
  real<lower=0> noise ;  // measurement-noise scale
  real betaX ;           // effect of the covariate X
  real mu ;              // intercept
  real<lower=0> sigma ;  // scale of variability among levels of Z
  vector<
    offset = (centered ? 0 : mu)
    , multiplier = (centered ? 1 : sigma)
  >[nZ] Zval ;           // value for each level of Z
}
model {
  // Priors (all placeholders; must be changed to reflect domain expertise)
  noise ~ weibull(2, 1) ;
  betaX ~ std_normal() ;
  mu ~ std_normal() ;
  sigma ~ std_normal() ;
  // Hierarchical prior for Zval:
  Zval ~ normal(mu, sigma) ;
  // Likelihood (can be sped up via reduce_sum if lots of cores are available!)
  Y ~ normal(Zval[Z] + betaX * X, noise) ;
}
```
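Fitting this from R with cmdstanr looks roughly like the following. This is a sketch: the file name `model.stan`, the data frame `df`, and its column names `y`, `x`, and `id` are illustrative stand-ins, not from the original post.

```r
library(cmdstanr)

# Compile the model (assumes the Stan code above is saved as "model.stan")
mod <- cmdstan_model("model.stan")

# Assemble data to match the model's data block
dat <- list(
  nXY = nrow(df),
  Y = df$y,
  X = df$x,
  nZ = length(unique(df$id)),
  Z = as.integer(as.factor(df$id)),
  centered = 1  # with many observations per level, centered often samples better
)

fit <- mod$sample(data = dat, chains = 4, parallel_chains = 4)
```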
As noted in the final comment, if you have lots of CPU cores available, you can parallelize the final likelihood computation using the reduce_sum framework.
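For reference, a reduce_sum version of that likelihood might look something like this (a sketch: the helper name `partial_ll`, the index array `obs_idx`, and the grainsize of 1000 are illustrative choices, not from the original post):

```stan
functions {
  // log-likelihood contribution from a slice of the observation indices
  real partial_ll(array[] int idx, int start, int end,
                  vector Y, vector X, array[] int Z,
                  vector Zval, real betaX, real noise) {
    return normal_lupdf(Y[idx] | Zval[Z[idx]] + betaX * X[idx], noise);
  }
}
transformed data {
  // sequence 1..nXY that reduce_sum will slice across threads
  array[nXY] int obs_idx;
  for (n in 1:nXY) obs_idx[n] = n;
}
model {
  // ... priors as before ...
  target += reduce_sum(partial_ll, obs_idx, 1000,
                       Y, X, Z, Zval, betaX, noise);
}
```

To actually get multi-threading, the model must be compiled with threading enabled (with cmdstanr, `cpp_options = list(stan_threads = TRUE)`) and sampled with `threads_per_chain` set.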
At the moment brms is likely to scale to larger datasets a little better than rstanarm, especially if you use the cmdstanr backend.
Additionally, with the cmdstanr backend you can use within-chain parallelisation via the threads argument. Also, if you have a discrete GPU in your system, you can use it to accelerate your models.
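Concretely, those options look something like this (a sketch: the formula and data names are placeholders, and the OpenCL platform/device IDs depend on your system):

```r
library(brms)

fit <- brm(
  y ~ x + (1 | id), data = df,
  backend = "cmdstanr",
  chains = 4, cores = 4,
  threads = threading(2),   # within-chain parallelisation, 2 threads per chain
  opencl = opencl(c(0, 0))  # optional: GPU acceleration via OpenCL (platform 0, device 0)
)
```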
I haven’t noticed much difference between brms with cmdstanr vs rstan, but I have found using threads=threading(2) tends to slow down my brm(y ~ a + b + c + (1|id), chains=4, cores=4) sampling by a factor of 2-3 on a 12-physical-core Ryzen.
Is this thread the best reference on using GPU parallelism?
What is the main difference between writing that model manually and using brms to generate one, such that manually writing it would make it perform faster?
Nothing directly, but you get more control (e.g., for the OP’s data, they probably want a centered parameterization, and I’m not sure that brms makes it easy to switch from non-centered, which I presume is its default). I’m probably just biased by knowing raw Stan well and brms not well.
Has anyone gotten a model like this to fit in reasonable time with n>10k (or >100k) in practice (coded directly in Stan or otherwise)? It seems like it would take days or weeks, at least with the techniques I’ve tried so far (namely cores=4, threads=threading(2), and a non-centered parameterization).
That’s about the scale of a dataset I’m currently working on, except I’m doing:
x ~ (a+b+c+d+e+f+g)^3 + (1+(a+b+c+d+e+f+g)^3 | q | id)
y ~ (a+b+c+d+e+f+g)^3 + (1+(a+b+c+d+e+f+g)^3 | q | id)
z ~ (a+b+c+d+e+f+g)^3 + (1+(a+b+c+d+e+f+g)^3 | q | id)
But aside from some over-regularization of the correlations that I’m still working on, it at least fits in a few hours. Mind you, all my variables are dichotomous and so benefit greatly from the reduced-redundant-computation trick I linked above.
Oh, and I guess it helps that while the raw number of observations is in the 30k+ range, there are only 60 levels of id.