Improving the efficiency of fitting a model using stan_lmer

Hi everyone. I’m a new rstanarm user, trying to fit the following to estimate the gender wage gap:

stan_lmer(formula = income ~ male + seniority + ( 1 + male + seniority | job), data = data)

The dataset contains ~150K observations over nearly a decade, across ~1600 jobs, among which just 400 jobs contain more than 90% of the observations, and many contain only a few, often even a single observation.

Upon running the chunk, I receive the following:

SAMPLING FOR MODEL ‘continuous’ NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.083617 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 836.17 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)

Leaving it running for more than half a day, it still doesn’t get beyond that
“Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)” stage, so I’m hoping to receive advice on whether I may configure something differently to fit the model (at all, not to have high expectations about how long it may take).

Any additional opinions on whether to stick to the default weakly informative priors and, if not, how to begin thinking about setting the appropriate priors will also be much appreciated.

Thank you for any help and guidance,
T

  • Operating System: OS X 10.15.1
  • rstanarm Version: binary: 2.19.2 source: 2.19.3

I would first simulate some fake data that is also smaller than the actually data set. Check to see if the model recovers the known fake parameters. You can also see how long it takes on a few 100 to few thousand data points.

After that you can play around with priors. For instance in the literature what is the impact of being male on salary? Is it positive, negative, or unknown? If it’s slight positive but sometimes near 0 you might set a prior of normal(0.5, 0.5).

2 Likes

What units is income in? People usually do such models on log-income.

Thank you for the suggestion.

Income is left in dollars, because the lmer fit of the model with the logarithmically transformed income results in a violation in the assumption of normally distributed residuals, and leaving income “unlogged” still results in easily interpretable output.

Wen I try fitting
stan_lmer(formula = log(income) ~ male + seniority + ( 1 + male + seniority | job), data = data),
I get:

Error in if (a == Inf) b else a : missing value where TRUE/FALSE needed
|6.|prior_scale_for_aux %ORifINF% 0

Thank you for the suggestions. On a random subsample of 5000 observations, the stan_lmer output comes quite close to the average slope and s.e. estimates of the original lmer run. Not so close for the run on 1000 and 100 observations but, more importantly, in all of those cases, there is at least 1 divergent transition after warmup (more with a larger subsample), and increasing adapt_delta does not make the problem vanish. The pairs plot does not seem to produce results even on the stan_lmer fit on 100 observations.

If you feel up to delving into coding in Stan directly, you might get a speed up by using the trick posted here. The mapping from that example’s terminology to yours is subject==job, you’ll want to use the formula “~ male+seniority” for the W matrix and “~1” for the B matrix.

I can’t recall if stan_lmer defaults to a centered or non-centered parameterization, but the example uses non-centered. You might also check if your W matrix has much correlation, and if so use the QR reparameterization from the manual (which I do think stan_lmer applies by default).

2 Likes