Stan_glmer slow when data variance is really small

andrewgelman · March 27, 2018, 2:41am

Hi, see code below. It ran fine. But then I re-ran setting sigma_y = 0.01 and it ran really slowly, it must have been doing 1024 steps per iteration. I’m guessing the problem was the centered or non-centered parameterization. We should do something about this, no?
Andrew

1. Set up

setwd("/Users/andrew/AndrewFiles/teaching/multilevel_course")
library(“rstan”)
library(“rstanarm”)
library(“arm”)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)

2. Simulate a data structure with N_per_person measurements on each of J people

J <- 50 # number of people in the experiment
N_per_person <- 10 # number of measurements per person
person_id <- rep(1:J, rep(N_per_person, J))
index <- rep(1:N_per_person, J)
time <- (index - 1)/(N_per_person - 1) # time of measurements, from 0 to 1
N <- length(person_id)

Simulate data from a between-person experiment

z <- sample(c(0,1), J, replace=TRUE)
a <- rnorm(J, 0, 1)
b <- rnorm(J, 1, 1) * 0
theta <- 0.1
sigma_y <- 1
y_pred <- a[person_id] + b[person_id]time + thetaz[person_id]*time
y <- rnorm(N, y_pred, sigma_y)

3. Fit model using rstanarm

z_full <- z[person_id]
fake_data_1 <- data.frame(time, person_id, z_full, y)
fit_1 <- stan_glmer(y ~ (1 + time | person_id) + time + z_full:time, data=fake_data_1)

jonah · March 27, 2018, 2:50am

Yeah, currently the only option is the non-centered parameterization, which rarely has divergences with stan_glmer but can definitely be slower when sigma_y -> 0. @bgoodri What do you think about allowing the user to specify that the centered parameterization be used?

andrewgelman · March 27, 2018, 2:54am

or better still if automatic!

jonah · March 27, 2018, 3:07am

Indeed. But I’m not sure where things stand on that. Last time I checked I think we had somewhat arrived at a consensus that we couldn’t find the right point to switch without trying both ways, right?

andrewgelman · March 27, 2018, 3:09am

Trying both ways automatically and then picking the best, that would be an option. We can discuss at Stan meeting, perhaps!

bgoodri · March 27, 2018, 4:36am

I think the example takes less than 20 seconds.

andrewgelman · March 27, 2018, 4:41am

I just ran it on my laptop (running 4 chains in parallel) and it took a lot longer than 20 seconds (after I set sigma_y to 0.01).
A

bgoodri · March 27, 2018, 4:48am

I get 57.5 seconds with sigma_y = 0.1. But the bigger problem is that lme4 formula parser does not output design matrices for the centered parameterization.

andrewgelman · March 27, 2018, 4:50am

Yes, I’d imagine that sigma=0.1 would be a lot faster than sigma=0.01. When I wrote that last email, I re-ran it at 0.01, and so far it’s only done 10% of each of the 4 chains.

bgoodri · March 27, 2018, 5:07am

I misread sigma_y at first, but 0.01 does not seem very realistic.

jonah · March 27, 2018, 6:17pm

Oh right, I forgot about that.

andrewgelman · March 27, 2018, 9:02pm

I agree, not very realistic at all. Still, I don’t like it when the program takes forever in what would seem to be a simple problem.

Soon it seems we’ll have the “run for a fixed amount of wall time” option, and that will help a lot.

Bob_Carpenter · April 1, 2018, 6:32pm

Is this all part of your relax from specified to very tight prior?

Those tightly scaled priors aren’t so good for Stan sampling without rescaling.

Running for fixed wall time won’t help—it’ll run for your patience limit then report that it failed.

In general, we can’t just try all combos of centered and non-centered for every grouped parameter because that’s \mathcal{O}(2^K) options with K groups. I’d guess we wouldn’t really need all \mathcal{O}(2^K) runs to figure it out, though. But this kind of ad-hoc logic inside a model is very difficult to implement and can lead to some very hard-to-maintain and hard-to-modify code.

andrewgelman · April 2, 2018, 1:48am

Hi, no, this is not a model that I would typically fit. It’s kind of a weird model, an edge case. Still, we want Stan and its affiliate programs to behave well even in edge cases.

Regarding “we can’t just try all combos of centered and non-centered for every grouped parameter because that’s \mathcal{O}(2^K) options with K groups,” I do have an idea, actually a research idea:

We can do experimentation. Suppose our model has K variance parameters (for simplicity, assume they’re scalar variance parameters; I’ll outsource all multivariate thinking to Ben). We can do iterations randomly setting each parameter to centered or non-centered, then when we have a bunch of data, we could do something like a hedonic regression to estimate, for each parameter, whether centered or non-centered makes things run faster.

Another option would be to set the parameters as centered or non-centered, one at a time. Like picking a lock.

A

Bob_Carpenter · April 6, 2018, 7:54pm

That might work. There are a lot of algorithms that do things like this these days, so there are even analysis techniques if we can find someone who wants to do that kind of theory.

anon75146577 · April 6, 2018, 8:19pm

I think the theory is easy on this one. You can mix MC kernels randomly as long as the allocation is in no way dependent on the chain.

Also XL has a paper about this a while back. Some experiments on GPs that i can’t find right now showed that it didn’t really help in most cases, so you’re doubling your work for very little benefit.

http://www.stat.harvard.edu/Faculty_Content/meng/jcgs.2011-article.pdf

rsteckel · April 6, 2018, 9:10pm

I suspect I am running into a similar problem with a hierarchical logistic regression that is extremely unbalanced (< 1% of the binary labeled outcome is true). It sounds like steps per iteration could help diagnose the problem? How can I calculate how to calculate it?

andrewgelman · April 6, 2018, 9:13pm

Perhaps it would help to use informative priors.

andrewgelman · April 8, 2018, 5:31am

I can try to write this up as a conjecture, then maybe a PhD student can do it as a research project!

avehtari · April 8, 2018, 5:13pm

My guess is that it’s not the unbalancedness, but the fact that one class has a small number of observations and then it’s more likely that the problem is separable and you need informative prior. So it’'s the same problem as with balanced classes but small number of observations in both groups.

Topic		Replies	Views
Stan_glmer model's iterations is running slow General	6	950	February 4, 2021
Stan_glmer speed for large sample sizes rstanarm	4	2105	April 19, 2021
Parallel chains with rstan fails in Rstudio RStan	10	2618	July 5, 2020
Stan_lmer really slow Modeling	10	1639	May 14, 2021
Significant Differences between `rstan` and `rstanarm` running times Modeling rstanarm	1	1525	August 8, 2020

Stan_glmer slow when data variance is really small

1. Set up

2. Simulate a data structure with N_per_person measurements on each of J people

Simulate data from a between-person experiment

3. Fit model using rstanarm

Related topics