Hi, see code below. It ran fine. But then I re-ran setting sigma_y = 0.01 and it ran really slowly, it must have been doing 1024 steps per iteration. I’m guessing the problem was the centered or non-centered parameterization. We should do something about this, no?
Andrew

2. Simulate a data structure with N_per_person measurements on each of J people

J <- 50 # number of people in the experiment
N_per_person <- 10 # number of measurements per person
person_id <- rep(1:J, rep(N_per_person, J))
index <- rep(1:N_per_person, J)
time <- (index - 1)/(N_per_person - 1) # time of measurements, from 0 to 1
N <- length(person_id)

Simulate data from a between-person experiment

z <- sample(c(0,1), J, replace=TRUE)
a <- rnorm(J, 0, 1)
b <- rnorm(J, 1, 1) * 0
theta <- 0.1
sigma_y <- 1
y_pred <- a[person_id] + b[person_id]time + thetaz[person_id]*time
y <- rnorm(N, y_pred, sigma_y)

Yeah, currently the only option is the non-centered parameterization, which rarely has divergences with stan_glmer but can definitely be slower when sigma_y -> 0. @bgoodri What do you think about allowing the user to specify that the centered parameterization be used?

Indeed. But I’m not sure where things stand on that. Last time I checked I think we had somewhat arrived at a consensus that we couldn’t find the right point to switch without trying both ways, right?

I get 57.5 seconds with sigma_y = 0.1. But the bigger problem is that lme4 formula parser does not output design matrices for the centered parameterization.

Yes, I’d imagine that sigma=0.1 would be a lot faster than sigma=0.01. When I wrote that last email, I re-ran it at 0.01, and so far it’s only done 10% of each of the 4 chains.

Is this all part of your relax from specified to very tight prior?

Those tightly scaled priors aren’t so good for Stan sampling without rescaling.

Running for fixed wall time won’t help—it’ll run for your patience limit then report that it failed.

In general, we can’t just try all combos of centered and non-centered for every grouped parameter because that’s \mathcal{O}(2^K) options with K groups. I’d guess we wouldn’t really need all \mathcal{O}(2^K) runs to figure it out, though. But this kind of ad-hoc logic inside a model is very difficult to implement and can lead to some very hard-to-maintain and hard-to-modify code.

Hi, no, this is not a model that I would typically fit. It’s kind of a weird model, an edge case. Still, we want Stan and its affiliate programs to behave well even in edge cases.

Regarding “we can’t just try all combos of centered and non-centered for every grouped parameter because that’s \mathcal{O}(2^K) options with K groups,” I do have an idea, actually a research idea:

We can do experimentation. Suppose our model has K variance parameters (for simplicity, assume they’re scalar variance parameters; I’ll outsource all multivariate thinking to Ben). We can do iterations randomly setting each parameter to centered or non-centered, then when we have a bunch of data, we could do something like a hedonic regression to estimate, for each parameter, whether centered or non-centered makes things run faster.

Another option would be to set the parameters as centered or non-centered, one at a time. Like picking a lock.

That might work. There are a lot of algorithms that do things like this these days, so there are even analysis techniques if we can find someone who wants to do that kind of theory.

I think the theory is easy on this one. You can mix MC kernels randomly as long as the allocation is in no way dependent on the chain.

Also XL has a paper about this a while back. Some experiments on GPs that i can’t find right now showed that it didn’t really help in most cases, so you’re doubling your work for very little benefit.

I suspect I am running into a similar problem with a hierarchical logistic regression that is extremely unbalanced (< 1% of the binary labeled outcome is true). It sounds like steps per iteration could help diagnose the problem? How can I calculate how to calculate it?

My guess is that it’s not the unbalancedness, but the fact that one class has a small number of observations and then it’s more likely that the problem is separable and you need informative prior. So it’'s the same problem as with balanced classes but small number of observations in both groups.