How to speed up a process?


#1

I’m trying to run a model as follows:

require("rstanarm")
options(mc.cores = 32)
fm <- stan_lmer(Value ~ 1 + pulse + (1 | Sbj) + (1 + pulse | Path),
                data = dat, chains = 32)

with a large dataset:

str(dat)
'data.frame': 459360 obs. of 4 variables:
 $ Sbj  : Factor w/ 44 levels "14","15","20",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Path : Factor w/ 10440 levels "Seed_10-100",..: 2925 2945 2956 2967 2977 2986 2851 2862 2873 2884 ...
 $ Value: num 0.679 0.5819 0.2531 0.0469 1.2375 ...
 $ pulse: num 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 ...

I see messages like the following:

1000 transitions using 10 leapfrog steps per transition would take 7400 seconds.

This is the first time I have used this package, so here are my questions:

  1. How can I estimate the actual runtime from the message above?

  2. Is there any room to speed up the process to some extent?

  3. Any suggestions on the priors in this case?

  4. Do I have to standardize (subtract the mean and divide by the standard deviation) all the variables? I ask because I have read that standardization can simplify the interpretation of the results. Is that the case with rstanarm?

Your help is highly appreciated!


#2
  1. About a month
  2. I would start with stan_lmer(Value ~ 1 + pulse + (1 | Sbj) + (1 | Path), data = dat). You don't need 32 chains; that will run reasonably quickly.
  3. No one besides you knows enough about this data-generating process to say, but the only one that matters is on the standard deviation of the intercept shifts across levels of Path.
  4. I would not divide variables by their empirical standard deviations. Each predictor is centered internally and shifted back in the output, so you don't have to worry about that. The prior on pulse is, by default, a function of its standard deviation. Only divide by constants so that the parameters have reasonable units.
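Putting those suggestions together, a minimal starting call might look like the following (4 chains and 2000 iterations are just the rstanarm defaults made explicit):

library(rstanarm)
options(mc.cores = 4)
# simpler random-effects structure: intercepts only for Sbj and Path
fm0 <- stan_lmer(Value ~ 1 + pulse + (1 | Sbj) + (1 | Path),
                 data = dat, chains = 4, iter = 2000)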

#3

Thanks a lot for the quick help, Dr. Goodrich!

  1. About a month

That’s quite depressing…

I did a test run with 1% of the data and with 4 cores. It took about 7 minutes.
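For what it's worth, a naive linear extrapolation from that test run would be:

# 7 minutes on 1% of the data, scaled up 100x; hierarchical models
# usually scale worse than linearly in both data and parameters,
# so treat this as a lower bound
7 * 100 / 60   # roughly 12 hours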

  2. I would start with stan_lmer(Value ~ 1 + pulse + (1 | Sbj) + (1 | Path), data = dat). You don't need 32 chains; that will run reasonably quickly.

I did that for a test run. However, I would like to say something about the 'pulse' effect for each path, which is why I included the random slopes for 'pulse' in my original model:

fm <- stan_lmer(Value ~ 1 + pulse + (1 | Sbj) + (1 + pulse | Path),
                data = dat, chains = 32)

In other words, I’d like to have something like the following in the output:

Estimates:
                        mean   sd   2.5%   25%   50%   75%   97.5%
b[pulse Path:Seed_10]   ...

Please correct me if I'm wrong.

So, with my original model, there is no hope that I would be able to get it done within a realistic time frame?
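For reference, once the original model has finished, the path-specific pulse slopes can be pulled out with the regex_pars argument (the regular expression below assumes the coefficient naming shown above):

# posterior draws for every b[pulse Path:...] term
slopes <- as.matrix(fm, regex_pars = "b\\[pulse Path")
# or a summary table for just those terms
summary(fm, regex_pars = "b\\[pulse Path")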

  3. No one besides you knows enough about this data-generating process to say, but the only one that matters is on the standard deviation of the intercept shifts across levels of Path.

Is that specified through “prior_covariance”? What would be a good choice other than the default (normal?)?

  4. I would not divide variables by their empirical standard deviations. Each predictor is centered internally and shifted back in the output, so you don't have to worry about that. The prior on pulse is, by default, a function of its standard deviation. Only divide by constants so that the parameters have reasonable units.

Thanks for the clarification!


#4

Could Path possibly be the combination of multiple other variables? If so, you might be able to speed things up by separating them out and treating them as separate effects.
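For example, if Path really encodes a seed-target pair (the level names like "Seed_10-100" hint at this, but the split below is only a guess about the naming scheme), two crossed factors may be far cheaper than 10,440 Path levels:

# hypothetical split of Path into its two components
dat$Seed   <- factor(sub("-.*", "", dat$Path))
dat$Target <- factor(sub(".*-", "", dat$Path))
fm2 <- stan_lmer(Value ~ 1 + pulse + (1 | Sbj) +
                   (1 + pulse | Seed) + (1 + pulse | Target),
                 data = dat)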


#5

There is a chance that it finishes in a week or so; it all depends on the posterior geometry. The prior_covariance argument takes the output of the decov() function, but the default of an exponential prior on the across-groups standard deviation is as good as any.
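Making that default explicit in the original model (all values below are the decov() defaults):

fm <- stan_lmer(Value ~ 1 + pulse + (1 | Sbj) + (1 + pulse | Path),
                data = dat,
                prior_covariance = decov(regularization = 1, concentration = 1,
                                         shape = 1, scale = 1))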