Tips for speeding up a model fit...I cut it off at 10 days

Hello!

I’m new to Stan / brms and have hit a bit of a snag. I’m trying to model data where the outcome variable of interest is the number of tries a participant takes to get a particular question/item right (out of a maximum of 26 tries), and we’d like to investigate the effect of having previously been exposed to the type of material the item is drawn from. Each participant is presented with either 38 or 58 items, and each item is drawn from a different subject area to which the participant has either been exposed or not. About 4500 items were presented to 100 participants, with some items used more than once, though not often: roughly 2700 unique items and about 4500 rows of data.

Initially I tried modeling it this way, in a Jupyter notebook. df_long has the following relevant columns:

  • num_guesses is the outcome variable of interest; it is an integer between 1 and 26.

  • Exposed_to_subject is an integer between 0 and 100, where 0 means “has not been exposed to the subject matter” and an integer n between 1 and 100 means “was exposed to the subject matter n days ago”.

  • Model_ans_prob is a language model’s attempt at the task (the number of guesses it took the language model to get it right), expressed as a float between 0 and 1. We hope it will pull out a lot of the variance that isn’t due to Exposed_to_subject, because the items vary greatly in difficulty (some are no-brainers; some you will never get without methodically working through many of the 26 possible answers). Exposed_to_subject is what we’re really interested in as a predictor of num_guesses.

  • ppt.code is an alphanumeric code that is unique to each participant.

The reason we are doing it this way is that the data are sparse. We would like, for each participant, an estimate of the probability of getting an item right in 1 try, 2 tries, 3 tries, all the way up to 26, for both exposed and not-exposed items. This distribution is power-law / Zipfian, with many more right-in-1-try outcomes than right-in-26-tries.

The problem is that, with at most 58 items per participant, you only ever observe a subset of all 26 outcomes (each participant has a bunch of 1s (getting it right on the first try), maybe a few 2s, is missing 3 entirely, has a few 4s, and so on). But we DO have a lot of information on how people perform overall on a given type of item, exposed or not exposed. It’s easy to build a 1–26-tries curve for exposed / not exposed, since we have a large number of observations in either category. Hence the smoothing.
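For context, here is roughly how those pooled 1–26 curves could be tabulated. This is only a minimal sketch: it assumes the column names described above and treats any nonzero Exposed_to_subject as “exposed”, which may not be exactly how the final coding is done.

library(dplyr)

# Pooled empirical distribution of num_guesses (1-26) by exposure status.
# Assumption: any nonzero Exposed_to_subject counts as "exposed".
df_long %>%
  mutate(exposed = Exposed_to_subject > 0) %>%
  count(exposed, num_guesses) %>%
  group_by(exposed) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()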

So, here was the first attempt:

%%R -i df_long
# make sure rpy2 was loaded using %load_ext rpy2.ipython

library(cmdstanr)

# Read library
require(brms)

# Make sure Exposed_to_subject is treated as factor
df_long$Exposed_to_subject<-as.factor(df_long$Exposed_to_subject)

# Model itself
bay_model<-brm(num_guesses|trunc(ub=26)~
                  s(Model_ans_prob)+
                  Exposed_to_subject+
                  s(Model_ans_prob,by=Exposed_to_subject)+
                  (1|ppt.code),
                  family="poisson",
                  control=list(adapt_delta=0.99,
                                max_treedepth=14),
                  iter=4000,
                  warmup=1000,
                  cores=16, chains = 4,#  opencl = opencl(c(0, 0)),
                  init = "0",
                  refresh=0,
                  backend="cmdstanr",
                  data=df_long,
                  save_model="bay_model"
               # cpp_options = list("STAN_OPENCL"=TRUE)
               )

This ran for ten days, and then I decided to pull the plug (having a hot laptop with fans screaming is no fun).

Here is the second attempt (running now, approaching two days). I factorized ppt.code as well as making Exposed_to_subject binary, so instead of “how long ago were you exposed?” it is now just “have you been exposed or not?”. I’m also not running it from Jupyter / Python now, just doing it in RStudio.
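(For reference, the binary flag is along these lines; this sketch assumes any nonzero exposure counts as exposed, which may not be exactly how the column in the CSV was built.)

# Derive the binary exposure flag from the original 0-100 variable.
# Assumption: any nonzero Exposed_to_subject means "has been exposed".
df_long$Exposed_to_subject_binary <- as.integer(df_long$Exposed_to_subject > 0)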

df_long <- read.csv("/some_csv.csv", header=TRUE, stringsAsFactors=FALSE)

library(cmdstanr)

# Read library
require(brms)

# Make sure Exposed_to_subject_binary is treated as factor
df_long$Exposed_to_subject_binary<-as.factor(df_long$Exposed_to_subject_binary)

# Make sure ppt.code is treated as factor
df_long$ppt.code<-as.factor(df_long$ppt.code)


# Model itself
bay_model<-brm(num_guesses|trunc(ub=26)~
                 s(Model_ans_prob)+
                 Exposed_to_subject_binary+
                 s(Model_ans_prob,by=Exposed_to_subject_binary)+
                 (1|ppt.code),
               family="poisson",
               control=list(adapt_delta=0.99,
                            max_treedepth=14),
               iter=4000,
               warmup=1000,
               cores=16, chains = 4,#  opencl = opencl(c(0, 0)),
               init = "0",
               refresh=0,
               backend="cmdstanr",
               data=df_long,
               save_model="bay_model"
               # cpp_options = list("STAN_OPENCL"=TRUE)
)

This is taking a really long time on a Linux laptop with an i7-12700H processor. What can I do to speed it up? Have I mis-specified anything or made any obvious mistakes?

Some ideas that occur to me:

  1. Specify the model better somehow

  2. Specify the model such that I’m making better use of the computing resources I’ve got. I spent some time looking into parallelization, and found this, but it’s not clear that it will be a huge speedup (see the threading sketch after this list). I tried to implement GPU utilization with an A100 in Google Colab, but ran into some errors. It’s not even clear to me that the massive parallelization something like an A100 offers will be of any use here.

  3. Pay for compute resources somewhere and just let it run. Is there somewhere pretty easy to do this? I’m happy to pay for it; I just don’t know where makes sense. Google Colab will kill your sessions after a while (even with the “Pro” subscription), and from what I could figure out, the CPUs you get in a Colab session aren’t amazing…it looked like my laptop’s CPU was better.

  4. My CPU has 20 total threads, with 6 performance cores and 8 efficient cores. Did I do something bad by specifying 16 cores and 4 chains? Maybe the performance cores are being held back by a lagging efficient core? Maybe 16 is too many, since I only have 14 physical cores?

  5. Is it possible somehow to pause or stop the model fit and save progress? The problem becomes more tractable if my laptop doesn’t have to be a screaming beast all the time…!
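For idea 2, I gather brms supports within-chain parallelization (reduce_sum) via its threads argument with the cmdstanr backend; would something like the sketch below be worth trying? It also touches idea 4: cores only needs to match the number of chains, and threads beyond that are only used if threading is enabled. I’m assuming 4 chains on 4 cores with 3 threads per chain here; the smooth terms and truncation may limit any gain.

# Hedged sketch: 4 chains on 4 cores, with 3 threads per chain (~12 threads total).
# Whether reduce_sum helps this model is an open question; treat it as an experiment.
bay_model <- brm(
  num_guesses | trunc(ub = 26) ~
    s(Model_ans_prob) +
    Exposed_to_subject_binary +
    s(Model_ans_prob, by = Exposed_to_subject_binary) +
    (1 | ppt.code),
  family = poisson(),
  data = df_long,
  backend = "cmdstanr",
  chains = 4, cores = 4,
  threads = threading(3)  # threads per chain for within-chain parallelization
)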

Thank you in advance for the help, and if there’s anything I can supply, please ask. The attached file, df_long_for_stan_forum.csv (3.3 KB), contains 95 rows of the ~4500-row data file.

Hello @DetentBananaFireball, I suspect that your specification of the smooth terms is the reason that this model is very expensive. Particularly the term

s(Model_ans_prob,by=Exposed_to_subject)

implies a separate thin-plate regression spline of Model_ans_prob for each level of Exposed_to_subject, which, if that variable takes values 0–100, could mean a very large number of smoothers, each probably supported by very little information. Based on your description, I suspect you may prefer a thin-plate regression spline of both of those variables, i.e.

s(Model_ans_prob,Exposed_to_subject)

If you haven’t already, it would probably be reasonable to build and test as much of your model as possible without the smooth terms. That response model seems rather complicated, and without a careful workflow it would be easy to misspecify. There’s also the issue of priors; priors may not be very useful for the smooth terms, but surely there is information you can supply regarding the other parameters.
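For example, a stripped-down baseline along these lines (a minimal sketch, not a recommendation for the final model) should sample quickly and let you check the truncation, family, and group-level structure before adding the smooths back in:

# Baseline without smooth terms, to confirm the response model behaves sensibly.
base_model <- brm(
  num_guesses | trunc(ub = 26) ~ Exposed_to_subject_binary + (1 | ppt.code),
  family = poisson(),
  data = df_long,
  backend = "cmdstanr",
  chains = 4, cores = 4
)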


Thank you for sharing your thoughts, @AWoodward; I really appreciate it. It’s so helpful when getting started with something like this.

I can understand what you mean about the smoothing functions. I would point out that Exposed_to_subject is a float between 0 and 1, not 0–100, but I’m not sure that would make much of a difference. EDIT: I was thinking of Model_ans_prob, which is a float [0,1]. Exposed_to_subject takes integer values 0–100; you’re correct.

  1. I do have a strong prior (in my head) about what num_guesses will look like (power law / Zipfian), and Model_ans_prob will be distributed similarly (although it’s continuous between 0 and 1 right now; I could bin it to 1–26 if that would help). I know the shape of these distributions to a pretty high degree of certainty, because these phenomena have been measured empirically before. How might I specify priors here? (See the prior sketch after this list.)

  2. I’m also happy to throw compute resources at this if it’s possible to buy some that will actually help…any ideas on that?
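On point 1, I see that brms::get_prior will list which parameters accept priors for a given formula; the sketch below only shows the mechanics (the normal(0, 1) is purely illustrative, not a choice I’ve settled on). What I’m less sure about is what informative values to use, given the Zipfian shape.

# List the parameters that priors can be set on for this formula.
get_prior(
  num_guesses | trunc(ub = 26) ~
    s(Model_ans_prob) +
    Exposed_to_subject_binary +
    (1 | ppt.code),
  family = poisson(),
  data = df_long
)

# Example of declaring a prior on the population-level effects;
# normal(0, 1) here is a placeholder, not a recommendation.
my_priors <- prior(normal(0, 1), class = "b")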


In addition to @AWoodward’s advice, I noticed that (along with the high adapt_delta) your max_treedepth is 14, which is quite high. Compute time roughly doubles each time you increase the tree depth by 1. You can start with a lower tree depth to get some idea of what your model is doing.
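For instance, for test runs you could stay close to the defaults (adapt_delta = 0.8, max_treedepth = 10), along the lines of the sketch below, and only raise them if the diagnostics call for it:

# Cheaper sampler settings for test runs; raise only if you see divergences
# or treedepth warnings. These are close to Stan's defaults.
test_control <- list(adapt_delta = 0.9, max_treedepth = 10)
# then pass control = test_control to brm()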


Thanks, this is good to know. I’ll cut it way down when testing.