I’m new to Stan / BRMS and have hit a bit of a snag. I’m trying to model data where the outcome variable of interest is the number of tries a participant takes to get a particular question/item right (out of a maximum of 26 tries), and we’d like to investigate the effect of having been exposed to the type of material the item is drawn from before. Each participant is presented with either 38 or 58 items, and each item is drawn from a different subject to which the subject has either been exposed or not. There are about 4500 items exposed to 100 subjects, with some items being used more than once but not a lot. Probably about 2700 unique items, 4500 rows of data.
Initially I tried modeling it this way, in a jupyter notebook. df_long has the following relevant columns:
num_guesses is the outcome variable of interest, and it is an integer between 1 and 26
Exposed_to_subject is an integer betwee
df_long_for_stan_forum.csv (3.3 KB)
n 0 and 100, where 0 means “has not been exposed to the subject matter” and an integer n between 1 and 100 means “was exposed to the subject matter n days ago”.
Model_ans_prob is a language model’s attempt at the task (number of guesses it took the language model to get it right), which we hope will pull out a lot of variance that isn’t due to Exposed_to_subject, because the items vary greatly in difficulty (some are no-brainers, some you will never get without methodically going through a lot of the possible 26 answers). Exposed_to_subject is what we’re really interested in as a predictor of num_guesses. It is a float between 0 and 1.
ppt.code is an alphanumeric code that is unique to each participant.
The reason we are doing it this way is the data is sparse. We would like an estimate for each participant of what the probability of getting it right in 1 try, 2 tries, 3… all the way to 26 for both exposed and not-exposed. This distribution is power law / Zipfian, with there being many more right-in-1-tries than right-in-26-tries.
Problem is, answering only 58 items max per participant you’re only going to get a subset of all 26 probabilities (each participant has a bunch of 1s (getting right on the first try), maybe a few 2s, missing 3 entirely, has a few 4s, etc.). But we DO have a lot of information on how people perform overall on a given type of item, exposed or not exposed. It’s easy to build a 1-26 tries curve of exposed / not exposed, since we have a large number in either category. Hence the smoothing.
So, here was the first attempt:
%%R -i df_long # make sure rpy2 was loaded using %load_ext rpy2.ipython library(cmdstanr) # Read library require(brms) # Make sure Exposed_to_subject is treated as factor df_long$Exposed_to_subject<-as.factor(df_long$Exposed_to_subject) # Model itself bay_model<-brm(num_guesses|trunc(ub=26)~ s(Model_ans_prob)+ Exposed_to_subject+ s(Model_ans_prob,by=Exposed_to_subject)+ (1|ppt.code), family="poisson", control=list(adapt_delta=0.99, max_treedepth=14), iter=4000, warmup=1000, cores=16, chains = 4,# opencl = opencl(c(0, 0)), init = "0", refresh=0, backend="cmdstanr", data=df_long, save_model="bay_model" # cpp_options = list("STAN_OPENCL"=TRUE) )
This ran for ten days, and then I decided to pull the plug (having a hot laptop with fans screaming is no fun).
Here is the second try (running now, approaching two days). I factorized ppt.code as well as making Exposed_to_subject binary, so instead of “how many years ago were you exposed?” it is now just, “have you been exposed or not?” I’m also not running it in jupyter / from python now, just doing it in RStudio.
df_long <- read.csv("/some_csv.csv", header=TRUE, stringsAsFactors=FALSE) library(cmdstanr) # Read library require(brms) # Make sure Exposed_to_subject_binary is treated as factor df_long$Exposed_to_subject_binary<-as.factor(df_long$Exposed_to_subject_binary) # Make sure ppt.code is treated as factor df_long$ppt.code<-as.factor(df_long$ppt.code) # Model itself bay_model<-brm(num_guesses|trunc(ub=26)~ s(Model_ans_prob)+ Exposed_to_subject_binary+ s(Model_ans_prob,by=Exposed_to_subject_binary)+ (1|ppt.code), family="poisson", control=list(adapt_delta=0.99, max_treedepth=14), iter=4000, warmup=1000, cores=16, chains = 4,# opencl = opencl(c(0, 0)), init = "0", refresh=0, backend="cmdstanr", data=df_long, save_model="bay_model" # cpp_options = list("STAN_OPENCL"=TRUE) )
This is taking a really long time on a linux laptop with an i7-12700H processor. What can I do to speed it up? Have I mis-specified or made any obvious mistakes?
Some ideas that occur to me:
Specify the model better somehow
Specify such that I’m making better use of the computing resources I’ve got. I spent some time looking in to parallelization, and found this, but it’s not clear that’ll be a huge speedup. I tried to implement GPU utilization with an A100 in google CoLab, but ran into some errors. It’s not even clear to me that the massive parallelization something like an A100 offers will be of any use.
Pay for compute resources somewhere and just let it run. Is there somewhere pretty easy to do this? I’m happy to pay for it, I just don’t know where makes sense. Google CoLab will kill your sessions after a while (even with the “Pro” subscription), and from what I could figure out the CPUs you get in a CoLab session aren’t amazing…it looked like my laptop’s CPU was better.
My CPU has 20 total threads, with 6 performance cores and 8 efficient cores. Did I do something bad by specifying 16 cores and 4 chains? Maybe the performance cores are being held back by a lagging efficient core? Maybe 16 is too many, since I only have 14 physical cores?
Is it possible somehow to pause or stop the model fit and save progress? The problem becomes more tractable if my laptop doesn’t have to be a screaming beast all the time…!
Thank you in advance for the help, and if there’s anything I can supply please ask. The attached is 95 rows of the ~4500 row data file.