Possible to lower memory usage?

I’m testing performance of parameter recovery for different models, and using fairly large datasets (although I’m not sure what size of data is considered large in stan), and running out of memory/RAM when fitting.

More specifically, I am testing a 2 arm bandit model from the hBayesDM package. The model I’m using has a slight customization in that the “tau” parameter has a maximum of 20 instead of 5 as in the linked code.

I’ve simulated 1000 subjects using the same model, and each subject does 1000 bandit trials. Each subject has randomly generated values for the two parameters.

I’m doing this to get a decent coverage of the parameter space, as these kinds of models tend to have difficulties finding a good fit for some parameter ranges.

MCMC is infeasible for this dataset (50 subjects with 1000 trials is around 2h on my laptop), so I’m using the variational method.

The fitting process manages to complete, and I get to where the output says it’s drawing samples from the posterior and then it says “COMPLETE” for that process. This is when memory use increases enormously, resulting in a crash.

I’ve tested the same data/model combination in regular rstan, and in cmdstanpy. Same thing happens, so it seems to be a general issue with stan and not an issue with the specific interfaces. I started in Windows, but have also tried in Linux on the same laptop, using a minimal install of Fedora without a desktop environment to maximise the available memory, but got the same issue there.

As the fitting itself looks like it completes, I guess it could have something to do with how stan collates all the data at the end?

For now, I’ve settled for 500 subjects, and that completes to the end, with RAM use maxing out just below my limit.

Am I doing something wrong? Are there any settings I could change that could alleviate the issue?

I guess it’s this in generated quantities

  // For posterior predictive check
  real y_pred[N, T];

which is then 1000x1000 matrix, so it’s 8MB and

How many draws are you asking and how much memory do you have? E.g. 1000 draws would mean y_pred would take 8GB, but during the subsequent operations there can be more than one copy of that 8GB. You could test by dropping generated quantities whether that helps and generate needed quantities separately.

1 Like

Oh, nice catch, that’s indeed the culprit!

I commented out the lines for the y_pred variable in generated quantities. Now memory use is much much lower, and I can run the 1000 subjects data without problems.

Since I have the simulation data to compare with I guess I don’t really need the predictive check here anyway, the loglike is sufficient.

Thank you!

If you don’t mind using cmdstan, it manages this kind of situation better from my experience. Even if R session is aborted after completing a big model, you can still extract the posterior from csv files. I usually specify csv names and the saving folder manually, then extract posteriors using the following code. You can extract parameters of interest according to the task (e.g., loo or posterior predictive simulation) and RAM limitation. This is also faster to load draws comparing the default method when model is big, details can be found here Summary method slow for large models

extract_fit <- function(fit) {
  cmdstanfiles <- list()
  for (f in fit$output_files()) {
    cmdstanfiles[[f]] <- data.table::fread(cmd = paste0("grep -v '^#' --color=never ", f))
  draws_fit <- cmdstanfiles %>%
    bind_rows() %>%
    dplyr::select(-starts_with(c("parameters of interest"))) %>%

Another thing that can help when using cmdstan is using the n_chains and threads arguments for sampling as that will only use one process that uses the same data for all of the threads

1 Like

Oh interesting! I tried in cmdstanpy, and it had the same issue. But you’re saying that as long as it completed the fitting process, the csv files are “complete” so to speak and I can still fetch the data?

Do n_chains and threads work when using variational method? I found MCMC to be way too slow to be useful in this case. But if I do use MCMC, how do you mean exactly? If I have, say, 4 chains, you’re saying if I increase threads then all the chains will share the same data? But if I have 4 chains and increase parallel chains then they use separate data?

1 Like

Just my understanding, cmdstan is saving results progressively to csv files so it is memory efficient. So far, I haven’t had ram issues when fitting the model. Regarding why R session is aborted after fitting is complete, I guess it is because by doing fit <- mod$sample(...), R will load all posteriors to ram, then if it is out or memory, it will abort the session. At least in my case I can still fetch the data from csv files, but one must specify the csv folder when fitting the model, otherwise, they will be saved in the temp folder by default (for R, not sure about Python) and when R session is aborted, temp folder will be deleted. I also specify the csv file names by output_basename . However, I think all of these are bypass solutions, even if we have all posteriors saved, we may still face out of memory issue when playing with them for posterior prediction simulation, etc. This approach is only useful when generated variables like log_lik or predicted draws are taking a lot of space, then one can only extract either posteriors of all parameters or generated variables for desired calculations, and hopefully, they are small enough.


No parallel chains only work with mcmc in cmdstan. How large is your data and parameter sizes? if one or the other is a few hundred thousand then I get using vi but often I can make my mcmc model run much faster by doing simpler things like making semi-informative priors.

In cmdstan, if you set the n_chains and threads argument to say 4 or something then a single process will startup that will run 4 chains each with their own thread, but sharing the same data over all of the threads. Its nice for models with a lot of data as instead of making a copy of the data for each chain we just have one copy of the data that is shared across all of the chains.

1 Like

In this case, the model has two parameters. The data, for one subject, is 1000 trials (or timesteps, in this case) where each trial/timestep has two datapoints (a choice, and a reward). So for 50 subjects, that’s 50*1000*2=100 000 datapoints. When I tried MCMC for 50 subjects, it took 2h. I don’t want to know how long 1000 subjects would take :)

You can make your code more efficient in several ways.

  • This can be vectorized
  vector<lower=0, upper=1>[N] A;
  vector<lower=0, upper=5>[N] tau;

  for (i in 1:N) {
    A[i]   = Phi_approx(mu_pr[1]  + sigma[1]  * A_pr[i]);
    tau[i] = Phi_approx(mu_pr[2] + sigma[2] * tau_pr[i]) * 5;

to this:

vector<lower=0, upper=1>[N] A = Phi_approx(mu_pr[1]  + sigma[1]  * A_pr);
vector<lower=0, upper=5>[N] tau = Phi_approx(mu_pr[2] + sigma[2] * tau_pr) * 5;
  • The subject and trial loop can be vectorized. It’d be faster to use a loop to define a vector ev and then vectorize the call to categorical logit outside the loop. It’s a bit tricky to code as you want to put the ev values into an array and flatten the choice vector to just the relevant ones. I’d also suggest replacing the ragged wide form coding of choice with a long form version that could be sliced in the way described in the User’s Guide for ragged data.

Unfortunately, none of this is going to make it orders of magnitude faster because you have intrinsically ragged operations that aren’t easy to vectorize. On the other hand, this model would be easy to parallelize over subjects.

1 Like