Looping over many fits fills hard drive

I am fitting a model 9000 times to random samples from my data using for loops. One run of the code fills about 70GB of hard drive space. Any idea what’s going on?

The model is compiled outside the loop and then fit using the sampling function. No objects exist in the R environment after the run that would add up to 70GB. The code is run in the following R chunk in an Rmd file.

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

hurdle_sim <- tibble(iter = numeric(),
                     coverage = numeric(),
                     MO_true = numeric(),
                     MO_pred = numeric())

# Compile Stan model to avoid recompiling in the loop.
hurdle_model <- stan_model(file = "interaction_hurdle_model.stan")

for (i in 1:1000) {
  for (j in seq(10, 90, by = 10)) {
    sim_dat <- sample_n(ffmm.dat.p2001, 134)
    sim_obs <- sample_n(sim_dat, round(134 * (j / 100)))

    stan_dat <- list(N = length(sim_obs$MO_TotCat),
                     y = sim_obs$MO_TotCat,
                     unobs = 134 - length(sim_obs$MO_TotCat),
                     shape = MO_shape_hand,
                     rate = MO_rate_hand)

    fit <- sampling(object = hurdle_model,
                    data = stan_dat,
                    iter = 1000,
                    chains = 4,
                    open_progress = FALSE,
                    verbose = FALSE)

    pred <- as.data.frame(extract(fit, pars = "y_pred")) %>%
      summarize_all(mean) %>%
      mutate(sum_unobs = rowSums(.))

    MO_pred <- sum(sim_obs$MO_TotCat) + pred$sum_unobs

    hurdle_sim <- hurdle_sim %>%
      add_row(iter = i,
              coverage = j,
              MO_true = sum(sim_dat$MO_TotCat, na.rm = TRUE),
              MO_pred = MO_pred)
  }
}

Operating System: Windows 10
R version: 3.5.1
RStan release: 2.17.2

A quick follow up:

I’ve found the files that account for the 70GB of hard drive space. Each iteration of the loop writes four files to User/AppData/Local/Temp. I assume these correspond to the four parallel chains I’m running. Each is uniquely named Rtmp******, where each * is a random letter or digit.

These files are written regardless of whether I run the code as an R chunk or in a standalone R script.

My current workaround is to manually delete the thousands of Rtmp files after each simulation, but this seems like it shouldn’t be necessary. Is there some way to have rstan clean up the Rtmp files after each iteration so they don’t accumulate?

Might this be a bug in rstan?

@sweenejo, for what it is worth - I had to make some modifications to my simulation code when I began estimating in rstan. I don’t know the size of your files or the specific model(s) that you are estimating, but maybe my approach can at least help somewhat:

I also compile the model(s) before running replications. Then, instead of looping over replications with for, I switched my code over to use sapply, so that if I’m running 50 replications I do something like this: mystan <- sapply(1:50, FUN = function(r) { ### insert estimation, output, garbage clean-up here ### }). Within the function I save my results, remove the fit objects, and run gc() twice before the next model estimates. I used to be able to run only two replications before noticing a significant slow-down or crashing my virtual instances (oops!), but now I can run through as many replications as I need (only 100, given the size of the model and the speed of compilation) without issue.
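
In rough outline the pattern looks like this (a sketch only; fit_one_rep() and save_results() are placeholders for whatever estimation and output code you already have, not rstan functions):

# Sketch of the sapply pattern described above. fit_one_rep() and
# save_results() are placeholders for your own estimation and output code.
mystan <- sapply(1:50, FUN = function(r) {
  fit <- fit_one_rep(r)    # e.g. a sampling() call on the precompiled model
  save_results(fit, r)     # write whatever results you need to disk
  rm(fit)                  # drop the fit object
  gc()
  gc()                     # collect garbage twice before the next replication
  r
})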

Thanks for your response @Shauna_Sweet. I tried removing the stored objects and running gc() twice at the end of each iteration, but that didn’t clean up the residual Rtmp directories filling up my hard drive.

I then stumbled across unlink(), which can delete files and folders from R. The last lines inside my loop now look like this:

rm(sim_dat, sim_obs, stan_dat, fit, pred, MO_pred)
gc()
unlink(file.path("C:/Users/j.sweeney/AppData/Local/Temp", "Rtmp*"), recursive = T)

The recursive = T argument is necessary because the Rtmp entries are directories. It feels like I’m throwing the kitchen sink at this problem, but it does the trick.
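
A more portable variant of the same clean-up is sketched below (not from the original post): the RtmpXXXXXX directories are created in the parent of tempdir(), so dirname(tempdir()) finds the right Temp folder on any machine, and skipping the current session’s own temp directory avoids deleting something the running session may still need (the wildcard above would otherwise match it).

# Sketch of a portable version of the clean-up above; dirname(tempdir()) is the
# per-user Temp folder that the RtmpXXXXXX directories are written under.
tmp_parent <- dirname(tempdir())
stale <- list.files(tmp_parent, pattern = "^Rtmp", full.names = TRUE)
stale <- stale[basename(stale) != basename(tempdir())]  # keep this session's temp dir
unlink(stale, recursive = TRUE)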

@sweenejo - good luck! I am going to look at adding the unlink() call to my simulation code as well. I feel like any opportunity to reduce the space taken up by the replications shouldn’t be missed!

I am running into the same issue. I think it may be cxxfun_from_dso_bin writing the same executable content to different files over and over again and failing to clean up. I tried changing onexit to TRUE at rstan/rstan/R/cxxfunplus.R:88 of commit 85c9b00a, and it appears to delay the disk filling up but not eliminate it.

(Ubuntu Linux 18.10, EC2 m5a.4xlarge, 64G RAM, 32G disk, R 3.5.1, rstan HEAD 85c9b00a, trying to model 3003 data sets with 4 chains on 16 cores)

I made a minimal reproduction case and reported this as: https://github.com/stan-dev/rstan/issues/597

Did @bgoodri's suggestion do the trick? I’m thinking of modifying my existing code with what he recommended so it’s cleaner and doesn’t require the unlink() step.

Yes, that works.