Looping over many fits fills hard drive

rstan

#1

I am fitting a model 9000 times to random samples from my data using for loops. One run of the code fills about 70GB of hard drive space. Any idea what’s going on?

The model is compiled outside the loop and then fit with sampling(). No objects remain in the R environment after the run that would add up to 70GB. The code is run in the following R chunk in an Rmd file.

library(rstan)
library(tibble)
library(dplyr)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# Results table: one row per simulated fit.
hurdle_sim <- tibble(iter = numeric(),
                     coverage = numeric(),
                     MO_true = numeric(),
                     MO_pred = numeric())

# Compile the Stan model once to avoid recompiling in the loop.
hurdle_model <- stan_model(file = "interaction_hurdle_model.stan")

for (i in 1:1000) {
  for (j in seq(10, 90, by = 10)) {
    # Draw a random subsample of 134 rows, then "observe" j% of them.
    sim_dat <- sample_n(ffmm.dat.p2001, 134)
    sim_obs <- sample_n(sim_dat, round(134 * (j / 100)))

    stan_dat <- list(N = length(sim_obs$MO_TotCat),
                     y = sim_obs$MO_TotCat,
                     unobs = 134 - length(sim_obs$MO_TotCat),
                     shape = MO_shape_hand,
                     rate = MO_rate_hand)

    fit <- sampling(object = hurdle_model,
                    data = stan_dat,
                    iter = 1000,
                    chains = 4,
                    open_progress = FALSE,
                    verbose = FALSE)

    # Posterior mean prediction for the unobserved portion, summed across columns.
    pred <- as.data.frame(extract(fit, pars = "y_pred")) %>%
      summarize_all(mean) %>%
      mutate(sum_unobs = rowSums(.))

    MO_pred <- sum(sim_obs$MO_TotCat) + pred$sum_unobs

    hurdle_sim <- hurdle_sim %>%
      add_row(iter = i,
              coverage = j,
              MO_true = sum(sim_dat$MO_TotCat, na.rm = TRUE),
              MO_pred = MO_pred)
  }
}

Operating System: Windows 10
R version: 3.5.1
RStan release: 2.17.2


#2

A quick follow-up:

I’ve found the files that account for the 70GB of hard drive space. Each iteration of the loop writes four files to User/AppData/Local/Temp. I assume these correspond to the four parallel chains I’m running. Each is uniquely named Rtmp******, where each * is a random letter or number.

These files are written regardless of whether I run the code as an R chunk or in a standalone R script.

My current workaround is to manually delete the thousands of Rtmp files after each simulation, but this seems like it shouldn’t be necessary. Is there some way to have rstan clean up the Rtmp files after each iteration so they don’t accumulate?
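(For reference, something like the following should confirm where the space is going; it is only a sketch, and it assumes the Rtmp* directories land next to the current session's tempdir(), which seems to be the case on my Windows machine.)

temp_root <- dirname(tempdir())     # e.g. .../AppData/Local/Temp
rtmp_dirs <- list.files(temp_root, pattern = "^Rtmp", full.names = TRUE)
all_files <- list.files(rtmp_dirs, recursive = TRUE, full.names = TRUE)
sum(file.size(all_files)) / 1e9     # total size in GB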

Might this be a bug in rstan?


#3

@sweenejo, for what it is worth - I had to make some modifications to my simulation code when I began estimating in rstan. I don’t know the size of your files or the specific model(s) that you are estimating, but maybe my approach can at least help somewhat:

I also compile the model(s) before running replications. Then, instead of looping over replications, I switched my code over to sapply(), so that if I’m running 50 replications I do something like this: mystan <- sapply(1:50, FUN = function(r) { ### insert estimation, output, garbage clean-up here ### }). Within the function I save my results, remove the fit objects, and run gc() twice before the next model estimates. I used to be able to run only two replications before noticing a significant slow-down, or crashing my virtual instances (oops!), but now I can run through as many replications as I need (only 100, due to the size of the model and speed of compilation) without issue. A minimal sketch of the pattern is below.
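Roughly, the skeleton looks like this. my_model.stan, make_stan_data(), and summarize_fit() are placeholders for your own model file and helpers; only the overall structure matters:

library(rstan)

# Compile once, outside the replications.
my_model <- stan_model(file = "my_model.stan")      # placeholder file name

run_one_rep <- function(r) {
  stan_dat <- make_stan_data(r)                     # placeholder: build data for replication r
  fit <- sampling(my_model, data = stan_dat,
                  iter = 1000, chains = 4,
                  open_progress = FALSE, verbose = FALSE)
  out <- summarize_fit(fit, r)                      # placeholder: keep only the small summaries
  rm(fit, stan_dat)                                 # drop the large objects...
  gc(); gc()                                        # ...and force garbage collection twice
  out
}

mystan <- sapply(1:50, FUN = run_one_rep)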


#4

Thanks for your response @Shauna_Sweet. I tried removing the stored objects and running gc() twice at the end of each iteration, but that didn’t clean up the residual Rtmp directories filling up my hard drive.

I then stumbled across unlink(), which can delete files and folders from R. The last lines inside my loop now look like this:

rm(sim_dat, sim_obs, stan_dat, fit, pred, MO_pred)
gc()
unlink(file.path("C:/Users/j.sweeney/AppData/Local/Temp", "Rtmp*"), recursive = TRUE)

The recursive = TRUE argument is necessary, since the Rtmp entries are directories. It feels like I’m throwing the kitchen sink at this problem, but it does the trick.
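A more portable variant of the same idea, in case the temp location differs between machines, might look like the sketch below. It is untested beyond my setup and assumes the leftover directories sit next to the current session's tempdir(); it deliberately spares the active session's own Rtmp directory:

temp_root <- dirname(tempdir())               # resolves to .../AppData/Local/Temp here
rtmp_dirs <- list.files(temp_root, pattern = "^Rtmp", full.names = TRUE)
rtmp_dirs <- setdiff(rtmp_dirs, tempdir())    # keep the active session's own directory
unlink(rtmp_dirs, recursive = TRUE)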


#5

@sweenejo - good luck! I am going to look at adding unlink() to my simulation code as well. Any opportunity to reduce the space taken up by the replications is one that shouldn’t be missed!