Out of memory when calculating loo with a large log_lik (8G)

Dear Stan community,

May I ask how much memory the loo() function is supposed to use for a log_lik matrix of a known size? In my case, log_lik is an 8 GB matrix. I use only one core to calculate loo(log_lik, save_psis = FALSE), and it aborts with "Error: cannot allocate vector of size 8.1 G". My PC has 32 GB of RAM; when I launched loo, RAM usage went up to 30 GB within a few seconds, and after a few minutes it simply ran out of memory.

I also tried on an HPC node with 256 GB of RAM. The calculation finished without errors or warnings, but when calculating loo for the other model, which has a log_lik of the same size, it ran out of memory. I switched the order of the two models just to make sure the problem is not model specific, and it always fails on the second one. I printed the memory usage after calculating loo; it is only 12 GB, as shown below, and the loo result itself is very small, around 34 MB, which makes sense.

[1] "2022-03-16 21:20:56 EDT"
loo net:1: 2612.792 sec elapsed
[1] "memory used:"
12.1 GB
             used    (Mb) gc trigger    (Mb)    max used    (Mb)
Ncells    3141003   167.8    9253544   494.2    11566930   617.8
Vcells 1496597583 11418.2 7875944800 60088.7 12306163748 93888.6
[1] "loo restuls used 34.584888MB"
[1] "------------------------------------------------------------------------"

As you can see, I also called gc() between the calculations to free some memory, but it didn't help.

Is there a way to solve this issue, or could we mimic the way CmdStan manages RAM? Model fitting never has this problem, even with all the log_lik values and posteriors. I am not a CS person, so I am not sure what is going on here.

A side question: loo took almost 1 h to finish; do you think that is normal? I suspect most of the time was spent on loading the CSV files with cmdstanr. I will try extracting log_lik first and report the actual calculation time.

Thank you very much.

Michelle

Hi,
Can you also tell the number of rows and columns in your log_lik matrix (that is, the number of posterior draws and the number of observations)? I assume you are using cmdstanr and hand-written Stan code?

1 Like

Thanks, @avehtari. dim(log_lik) returns 4000 by 270165, i.e., 1000 samples × 4 chains and 270165 data points.
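For a rough check, that is about what a dense double-precision matrix of those dimensions should occupy:

# 4000 draws x 270165 observations x 8 bytes per double
4000 * 270165 * 8 / 2^30  # ~8.05 GiB, in line with the ~8.1 G reported for log_lik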

Btw, to correct the total memory size I have on the HPC: it is 125 GB, not 250 GB. I requested 2 nodes, so in total it should be 250 GB, but for some reason R is not using all of the memory.

Below is the memory profiling log from my latest test. During the loo calculation, memory usage went up to 111 GB, and the calculation took almost 1 h. Note that free RAM is 92 GB after the first model's loo calculation, which might explain why the second one ran out of memory, given that the peak RAM usage is 111 GB.

Totalram:  125.527 GiB
Freeram:   120.708 GiB
[1] "2022-03-17 14:37:47 EDT"
loo cmdstan net:2: 3394.022 sec elapsed
[1] "memory used:"
13.6 GB
Totalram:  125.527 GiB
Freeram:    92.783 GiB
Size:   28.050 GiB
Peak:  111.199 GiB

And here is the code that generated this log. mem_used() is from the pryr package (GitHub - hadley/pryr) and Sys.meminfo() / Sys.procmem() are from the memuse package (GitHub - shinra-dev/memuse).

print(Sys.time())
tic(paste0("loo cmdstan net:", 2))
fit_list_all$loo_list_all1[[2]] <- fit_list_all$fit_list_all1[[2]]$loo(cores = 1, save_psis = FALSE)
toc()
print("memory used:")
mem_used()
Sys.meminfo()
Sys.procmem()
gc()
print(paste0("loo restuls used ", object_size(fit_list_all$loo_list_all1[2])/1e6, "MB"))
print("------------------------------------------------------------------------")

Sorry for replying with multiple posts; I just got the other test results. Regarding my previous side question:

It seems that most of the computation time in fit$loo() is indeed spent on reading the CSV files. I first loaded the CSVs with data.table, extracted the log_lik matrix, and then used the loo.matrix method, which took only 1186 s instead of the 3394 s with fit$loo(). Please see my rough code below and the output log.

print(Sys.time())
tic(paste0("loo cmdstan net:", 2))
fit_list_all$loo_list_all1[[2]] <- fit_list_all$fit_list_all1[[2]]$loo(cores = 1, save_psis = FALSE)
toc()
print("memory used:")
mem_used()
Sys.meminfo()
Sys.procmem()
gc()
print(paste0("loo restuls used ", object_size(fit_list_all$loo_list_all1[2])/1e6, "MB"))
print("------------------------------------------------------------------------")

print(Sys.time())
tic(paste0("loo net:", 2))
# extract cmdstan log_lik -------------------------------------------------------------------------
extract_log_lik <- function(fit) {
  cmdstanfiles <- list()
  for (f in fit$output_files()) {
    cmdstanfiles[[f]] <- data.table::fread(cmd = paste0("grep -v '^#' --color=never ", f))
  }
  LLmat <- cmdstanfiles %>%
    bind_rows() %>%
    dplyr::select(starts_with(c("log_lik"))) %>%
    as.matrix()
  return(LLmat)
}
LLmat <- extract_log_lik(fit_list_all$fit_list_all1[[2]])
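# rows of LLmat are stacked chain by chain (one CSV per chain, 1000 post-warmup
# draws each), so chain_id below repeats each chain index 1000 times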
rel_n_eff <- relative_eff(exp(LLmat), chain_id = rep(1:4, each = 1000))
fit_list_all$loo_list_all1[[2]] <- loo(LLmat, r_eff = rel_n_eff, cores = 1, save_psis = FALSE)
rm(list = c("LLmat", "rel_n_eff"))
gc()
toc()
print("memory used:")
mem_used()
Sys.meminfo()
Sys.procmem()
gc()
print(paste0("loo restuls used ", object_size(fit_list_all$loo_list_all1[2])/1e6, "MB"))
print("------------------------------------------------------------------------")
Totalram:  125.527 GiB
Freeram:   120.708 GiB
[1] "2022-03-17 14:37:47 EDT"
loo cmdstan net:2: 3394.022 sec elapsed
[1] "memory used:"
13.6 GB
Totalram:  125.527 GiB
Freeram:    92.783 GiB
Size:   28.050 GiB
Peak:  111.199 GiB
             used    (Mb) gc trigger    (Mb)    max used     (Mb)
Ncells    3316351   177.2   10592529   565.8    13240660    707.2
Vcells 1672780130 12762.3 7207111668 54985.9 14076389974 107394.4
[1] "loo restuls used 39.684408MB"
[1] "------------------------------------------------------------------------"
[1] "2022-03-17 15:34:26 EDT"
             used    (Mb) gc trigger    (Mb)    max used     (Mb)
Ncells    3367475   179.9   10592529   565.8    13240660    707.2
Vcells 1799699325 13730.7 8302733441 63344.9 14076389974 107394.4
loo net:2: 1186.969 sec elapsed
[1] "memory used:"
14.6 GB
Totalram:  125.527 GiB
Freeram:    77.898 GiB
Size:   41.633 GiB
Peak:  111.199 GiB
             used    (Mb) gc trigger    (Mb)    max used     (Mb)
Ncells    3367517   179.9   10592529   565.8    13240660    707.2
Vcells 1799699363 13730.7 5313749403 40540.7 14076389974 107394.4
[1] "loo restuls used 39.684352MB"
[1] "------------------------------------------------------------------------"

After these two loo calculations, I ran a third one just to test what happens with cores = 4, and it ran out of memory, aborting the script.

Ok, that's a lot. How many parameters do you have? If the number of observations per parameter is very big, you may not need loo at all. If you still think loo would be useful, I recommend using subsampling loo as discussed in the vignette Using Leave-one-out cross-validation for large data • loo.
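Roughly, the call pattern looks like this. Below is a minimal sketch with a toy normal model and made-up names (llfun_normal, data_df), just to show the shape of the interface; the vignette has the real details:

library(loo)

# Per-observation log-likelihood function: loo_subsample() calls it with one row
# of `data` (data_i) and the posterior draws, and expects one value per draw.
llfun_normal <- function(data_i, draws) {
  dnorm(data_i$y, mean = draws[, "mu"], sd = exp(draws[, "log_sigma"]), log = TRUE)
}

# Toy posterior draws and data, only so the sketch runs on its own.
set.seed(1)
draws   <- cbind(mu = rnorm(4000, 0, 0.1), log_sigma = rnorm(4000, 0, 0.1))
data_df <- data.frame(y = rnorm(1000))

# Only `observations` data points are evaluated, so the full
# draws-by-observations log_lik matrix is never formed.
loo_ss <- loo_subsample(
  llfun_normal,
  observations = 100,
  data  = data_df,
  draws = draws,
  r_eff = rep(1, nrow(data_df)),  # plug in relative_eff() values for real MCMC draws
  cores = 1
)
print(loo_ss)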

Some Stan developers are examining faster options, but it may take some time before that helps you.

1 Like

Thanks, I have 4780 parameters. I am reading the vignette these days; it will take some time to digest :)

In my experience, using data.table::fread() can speed up the reading process.

Although there might be better solutions like subsampling loo, it still puzzles me: may I ask how to free the memory after one loo computation? As you can see here, the final loo result is only ~30 MB, but 12 GB of RAM is still in use. I wonder how I can free this 12 GB after loo has finished; gc() does not seem to help. Sorry, I am not a CS person, so I am not good at programming and so on. Thanks.

That's a lot, too. What to recommend now also depends on the structure of your model. Would you like to tell more?

Maybe @jonah can help to answer this?

Yes, sure. I actually mentioned the model in this post How to calculate log_lik in generated quantities of a multivariate regression model - #6 by Michelle. The model is not so complex. I have ~350 individuals, each with ~2000 observations. The 2000 observations of an individual are not independent but show some autocorrelation, and I model them with a linear variate-covariate model with a known design matrix plus AR(1) and Gaussian residuals. We also know that the design-matrix parameters of all individuals follow a kind of hierarchical structure, and I expect this hierarchy to help regularize the individual parameter estimates, since 2000 observations are rather noisy. The data are therefore modeled as a multivariate regression in which the design matrix is known from our domain knowledge and the parameters follow a hierarchical structure. That is why I have more than 4000 parameters: the design-matrix and AR(1) parameters per individual, plus other higher, group-level parameters.

Since we can make different assumptions about the hierarchical structure, I have several different models, and I hope model comparison can tell which model is more generalizable or has better predictive ability. I also have one model doing massive univariate regressions without the hierarchical structure, to check whether each individual's data alone is enough to estimate the parameters of interest well, such that we would not need a multilevel model. You mentioned previously to consider K-fold cross-validation, but if we respect the hierarchical structure, I am not sure I can hold out a subset of individuals and still fit the model, since it is hard to decide how to partition the data. Or maybe I should hold out part of the observations while keeping all individuals in each fold, but that means fitting the model multiple times, and I am not sure whether the autocorrelation would cause problems.
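As a side note, I see the loo package has helpers for both of these partitioning schemes. A tiny sketch with a made-up grouping vector, just to illustrate the two options (not something I have run on the real data):

library(loo)

# made-up grouping vector: individual id for each observation
# (here only 12 observations from 4 individuals, to keep the sketch tiny)
individual <- rep(1:4, each = 3)

# option 1: leave whole individuals out (each individual lands in exactly one fold)
folds_by_individual <- kfold_split_grouped(K = 2, x = individual)

# option 2: leave observations out while keeping every individual
# (roughly) represented in each fold
folds_by_obs <- kfold_split_stratified(K = 2, x = individual)

folds_by_individual
folds_by_obs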

May I ask why we might not need loo at all in this case? What should I consider if I would like to do model comparison?

Thank you in advance @jonah. Based on what I checked, the objects in the session take up only a few MB; during the loo computation memory can go up to ~90 GB, and 12-20 GB remain in use after the calculation, even though the final loo result is very small, a few tens of MB. I cannot free this 12-20 GB in R after the calculation; I tried gc() and it did not work. I also tried ls() to list all objects in the session and print their sizes, but none of them is larger than 1 GB. So I guess it might be related to loo, or it is an inherent issue of R? This causes a problem: I cannot run loo for several models in one script, because the memory usage accumulates and eventually aborts the session. For instance, with 4 models, the first 3 accumulate around 60 GB, and loo itself needs ~90 GB to process an 8 GB log_lik (so using more cores is basically not possible in this case). Note that I did call rm() followed by gc() to free LLmat and rel_n_eff.

Hmm, I would have suggested gc(), which you already tried.

Is it possible that most of the 12 GB is taken up by the CmdStanMCMC objects (CmdStanR's fitted model objects)? If you call any methods that result in the CSV files being read in (e.g., summary(), draws()), then log_lik may be read into memory.
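One quick way to check this (a rough sketch; fit stands for one of your CmdStanMCMC objects, and it assumes pryr is loaded):

library(pryr)

object_size(fit)        # should be fairly small if only the CSV file paths are stored
invisible(fit$draws())  # any such call reads the CSV files and caches the draws in the object
object_size(fit)        # now includes the cached draws, including log_lik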

1 Like

Thanks @jonah, and sorry for the late reply. I was running some more tests, reported here in case they help.

Briefly, I did not call summary() or draws() anywhere in the code. Please find the code and output log below, where I ran out of memory when using cores = 2 for loo on an HPC node with 125 GB of RAM. Before this chunk of code, I only loaded the cmdstanr output file names, not the file contents; as you can see from the outputs of the first mem_used() and Sys.meminfo(), there was not much RAM usage. I am not sure exactly how fit$loo() works, but it used 88 GB during the calculation. The total size of the four CSV files is around 8 GB. With 125 GB of RAM and cores = 2, I can only get through one iteration before running out of memory.

mem_used()
Sys.meminfo()
tic("total")
for (inet in seq(1, 5, by = 1)){
  print(Sys.time())
  tic(paste0("loo net:", inet))
  fit_list_all$loo_list_all1[[inet]] <- fit_list_all$fit_list_all1[[inet]]$loo(cores = 2, save_psis = FALSE)
  print(gc())
  toc()
  print("memory used:")
  print(mem_used())
  print(Sys.meminfo())
  print(Sys.procmem())
  print("loo restuls used:")
  print(object.size(fit_list_all$loo_list_all1[inet]), units = 'Mb')
  print("------------------------------------------------------------------------")
}
toc()

# here is the output log
2.56 GB
Totalram:  125.527 GiB
Freeram:   118.920 GiB
[1] "2022-03-19 11:33:13 EDT"
             used    (Mb)  gc trigger    (Mb)    max used    (Mb)
Ncells    3199607   170.9     9626308   514.2    12032885   642.7
Vcells 1496432873 11416.9 10244684390 78160.8 10145765944 77406.1
loo net:1: 2335.795 sec elapsed
[1] "memory used:"
12.2 GB
Totalram:  125.527 GiB
Freeram:    87.598 GiB
Size:  32.504 GiB
Peak:  88.865 GiB
[1] "loo restuls used:"
14.4 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 12:12:10 EDT"

Then I tried extracting log_lik with data.table::fread and calculating loo via the loo.matrix method. I also added rm(list = c("LLmat", "rel_n_eff")) to free some memory. As you can see in the code and output log below, LLmat is an 8 to 14 GB matrix depending on the model. In this case gc() seems to have freed the memory, and the free RAM reported by Sys.meminfo() did not shrink much across iterations. However, in iteration 4 free RAM dropped to 83 GB, whereas after the previous 3 iterations it was around 91 GB. The 5th iteration was then aborted, since its LLmat is a huge 14 GB and I assume it would require more than 100 GB for the loo calculation. An interesting difference from the fit$loo() method is that the computation time dropped from 2335 s to 702 s, since reading the CSVs is faster with data.table::fread.

# extract cmdstan log_lik -------------------------------------------------------------------------
extract_log_lik <- function(fit) {
  cmdstanfiles <- list()
  for (f in fit$output_files()) {
    cmdstanfiles[[f]] <- data.table::fread(cmd = paste0("grep -v '^#' --color=never ", f))
  }
  LLmat <- cmdstanfiles %>%
    bind_rows() %>%
    dplyr::select(starts_with(c("log_lik"))) %>%
    as.matrix()
  return(LLmat)
}

mem_used()
Sys.meminfo()
tic("total")
for (inet in seq(1, 5, by = 1)){
  print(Sys.time())
  tic(paste0("loo net:", inet))
  LLmat <- extract_log_lik(fit_list_all$fit_list_all1[[inet]])
  print("LLmat size:")
  print(object.size(LLmat), units = 'Gb')
  rel_n_eff <- relative_eff(exp(LLmat), chain_id = rep(1:4, each = 1000), cores = 2)
  fit_list_all$loo_list_all1[[inet]] <- loo(LLmat, r_eff = rel_n_eff, cores = 2, save_psis = FALSE)
  rm(list = c("LLmat", "rel_n_eff"))
  print(gc())
  toc()
  print("memory used:")
  print(mem_used())
  print(Sys.meminfo())
  print(Sys.procmem())
  print("loo restuls used:")
  print(object.size(fit_list_all$loo_list_all1[inet]), units = 'Mb')
  print("------------------------------------------------------------------------")
}
toc()

# here is the output log
2.56 GB
Totalram:  125.527 GiB
Freeram:   118.984 GiB
[1] "2022-03-19 11:31:39 EDT"
[1] "LLmat size:"
8.1 Gb
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   3193661  170.6    9618045   513.7    9618045   513.7
Vcells 417147478 3182.6 6708778957 51184.0 7984197202 60914.6
loo net:1: 702.384 sec elapsed
[1] "memory used:"
3.52 GB
Totalram:  125.527 GiB
Freeram:    91.666 GiB
Size:  28.506 GiB
Peak:  68.763 GiB
[1] "loo restuls used:"
33 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 11:43:22 EDT"
[1] "LLmat size:"
9.3 Gb
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   3314859  177.1    9618045   513.7    9618045   513.7
Vcells 546533654 4169.8 7775981052 59326.1 9229463628 70415.3
loo net:2: 867.967 sec elapsed
[1] "memory used:"
4.56 GB
Totalram:  125.527 GiB
Freeram:    92.384 GiB
Size:  28.502 GiB
Peak:  83.935 GiB
[1] "loo restuls used:"
37.8 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 11:57:50 EDT"
[1] "LLmat size:"
2.9 Gb
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   3314917  177.1    9618045   513.7    9618045   513.7
Vcells 588067259 4486.6 3981302300 30375.0 9229463628 70415.3
loo net:3: 265.536 sec elapsed
[1] "memory used:"
4.89 GB
Totalram:  125.527 GiB
Freeram:    92.001 GiB
Size:  28.502 GiB
Peak:  83.935 GiB
[1] "loo restuls used:"
12 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 12:02:17 EDT"
[1] "LLmat size:"
10.9 Gb
            used   (Mb) gc trigger    (Mb)    max used    (Mb)
Ncells   3484979  186.2    9954183   531.7    12442728   664.6
Vcells 740346562 5648.4 9290392227 70880.1 10992480762 83866.0
loo net:4: 1104.283 sec elapsed
[1] "memory used:"
6.12 GB
Totalram:  125.527 GiB
Freeram:    83.313 GiB
Size:   37.787 GiB
Peak:  103.247 GiB
[1] "loo restuls used:"
44.7 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 12:20:42 EDT"
[1] "LLmat size:"
14.1 Gb

My summary so far: 1) compared to fit$loo(), the data.table::fread method reduces computation time a lot by reading the CSVs faster; 2) gc() seems to free the memory with the data.table::fread method, but I am not sure why free RAM suddenly dropped to 83 GB after the 4th iteration; 3) it is unclear whether gc() can free memory with the fit$loo() method.

So I reduced to cores = 1 to see whether free RAM stays similar after each iteration with fit$loo(); please see the output log below. Basically, using 1 core needs less RAM during the loo computation, so the R session got through 2 iterations before being aborted. There are several interesting observations: 1) gc() does not seem to free the memory, since free RAM shrinks iteration by iteration, from 87 GB to 67 GB; 2) the computation time is not very different from cores = 2, since most of the time is spent reading the CSVs; 3) the loo result object is larger than in the cores = 2 case, for instance 51.5 MB vs. 14.4 MB for the 1st iteration, which I don't understand.

2.56 GB
Totalram:  125.527 GiB
Freeram:   118.244 GiB
[1] "2022-03-19 12:42:45 EDT"
             used    (Mb) gc trigger    (Mb)    max used    (Mb)
Ncells    3198105   170.8    9626336   514.2    12032920   642.7
Vcells 1496699258 11419.0 9845017126 75111.6 12306271407 93889.4
loo net:1: 2812.685 sec elapsed
[1] "memory used:"
12.2 GB
Totalram:  125.527 GiB
Freeram:    87.702 GiB
Size:   32.503 GiB
Peak:  104.967 GiB
[1] "loo restuls used:"
51.5 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 13:29:38 EDT"
             used    (Mb)  gc trigger    (Mb)    max used    (Mb)
Ncells    3318024   177.3     9516749   508.3    12032920   642.7
Vcells 2864214235 21852.3 11874811592 90597.7 12787166352 97558.4
loo net:2: 3865.618 sec elapsed
[1] "memory used:"
23.1 GB
Totalram:  125.527 GiB
Freeram:    67.374 GiB
Size:   53.554 GiB
Peak:  118.269 GiB
[1] "loo restuls used:"
59.1 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 14:34:05 EDT"

To verify whether the loo result object size changes with the number of cores, I also ran the cores = 1 case with the data.table::fread method. Please see the output log below. All 5 iterations finished without error, since cores = 1 uses less RAM during loo. The loo result object sizes for the first 4 iterations are identical with cores = 1 and cores = 2. Using 2 cores also substantially reduced the computation time, from 995 s to 702 s.

2.56 GB
Totalram:  125.527 GiB
Freeram:   117.786 GiB
[1] "2022-03-19 13:10:52 EDT"
[1] "LLmat size:"
8.1 Gb
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   3192037  170.5    9618062   513.7    9618062   513.7
Vcells 417143502 3182.6 7953695475 60681.9 9492400556 72421.3
loo net:1: 995.609 sec elapsed
[1] "memory used:"
3.52 GB
Totalram:  125.527 GiB
Freeram:    90.453 GiB
Size:  28.504 GiB
Peak:  76.813 GiB
[1] "loo restuls used:"
33 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 13:27:29 EDT"
[1] "LLmat size:"
9.3 Gb
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   3313235  177.0    9618062   513.7    9618062   513.7
Vcells 546529678 4169.7 7183984884 54809.5 9492400556 72421.3
loo net:2: 1180.265 sec elapsed
[1] "memory used:"
4.56 GB
Totalram:  125.527 GiB
Freeram:    89.081 GiB
Size:  28.503 GiB
Peak:  76.813 GiB
[1] "loo restuls used:"
37.8 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 13:47:09 EDT"
[1] "LLmat size:"
2.9 Gb
            used   (Mb) gc trigger    (Mb)   max used    (Mb)
Ncells   3313268  177.0    9618063   513.7    9618063   513.7
Vcells 587556211 4482.7 3678200262 28062.5 9492400556 72421.3
loo net:3: 336.503 sec elapsed
[1] "memory used:"
4.89 GB
Totalram:  125.527 GiB
Freeram:    89.003 GiB
Size:  28.503 GiB
Peak:  76.813 GiB
[1] "loo restuls used:"
12 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 13:52:47 EDT"
[1] "LLmat size:"
10.9 Gb
            used   (Mb)  gc trigger    (Mb)    max used    (Mb)
Ncells   3483355  186.1    10225040   546.1    12781300   682.6
Vcells 740342586 5648.4 10977053903 83748.3 12456596789 95036.3
loo net:4: 1397.951 sec elapsed
[1] "memory used:"
6.12 GB
Totalram:  125.527 GiB
Freeram:    83.095 GiB
Size:   37.311 GiB
Peak:  113.692 GiB
[1] "loo restuls used:"
44.7 Mb
[1] "------------------------------------------------------------------------"
[1] "2022-03-19 14:16:06 EDT"
[1] "LLmat size:"
14.1 Gb
            used   (Mb)  gc trigger    (Mb)    max used    (Mb)
Ncells   3800823  203.0    12552889   670.4    12781300   682.6
Vcells 936846771 7147.6 11052643396 84325.0 12456596789 95036.3
loo net:5: 1867.559 sec elapsed
[1] "memory used:"
7.71 GB
Totalram:  125.527 GiB
Freeram:    83.896 GiB
Size:   36.726 GiB
Peak:  118.533 GiB
[1] "loo restuls used:"
57.5 Mb
[1] "------------------------------------------------------------------------"

All in all: 1) the data.table::fread method can be useful, especially when log_lik is large; 2) I had wondered why cores = 4 did not improve the loo computation time much in my previous work with a much smaller log_lik, and it now seems the underlying reason is the CSV reading; 3) fit$loo() seems to load all of the fit CSVs for the loo computation, but after the calculation gc() cannot free this RAM usage, so it accumulates from iteration to iteration; 4) the loo computation in general uses roughly 10 times the log_lik matrix size in memory. I don't know the details of the computation itself, but do you think there is a way to free some memory during the calculation?

I apologize for writing so much. Thanks a lot for your help.