Happy New Year, everyone! I reviewed the previous topics about memory but could not find an answer, so I am creating a new one. I apologize if I am duplicating information.
I am fitting a model to a relatively large dataset on the university's HPC cluster and need to estimate the memory required so I can submit memory-efficient jobs. Here is the model syntax.
dghirt.stan (7.3 KB)
This is the code I am using to fit the model (as a test run, but also to generate starting values for initializing the full model later):
data_resp <- list(
  J     = length(unique(mpr_long$iid)),                # number of unique item ids
  I     = length(unique(mpr_long$pid)),                # number of unique person ids
  n_obs = nrow(mpr_long),                              # total number of observations
  p_loc = mpr_long$pid,                                # person index for each observation
  i_loc = mpr_long$iid,                                # item index for each observation
  RT    = log(mpr_long$task_response_time_ms / 1000),  # log response time in seconds
  Y     = mpr_long$task_grade                          # scored response
)
# Compile the model syntax
mod <- cmdstan_model('./dghirt.stan')

# Fit the model
fit_init <- mod$sample(
  data          = data_resp,
  seed          = 1234,
  chains        = 1,
  iter_warmup   = 150,
  iter_sampling = 250,
  refresh       = 10,
  adapt_delta   = 0.95
)

# Save the model object
fit_init$save_object('./dghirt_mpr_initialize.RDS')
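(As an aside, my own habit is to also check the size of the raw CmdStan CSV output files and keep them alongside the RDS, so I can re-read the draws later without re-running; the CSV text is not a tight bound on the in-memory size, but it gives a crude first hint:)

# Total size of the CmdStan CSV output on disk, in GiB
sum(file.size(fit_init$output_files())) / 1024^3

# Keep the raw CSVs so the draws can be re-read later if needed
fit_init$save_output_files(dir = '.', basename = 'dghirt_mpr_initialize')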
For one of the datasets I am using, I = 170,341 and J = 2,504, and the dimensions of all parameters in the model syntax are defined in terms of I and J.
When I first ran this model with these settings, it finished the iterations but was killed due to OOM while saving the model object on the last line. I had requested 8 GB of memory, and the job peaked at 5.29 GB before it was killed.
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 2-02:45:11
CPU Efficiency: 99.31% of 2-03:06:29 core-walltime
Job Wall-clock time: 2-03:06:29
Memory Utilized: 5.29 GB
Memory Efficiency: 66.18% of 8.00 GB
Then, I increased the memory request and resubmitted the job. It ran the iterations and completed successfully after saving the model object. This time, it used 22.26 GB.
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 4-07:52:30
CPU Efficiency: 99.34% of 4-08:33:47 core-walltime
Job Wall-clock time: 4-08:33:47
Memory Utilized: 22.26 GB
Memory Efficiency: 92.74% of 24.00 GB
So, I assume most of the memory was consumed at the end while saving the model object, and that the size of this object is a function of the following (a rough sketch follows the list):
- number of model parameters
- number of chains
- number of sampling iterations per chain
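Here is the back-of-the-envelope arithmetic I have in mind. The per-person and per-item parameter counts are my guesses, not values taken from dghirt.stan, and this ignores sampler diagnostics and any transformed/generated quantities:

# Rough size of the stored draws, assuming 8-byte doubles for every
# stored quantity. The multipliers 6 and 4 are placeholders for the
# number of parameters per person and per item in dghirt.stan.
draws_gb <- function(n_params, n_chains, n_sampling) {
  n_params * n_chains * n_sampling * 8 / 1024^3
}
n_params <- 6 * 170341 + 4 * 2504                    # guessed parameter count
draws_gb(n_params, n_chains = 1, n_sampling = 250)   # ~1.9 GB

That comes out far smaller than the ~17 GB I observed, so either I am undercounting the stored quantities or saving needs several transient copies of the draws while reading the CSVs.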
How can I verify, from the dimensions of the parameters in the model, that saving the model object took about 17 GB of memory?
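One empirical check I can think of (assuming the saved object loads cleanly with readRDS, and if I have the cmdstanr accessors right) is to read the saved fit back in on a node with enough RAM and measure it directly:

# Read the saved fit back in and measure the stored draws
fit_saved <- readRDS('./dghirt_mpr_initialize.RDS')
d <- fit_saved$draws()                # draws array: iterations x chains x variables
print(object.size(d), units = "GB")   # in-memory size of the stored draws
dim(d)                                # 250 x 1 x (number of stored quantities)

# Count the quantities stored per draw (parameters plus any
# transformed parameters / generated quantities)
sizes <- fit_saved$metadata()$stan_variable_sizes
sum(vapply(sizes, prod, numeric(1)))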
I want to fit the same model to the data with four chains, 250 warm-up iterations, and 1250 sampling iterations.
fit <- mod$sample(
  data            = data_resp,
  seed            = 1234,
  chains          = 4,
  parallel_chains = 4,
  iter_warmup     = 250,
  iter_sampling   = 1250,
  refresh         = 10,
  init            = list(start, start, start, start)
)
# Save the model object
fit$save_object('/gpfs/projects/edquant/cengiz/duolingo_dghirt/dghirt_mpr.RDS')
I am trying to predict how much memory I will need for this new setup. It will take a long time to finish, so I don't want to wait a couple of weeks only to hit an OOM error at the end while saving the model object.
Assuming it consumes around 5.3 GB while running a single chain, I expect it will take about 21–22 GB to run four chains simultaneously. Then, how do I scale up the memory required to save the object at the end, given that it took about 17 GB during the initial run with 150 warmup and 250 sampling iterations?
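My naive extrapolation, assuming the save-time memory is dominated by the stored sampling draws and scales linearly with them, would be:

# Naive linear scaling of the save-time memory with the number of
# stored sampling draws (chains x iter_sampling); warmup draws are
# not saved by default, so they are left out
old_draws <- 1 * 250    # initial run
new_draws <- 4 * 1250   # planned run
17 * new_draws / old_draws   # ~340 GB if the scaling really is linear

If that is right, the planned run would need far more memory at save time than during sampling, which is why I want to sanity-check the scaling before submitting.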
Thank you for any input and guidance on how memory requirements work for running jobs and saving model objects. It would be helpful to understand how much memory is required to run the iterations and to save the model object, based on the model syntax (i.e., the sizes of the model parameters).
#####################################################
Additional information:
When I repeated the process for a different dataset (I = 170,341, J = 5,886), the job was first killed due to OOM. Note that this run used 100 warmup iterations and 200 sampling iterations.
fit <- mod$sample(
  data          = data_resp,
  seed          = 1234,
  chains        = 1,
  iter_warmup   = 100,
  iter_sampling = 200,
  refresh       = 10,
  adapt_delta   = 0.95
)

# Read the output files into the fitted object and save it
fit$save_object('./dghirt_vic.RDS')
State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 2-22:10:13
CPU Efficiency: 99.28% of 2-22:40:53 core-walltime
Job Wall-clock time: 2-22:40:53
Memory Utilized: 6.47 GB
Memory Efficiency: 64.68% of 10.00 GB
Then, after I increased the memory request, it finished successfully.
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 2-15:33:32
CPU Efficiency: 99.36% of 2-15:58:04 core-walltime
Job Wall-clock time: 2-15:58:04
Memory Utilized: 18.40 GB
Memory Efficiency: 76.67% of 24.00 GB
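For what it's worth, the two completed runs look roughly consistent with my assumption that the stored sampling draws dominate the peak memory: J differs between the datasets (2,504 vs. 5,886) but is tiny next to I = 170,341, so the draw count should dominate:

# Peak memory of run 1 (250 draws) scaled to run 2's draw count (200)
22.26 * 200 / 250   # predicts ~17.8 GB; run 2 actually peaked at 18.40 GB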