Estimating memory requirements based on model syntax

Happy New Year, everyone! I reviewed the previous topics about memory but couldn't find an answer, so I am creating a new one. I apologize if I am duplicating information.

I am fitting a model to a relatively large dataset on the university's HPC cluster and need to estimate how much memory to request so that the jobs are memory-efficient. Here is the model syntax.
dghirt.stan (7.3 KB)

This is the code I am using to fit the model (as a test run, but also to obtain starting parameters for initializing the model later).

data_resp <- list(
  J              = length(unique(mpr_long$iid)),
  I              = length(unique(mpr_long$pid)),
  n_obs          = nrow(mpr_long),
  p_loc          = mpr_long$pid,
  i_loc          = mpr_long$iid,
  RT             = log(mpr_long$task_response_time_ms/1000),
  Y              = mpr_long$task_grade
)

# Compile the model syntax

mod <- cmdstan_model('./dghirt.stan')

# Fit the model

fit_init <- mod$sample(
  data            = data_resp,
  seed            = 1234,
  chains          = 1,
  iter_warmup     = 150,
  iter_sampling   = 250,
  refresh         = 10,
  adapt_delta     = 0.95)

# Save the model object

fit_init$save_object('./dghirt_mpr_initialize.RDS')

For one of the datasets I am using, I = 170,341 and J = 2,504, and the dimensions of all parameters in the model syntax are defined in terms of I and J.

When I initially ran this model with these settings, it finished the iterations but was killed due to OOM while saving the model object on the last line. I had requested 8 GB of memory, and the job reported a maximum of 5.29 GB utilized before it was killed.

State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 2-02:45:11
CPU Efficiency: 99.31% of 2-03:06:29 core-walltime
Job Wall-clock time: 2-03:06:29
Memory Utilized: 5.29 GB
Memory Efficiency: 66.18% of 8.00 GB

Then, I increased the memory request and resubmitted the job. It ran the iterations and finished successfully after saving the model object. This time, it used 22.26 GB.

State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 4-07:52:30
CPU Efficiency: 99.34% of 4-08:33:47 core-walltime
Job Wall-clock time: 4-08:33:47
Memory Utilized: 22.26 GB
Memory Efficiency: 92.74% of 24.00 GB

So, I assume most of the memory was consumed at the end while saving the model object, and that the size of this model object is a function of:

  • number of model parameters
  • number of chains
  • number of sampling iterations per chain

How can I verify, based on the dimensions of the parameters in the model, that saving the model object took the extra ~17 GB of memory?
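
For reference, here is the rough check I can do after the fact on the test run (a sketch only; it measures the stored draws and the file on disk, not the peak memory save_object() needed while building the object):

# Size of the serialized RDS on disk, in GiB
file.size('./dghirt_mpr_initialize.RDS') / 1024^3

# Size of the posterior draws held inside the saved fit object
fit_test <- readRDS('./dghirt_mpr_initialize.RDS')
format(object.size(fit_test$draws()), units = "Gb")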

I want to fit the same model to the data with four chains, 250 warm-up iterations, and 1250 sampling iterations.

fit <- mod$sample(
  data            = data_resp,
  seed            = 1234,
  chains          = 4,
  parallel_chains = 4,
  iter_warmup     = 250,
  iter_sampling   = 1250,
  refresh         = 10,
  init            = list(start,start,start,start))

# Save the model object

fit$save_object('/gpfs/projects/edquant/cengiz/duolingo_dghirt/dghirt_mpr.RDS')

I am trying to predict how much memory I will need for this new setup. The run will take a long time to finish, so I don't want to wait a couple of weeks only to hit an OOM error at the end while saving the model object.

Assuming it consumes around 5.3 GB while sampling with a single chain, I expect it will take about 21-22 GB to run four chains simultaneously. But how do I scale up the memory required to save the object at the end, given that it took about 17 GB of additional memory during the initial run with 150 warm-up and 250 sampling iterations?
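
My rough reasoning, assuming the extra memory at save time is dominated by the stored posterior draws (warm-up draws are not saved by default), is that it should scale roughly linearly with chains times sampling iterations:

# Ratio of saved draws: planned run vs. test run
test_draws    <- 1 * 250      # 1 chain,  250 sampling iterations
planned_draws <- 4 * 1250     # 4 chains, 1250 sampling iterations
planned_draws / test_draws    # = 20, so roughly 20x as many draws to hold and serialize

But I am not sure how much of the extra ~17 GB in the test run was really draw storage rather than other overhead, which is why I am asking.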

Thank you for any input and guidance on understanding how memory requirements work for running jobs and saving model objects. It would be helpful to understand how much memory is required to run the iterations and to save the model object, based on the model syntax (i.e., the sizes of the model parameters).

#####################################################

Additional information:

When I repeated the process for a different dataset (I = 170,341, J = 5,886), the job was again killed due to OOM at first. Note that this run used 100 warm-up iterations and 200 sampling iterations.

fit <- mod$sample(
  data            = data_resp,
  seed            = 1234,
  chains          = 1,
  iter_warmup     = 100,
  iter_sampling   = 200,
  refresh         = 10,
  adapt_delta     = 0.95)

# Save the model object

fit$save_object('./dghirt_vic.RDS')

State: OUT_OF_MEMORY (exit code 0)
Cores: 1
CPU Utilized: 2-22:10:13
CPU Efficiency: 99.28% of 2-22:40:53 core-walltime
Job Wall-clock time: 2-22:40:53
Memory Utilized: 6.47 GB
Memory Efficiency: 64.68% of 10.00 GB

Then, after I increased the memory request, it finished successfully.

State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 2-15:33:32
CPU Efficiency: 99.36% of 2-15:58:04 core-walltime
Job Wall-clock time: 2-15:58:04
Memory Utilized: 18.40 GB
Memory Efficiency: 76.67% of 24.00 GB

There are two sources of memory pressure.

  1. The biggest source of dynamic memory pressure is automatic differentiation, which happens for every evaluation of the log density and gradient. It builds up a directed acyclic graph of all of the arithmetic involved and requires 16 bytes per parameter plus 32–48 bytes per arithmetic operation, depending on things like caching of intermediates.

  2. Saving the draws, which is going to require 8 bytes per parameter per draw. I believe there's a way to have cmdstanr stream this to disk, in which case it only needs to hold one or two draws at a time (that is, about 16 bytes per parameter). So if you can stream the draws out and read them back in later, you won't run into memory issues from this; see the sketch below.
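
For example, something along these lines (a sketch; CmdStan already writes each chain's draws to CSV as it samples, so the idea is to keep those CSV files and avoid save_object(), which reads all of the draws back into R before serializing them):

# Keep the CmdStan CSV output files instead of serializing the whole fit object
dir.create('./draws', showWarnings = FALSE)
csv_files <- fit$save_output_files(dir = './draws', basename = 'dghirt_mpr')

# Later, read back only the variables you actually need, e.g. the item parameters
item_draws <- cmdstanr::read_cmdstan_csv(csv_files, variables = "item")

# Or rebuild a fit object from the CSVs
fit2 <- cmdstanr::as_cmdstan_fit(csv_files)

Reading everything back at once still costs the full 8 bytes per parameter per draw, so the savings come from reading only a subset of the variables at a time.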

Your model has \mathcal{O}(I + J) parameters.
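
As a back-of-the-envelope check (a sketch; the values_per_draw count below is an assumption based on the declared dimensions and should be adjusted to match the parameters and transformed parameters actually saved by dghirt.stan):

# Raw storage for the saved draws at 8 bytes per value per draw
I <- 170341
J <- 2504
values_per_draw <- 6 * I + 5 * J   # assumed rough count; adjust to your model
chains          <- 1
iter_sampling   <- 250
bytes <- 8 * values_per_draw * chains * iter_sampling
bytes / 1024^3   # about 2 GB of raw draw storage for the test run

The peak at save time can be several times that raw figure, because R holds intermediate copies while parsing the CSVs and serializing the object, which is part of why the jump you observed is larger.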

If you're going to stick with the 2 x 2 covariance matrices, you might want to just factor them into two scales \sigma_1, \sigma_2 \in (0, \infty) and a correlation \rho \in (-1, 1). The LKJ prior on a 2 x 2 correlation matrix is just a prior on \rho, and in your case of \text{LKJ}(1), it's uniform.

If you'd rather stick with matrices, I'd suggest using Cholesky factors, since that saves a factorization inside the distribution. It will also cut down a bit on memory overhead.

We're about to roll out a built-in sum-to-zero vector type with a more robust parameterization than negating the sum of the other entries.

You can also cut down on memory pressure by reducing the number of redundant calculations in your likelihood: calculate things once, save them, and reuse them. This goes for the log, log1m, and Bernoulli logit calculations.

There are some other things that could be vectorized to reduce code size, but nothing else that’s going to help a lot with memory.

Thank you for the suggestions. After reviewing the advice from a similar post I made years ago about a reduced version of this model, along with your additional tips, this is what I came up with. I will test it to see how much faster it runs and how much difference it makes in memory usage.

// Stan model syntax combining the Modified DG-IRT for response accuracy and
// DG-LNRT for response time

data{
    int <lower=1> J;                       // number of examinees          
    int <lower=1> I;                       // number of items
    int <lower=1> n_obs;                   // number of observations (J x I - missing responses)
    array[n_obs] int<lower=1> p_loc;       // person indicator   
    array[n_obs] int<lower=1> i_loc;       // item indicator
    array[n_obs] real RT;                  // vector of log of responses
    array[n_obs] int<lower=0,upper=1> Y;   // vector of item responses
}

parameters {
  
// Parameters common for both response time and response accuracy component  
  
  vector<lower=0,upper=1>[I] pC; // vector of length I for the probability of item compromise status
  
  vector<lower=0,upper=1>[J] pH; // vector of length J for the probability of examinee item preknowledge

// Parameters for the item parameters component  

  real mu_beta;                     // mean for time intensity parameters
  real mu_alpha;                    // mean for log of time discrimination parameters
  real mu_b;                        // mean for item difficulty parameters
  vector<lower=0>[3] sigma_I;       // vector of standard deviations for item parameters
  cholesky_factor_corr[3] Lomega_I; // Cholesky factors of 3x3 correlation matrix for item parameters
  
// Parameters for the person parameters component  
  
  real mu_thetat;                   // mean for theta_t
  vector<lower=0>[2] sigma_P;       // vector of standard deviations for latent trait parameters
  cholesky_factor_corr[2] Lomega_P; // Cholesky factors of 2x2 correlation matrix for person parameters


  real<lower=0> mu_delta_tau;        // mean for change from tau_t to tau_c
  real<lower=0> mu_delta_theta;      // mean for change from theta_t to theta_c
  vector<lower=0>[2] sigma_D;        // vector of standard deviations for change in person parameters
  cholesky_factor_corr[2] Lomega_D;  // Cholesky factors of 2x2 correlation matrix for change parameters
 
  array[I] vector[3] item;
  array[J] vector[2] person;
  array[J] vector<lower=0>[2] delta;
}


transformed parameters{
  vector[2] mu_P = [0,mu_thetat]';                 // vector for mean of person parameters
                                                   // mu_taut is fixed to 0
  vector[3] mu_I = [mu_alpha,mu_beta,mu_b]';       // vector for mean of item parameters
  vector[2] mu_D = [mu_delta_tau,mu_delta_theta]'; // vector for mean of change parameters

  // The sum of b parameters is fixed to 0 for the response accuracy component
  // this is for model identification for the response accuracy component
  
  vector[I] b_star;
  
  for (i in 1:I) {
    b_star[i] = item[i][3];
  }
  
  vector[I] b = b_star;
  b[1] = -sum(b_star[2:I]);
}

model{

  sigma_P   ~ exponential(1);
  mu_thetat ~ normal(0,1);
  Lomega_P  ~ lkj_corr_cholesky(1);
    
  sigma_I   ~ exponential(1);
  mu_beta   ~ normal(4,1);
  mu_alpha  ~ lognormal(0,0.5);
  mu_b      ~ normal(0,1);
  Lomega_I  ~ lkj_corr_cholesky(1);
   
  sigma_D        ~ exponential(1);
  mu_delta_tau   ~ normal(0,1);
  mu_delta_theta ~ normal(0,1);
  Lomega_D       ~ lkj_corr_cholesky(1);
 
  person     ~  multi_normal_cholesky(mu_P,diag_pre_multiply(sigma_P, Lomega_P));
  item       ~  multi_normal_cholesky(mu_I,diag_pre_multiply(sigma_I, Lomega_I));
  delta      ~  multi_normal_cholesky(mu_D,diag_pre_multiply(sigma_D, Lomega_D));

  pC ~ beta(1,1);
  pH ~ beta(1,1);
  
// Joint density of response time and response accuracy  

  vector[I] log1m_pC = log1m(pC);
  vector[I] log_pC   = log(pC);
  vector[J] log1m_pH = log1m(pH);
  vector[J] log_pH   = log(pH);

  vector[n_obs] norm_p_t;
  vector[n_obs] norm_p_c;
  vector[n_obs] bern_p_t;
  vector[n_obs] bern_p_c;
  
  for (i in 1:n_obs) {
    // item[i_loc[i]][2] is the beta parameter for the (i_loc[i])th item
    // item[i_loc[i]][1] is the alpha parameter for the (i_loc[i])th item
    // person[p_loc[i]][1] is tau_t for the (p_loc[i])th person
    // (person[p_loc[i]][1]+delta[p_loc[i]][1]) is tau_c for the (p_loc[i])th person
    
    real p_taut = item[i_loc[i]][2] - person[p_loc[i]][1];
    real p_tauc = item[i_loc[i]][2] - (person[p_loc[i]][1]+delta[p_loc[i]][1]);
    real alpha  = 1/exp(item[i_loc[i]][1]);
    norm_p_t[i] = normal_lpdf(RT[i] | p_taut, alpha);  
    norm_p_c[i] = normal_lpdf(RT[i] | p_tauc, alpha);  
    
    // b[i_loc[i]] is the b parameter for the (i_loc[i])th item
    // person[p_loc[i]][2] is theta_t for the (p_loc[i])th person
    // (person[p_loc[i]][2]+delta[p_loc[i]][2]) is theta_c for the (p_loc[i])th person

    real p_thetat = person[p_loc[i]][2] - b[i_loc[i]];  
    real p_thetac = (person[p_loc[i]][2]+delta[p_loc[i]][2]) - b[i_loc[i]];  
    bern_p_t[i] = bernoulli_logit_lpmf(Y[i] | p_thetat); 
    bern_p_c[i] = bernoulli_logit_lpmf(Y[i] | p_thetac); 
  }
  
  matrix[n_obs,4] lprt;
  lprt[,1] = log1m_pC[i_loc] + log1m_pH[p_loc] + norm_p_t;  // T = 0, C=0
  lprt[,2] = log1m_pC[i_loc] + log_pH[p_loc]   + norm_p_t;  // T = 1, C=0
  lprt[,3] = log_pC[i_loc]   + log1m_pH[p_loc] + norm_p_t;  // T = 0, C=1
  lprt[,4] = log_pC[i_loc]   + log_pH[p_loc]   + norm_p_c;  // T = 1, C=1 

  matrix[n_obs,4] lpr;
  lpr[,1] = log1m_pC[i_loc] + log1m_pH[p_loc] + bern_p_t;  // T = 0, C=0
  lpr[,2] = log1m_pC[i_loc] + log_pH[p_loc]   + bern_p_t;  // T = 1, C=0
  lpr[,3] = log_pC[i_loc]   + log1m_pH[p_loc] + bern_p_t;  // T = 0, C=1
  lpr[,4] = log_pC[i_loc]   + log_pH[p_loc]   + bern_p_c;  // T = 1, C=1 

  vector[n_obs] lps;
  for (i in 1:n_obs) {
    lps[i] = log_sum_exp(lprt[i,]) + log_sum_exp(lpr[i,]);
  }
  
  // Add the marginalized joint likelihood to the target density
  target += sum(lps);
}




I ran the code on simulated data with I = 50, J = 500, and n_obs = 25,000. For a single chain with 150 warm-up iterations and 150 sampling iterations, the total run time is reduced by about 60% (from 13,391 seconds to 5,316 seconds).

Thank you again for all the tips and suggestions.