Memory issues with Large Item Response Model

Hi, I am trying to fit a large item response model with 39,631 students and around 100 questions in total (the response matrix is sparse, since most students answer only a few questions).

I’m trying to run a single chain with 1000 samples, but I run into memory issues even though I have around 750 GB of RAM.

I assume that a draw of the response matrix gets stored at every iteration, which is likely what blows up memory. Is there some way not to store those draws, or some other best practice for scaling the model?

Welcome to the Stan community. Could you share your model code? That would help in diagnosing any issues.

Only draws of parameters, transformed parameters, and generated quantities are stored. The response matrix would not be stored (assuming it is passed as data).

Hi! Thank you!

So here’s the code. It’s just the boilerplate 2PL model from the Stan documentation:

data {
  int<lower=1> n_users;
  int<lower=1> n_items;
  int<lower=1> n_interactions;

  array[n_interactions] int<lower=1, upper=n_users> user_idx;         // student for each response
  array[n_interactions] int<lower=1, upper=n_items> item_idx;         // question for each response
  array[n_interactions] int<lower=0, upper=1> user_item_interaction;  // binary response, in long format
}
parameters {
  real category_appeal;                        // mean question difficulty
  vector[n_users] affinity_level;              // ability of student j, centered at 0
  vector[n_items] item_appeal;                 // difficulty of item k
  vector<lower=0>[n_items] item_polarization;  // discrimination of item k
  real<lower=0> sigma_appeal;                  // scale of difficulties
  real<lower=0> sigma_polarization;            // scale of log discrimination
}
model {
  affinity_level ~ std_normal();
  item_appeal ~ normal(0, sigma_appeal);
  item_polarization ~ lognormal(0, sigma_polarization);
  category_appeal ~ cauchy(0, 5);
  sigma_appeal ~ cauchy(0, 5);
  sigma_polarization ~ cauchy(0, 5);
  
  user_item_interaction ~ bernoulli_logit(
      item_polarization[item_idx]
      .* (affinity_level[user_idx] - (item_appeal[item_idx] + category_appeal)));
}

Thanks for sharing. Unfortunately, I don’t see any obvious ways to make your model more memory-efficient. Maybe someone else will have some suggestions.

You could always use the thin argument to save only every n-th draw. You could then run multiple thinned chains in sequence and combine the draws after the fact.
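For example, here is a minimal sketch of that idea using CmdStanPy (the file names irt_2pl.stan and irt_data.json are placeholders for your model and data):

import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="irt_2pl.stan")

draws = []
for seed in (1, 2, 3, 4):
    # One thinned chain per run: 1000 sampling iterations, keep every 10th draw.
    fit = model.sample(
        data="irt_data.json",
        chains=1,
        seed=seed,
        iter_sampling=1000,
        thin=10,
    )
    draws.append(fit.stan_variable("affinity_level"))  # shape (100, n_users)

# Combine the thinned draws from the separate runs after the fact.
affinity_draws = np.concatenate(draws, axis=0)

With thin=10, each run keeps only 100 draws per parameter, so each fit’s in-memory footprint is a tenth of an unthinned run’s.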

Do you use CmdStan or some other interface?

CmdStan streams your MCMC draws to a CSV file on disk, which might help with the memory issues.
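You can then read back only the columns you actually need, so the full draw matrix never has to sit in memory at once. A minimal sketch assuming pandas (the output file name is hypothetical):

import pandas as pd

# CmdStan prefixes configuration and adaptation info with '#', so skip those lines
# and load only two scalar parameters from the output file.
draws = pd.read_csv(
    "output_1.csv",
    comment="#",
    usecols=["sigma_appeal", "sigma_polarization"],
)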


I’ve been using PyStan. Do you recommend using CmdStan? I noticed that it allows configuring more parameters (including thin).

I would try to use CmdStan.

I think even CmdStanPy tries to read everything into memory, so vanilla CmdStan is the best option. (Cc @WardBrian)

You can set thin and other parameters with PyStan too.

I’m using PyStan 3.3, and thin is not one of the allowed keyword arguments to pass to the sampler. When I try to pass it I get ValueError: {'json': {'thin': ['Unknown field.']}}

I think num_thin should work, but I need to check this.
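Something like this, as a minimal sketch with PyStan 3 (program_code and data stand in for your model string and data dict):

import stan

posterior = stan.build(program_code, data=data)
fit = posterior.sample(
    num_chains=1,
    num_samples=1000,
    num_thin=10,  # keep every 10th draw
)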

It works! Thank you!