# Stan scaling limitations

What is a practical limit for the amount of data Stan is able to scale to?

I have a population of people, each taking a set of questions. The data are a set of 1,186,766 measurements of 149 questions from 215,032 people.

Each person took between one and ten questions, and each question was given to a long-tailed range of 27 to
95,753 people. As such, I’m feeding in a relatively sparse matrix of 32,039,768 items.
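As a quick sanity check on those figures (back-of-the-envelope arithmetic, not from the original post):

```python
# Check the dataset dimensions described above.
n_people = 215_032
n_questions = 149
n_responses = 1_186_766

# Size of the full (dense) person-by-question matrix:
n_cells = n_people * n_questions
print(n_cells)           # 32039768 -- matches the matrix size quoted above

# Fraction of cells that actually hold a response:
density = n_responses / n_cells
print(f"{density:.1%}")  # roughly 3.7% -- the matrix is ~96% empty
```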

I want to know how well each person did as well as how difficult each question was.

I’m using the second mixture model (model2) from here: example-models/TwentyQuestions_Stan.R at master · stan-dev/example-models · GitHub. While I’ve gotten it running (using rstan), it crashes the R kernel.

My question is: what are the practical computational limitations of Stan? Is it practical to scale resources to accomplish this, or should I give up on this approach?

You should be able to dramatically reduce the dataset size with an appropriately indexed “long” data structure that omits rows for person/question combinations that don’t exist. That is, instead of using the row and column indices to keep track of the person and question, use a 1,186,766-row array with columns for person-ID, question-ID, and the Bernoulli outcome. This is a dataset size that is well within Stan’s wheelhouse, though it might be slow. Also, you should be able to get a substantial efficiency gain compared to the example model by vectorizing the code.

2 Likes

Thank you. I will look for long-data and vectorized examples for Stan. If you have any that would be good for me to work off of, please share them as well :-).

As an example on a smaller scale, let’s say you assessed 10 people with up to 5 questions. The ‘wide’ data would look like:

``````> library(tidyverse)
> set.seed(2021)
>
> dat_wide = structure(sample(c(0,1,NA),50,replace = T),dim=c(10,5))
> dat_wide
      [,1] [,2] [,3] [,4] [,5]
 [1,]   NA    0    1    0   NA
 [2,]    1   NA    1   NA   NA
 [3,]    1   NA    0    1    1
 [4,]    1    0    1    1    0
 [5,]   NA   NA   NA   NA    1
 [6,]    1   NA    1   NA   NA
 [7,]    1    1   NA    1    1
 [8,]   NA   NA    0    1   NA
 [9,]    1    0    1    1    0
[10,]    1   NA    0    1    0
``````

Given that there are several person-question combinations that have no responses, it’s computationally wasteful to be estimating the `p`, `q`, and `theta` parameters for these combinations. So the natural choice is to not do that (although this is only appropriate when these parameters are independent between person-item combinations).
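The saving is easy to quantify. In the toy example above, 18 of the 50 cells are `NA`, so a dense layout would evaluate 50 likelihood terms where only 32 are needed; at the scale of the original dataset the gap is enormous. A quick illustration (plain arithmetic, using the numbers already stated in the thread):

```python
# Likelihood terms evaluated under each layout, toy example first.
n_people, n_questions = 10, 5
n_observed = 32  # responses remaining once the NAs are dropped

dense_terms = n_people * n_questions  # 'wide' layout: one term per cell
long_terms = n_observed               # 'long' layout: one term per response
print(dense_terms, long_terms)        # 50 32

# Same comparison at the scale of the original dataset:
print(215_032 * 149, 1_186_766)       # 32039768 1186766 -- a ~27x reduction
```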

To reduce this to a ‘long’ format, you have one row per response with unused combinations omitted:

``````> dat_long =
+     dat_wide %>%
+     data.frame() %>%
+     mutate(ID = 1:nrow(.)) %>%
+     pivot_longer(X1:X5, values_drop_na=T,
+                  names_to = "Q", values_to="k") %>%
+     mutate(Q = as.integer(factor(Q)))
>
> dat_long
# A tibble: 32 x 3
      ID     Q     k
   <int> <int> <dbl>
 1     1     2     0
 2     1     3     1
 3     1     4     0
 4     2     1     1
 5     2     3     1
 6     3     1     1
 7     3     3     0
 8     3     4     1
 9     3     5     1
10     4     1     1
# … with 22 more rows
``````

The corresponding Stan model for this construction is then:

``````model2 <- "
data {
  int<lower=0> M;     // Total number of responses
  int<lower=0> N_ID;  // Total number of individuals
  int<lower=0> N_Q;   // Total number of questions
  int ID[M];
  int Q[M];
  int k[M];
}
parameters {
  vector<lower=0,upper=1>[N_ID] p;
  vector<lower=0,upper=1>[N_Q] q;
}
model {
  // Priors for people and questions
  p ~ beta(1, 1);
  q ~ beta(1, 1);

  k ~ bernoulli(p[ID] .* q[Q]);
}
"
``````
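To see what the vectorized statement `k ~ bernoulli(p[ID] .* q[Q])` computes, here is a NumPy sketch with made-up toy values (note NumPy indexing is 0-based where Stan's is 1-based; fancy indexing plays the role of Stan's `p[ID]`):

```python
import numpy as np

# Toy values standing in for the model's parameters and data.
p = np.array([0.9, 0.5, 0.7])  # per-person success rates
q = np.array([0.8, 0.3])       # per-question success rates
ID = np.array([0, 0, 1, 2])    # person index of each response (0-based)
Q = np.array([0, 1, 0, 1])     # question index of each response
k = np.array([1, 0, 1, 1])     # observed Bernoulli outcomes

# Vectorized: success probability of every response at once,
# mirroring p[ID] .* q[Q] in the Stan model.
theta = p[ID] * q[Q]
log_lik_vec = np.sum(k * np.log(theta) + (1 - k) * np.log1p(-theta))

# Loop version for comparison -- same result, term by term.
log_lik_loop = sum(
    k[m] * np.log(p[ID[m]] * q[Q[m]])
    + (1 - k[m]) * np.log1p(-p[ID[m]] * q[Q[m]])
    for m in range(len(k))
)
assert np.isclose(log_lik_vec, log_lik_loop)
```

The vectorized form evaluates one expression over the whole response array, which is what lets Stan share work in the autodiff graph instead of building it response by response.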

Which can be called with the above data via:

``````dat_list = list(
  M = nrow(dat_long),
  N_ID = length(unique(dat_long$ID)),
  N_Q = length(unique(dat_long$Q)),
  ID = dat_long$ID,
  Q = dat_long$Q,
  k = dat_long$k)

mod = stan(model_code = model2,
           data = dat_list)
``````
6 Likes

As for the implementation, I would recommend you switch from RStan to cmdstanr: R Interface to CmdStan • cmdstanr

RStan pre-allocates the memory for all iterations in RAM, which can cause crashes with larger models. cmdstanr instead writes the iterations directly to disk, so you only need enough RAM for the operations within each iteration (generally speaking). Additionally, cmdstanr uses a much more recent version of Stan, which includes support for GPU acceleration of the `bernoulli` distribution (as of the upcoming release), which should provide significant speed-ups for your model.
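A rough illustration of why pre-allocating all draws in RAM hurts at this scale (back-of-the-envelope arithmetic, assuming the default 4 chains of 2,000 iterations and 8-byte doubles):

```python
# Rough RAM estimate for holding every draw of this model in memory.
n_params = 215_032 + 149  # one p per person plus one q per question
n_iter = 2_000            # default iterations per chain (warmup + sampling)
n_chains = 4              # default number of chains
bytes_per_value = 8       # double precision

total_bytes = n_params * n_iter * n_chains * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB")  # 13.8 GB just for the raw draws
```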

6 Likes

Wonderful! Thank you!

1 Like

I ended up having to run on a slightly larger dataset than I’d hoped. 180 hours in, I’m about 1,000 iterations in. Out of curiosity, is it possible to rewrite this to utilize within-chain parallelization? (I’m trying to follow the documentation, but I just lack so much experience with Stan that I’m not following it well.)

Assuming the model is the same that I posted above, it’s very easy to add parallelism:

``````functions {
  real partial_sum_lupmf(int[] slice_k,
                         int start, int end,
                         int[] ID, int[] Q,
                         vector p, vector q) {
    return bernoulli_logit_lupmf(slice_k | p[ID[start:end]] .* q[Q[start:end]]);
  }
}
data {
  int<lower=0> M;     // Total number of responses
  int<lower=0> N_ID;  // Total number of individuals
  int<lower=0> N_Q;   // Total number of questions
  int ID[M];
  int Q[M];
  int k[M];
}
parameters {
  vector<lower=0,upper=1>[N_ID] p;
  vector<lower=0,upper=1>[N_Q] q;
}
model {
  int grainsize = 1;

  // Priors for people and questions
  p ~ beta(1, 1);
  q ~ beta(1, 1);

  target += reduce_sum(partial_sum_lupmf, k, grainsize,
                       ID, Q, p, q);
}
``````
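The mechanics of `reduce_sum` can be mimicked in plain Python: the data are split into slices of roughly `grainsize` terms, a partial sum is computed for each slice (in Stan, potentially on separate threads), and the partial sums are combined. A toy sketch of that divide-and-combine pattern (the function names are illustrative, not Stan's API):

```python
# Toy sketch of reduce_sum's divide-and-combine pattern.
def partial_sum(slice_terms):
    # Stands in for partial_sum_lpmf: the log-likelihood of one slice.
    return sum(slice_terms)

def reduce_sum(terms, grainsize):
    # Split the terms into contiguous slices of up to `grainsize` items.
    slices = [terms[i:i + grainsize] for i in range(0, len(terms), grainsize)]
    # In Stan each slice can run on its own thread; here we map sequentially.
    return sum(partial_sum(s) for s in slices)

terms = [0.5, -1.25, 2.0, 3.0, -0.75]
# Because addition is associative, the sliced sum equals the plain sum.
assert reduce_sum(terms, grainsize=2) == sum(terms)
```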

Obviously completely untested, so make sure you do some tests on smaller subsets of your data to check that the results are sensible (and don’t error).

3 Likes

Thanks for the help. My first shot with the larger dataset errored out after 10 days. (No clue why the one file was missing, though it was disappointing to have lost the whole run rather than just the chain.)

Going to rerun with 5 cores per chain and see if I can get it redone mid-week. (I assume it should be `bernoulli_logit_lpmf` rather than `lupmf`.)

Have you already closed the R session? If not, then see what

``````tempdir()
``````

returns and go check that folder for CSV files.

1 Like

I actually went into that folder that’s listed. It has files for 3 of the chains but not the 4th. No clue why.

Ah, but I see you are suggesting something else. I actually started a new run with the parallelized model, but will give it a shot when/if this run fails.

I assume it should be bernoulli_logit_lpmf rather than lupmf.

No, the `lupmf` was intentional. The `lupmf` suffix drops normalising constants (like `~` sampling statements do) and so can be faster for some distributions.
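A small illustration of what dropping normalising constants means, using the normal density, where the dropped term is easy to see (a sketch of the idea, not Stan's implementation):

```python
import math

def normal_lpdf(y, mu, sigma):
    # Fully normalised log density of Normal(mu, sigma) at y.
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def normal_lupdf(y, mu, sigma):
    # Unnormalised: drops -0.5*log(2*pi), a term that never depends on
    # data or parameters, so the posterior is unchanged up to a constant.
    return -math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

# The two differ only by the same constant everywhere, so gradients
# with respect to mu and sigma -- all that MCMC needs -- are identical.
diff1 = normal_lpdf(1.0, 0.0, 1.0) - normal_lupdf(1.0, 0.0, 1.0)
diff2 = normal_lpdf(5.0, 2.0, 3.0) - normal_lupdf(5.0, 2.0, 3.0)
assert math.isclose(diff1, diff2)
assert math.isclose(diff1, -0.5 * math.log(2 * math.pi))
```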

1 Like

Hmm, I got this error when using `lupmf`:
`partial_sum_lupmf is an invalid user-defined function name. User-defined probability mass and density functions must be defined as normalized (function names should end with _lpdf/_lpmf not _lupdf/_lupmf).`

(I also notice now the error was on `partial_sum_lupmf`, not `bernoulli_logit_lupmf` as I indicated above.)

Ah yes, that was a typo on my part; it should have been `partial_sum_lpmf`.

1 Like

Is it possible refreshes don’t work the same way with the parallelized model? I’ve run it for 220 CPU hours per thread (which is about how many CPU hours it took non-parallelized), and while the non-parallelized run updated at the expected refresh rate, I haven’t gotten a refresh from this model (though I can see it’s still going full-speed ahead based on CPU utilization).

``````Chain 4 Iteration: 100 / 2000 [ 5%] (Warmup)
Chain 1 finished unexpectedly!
Chain 2 finished unexpectedly!
Chain 3 finished unexpectedly!
Chain 4 finished unexpectedly!
All chains finished unexpectedly!
Use read_cmdstan_csv() to read the results of the failed chains.
No chains finished successfully. Unable to retrieve the fit.
``````

I suppose parallelizing it didn’t work out. Oh well.

(I suspect this was because the model etc. was on a network drive, which became disconnected for some reason.)