# Stan scaling limitations

What is a practical limit for the amount of data Stan is able to scale to?

I have a population of people, each taking a set of questions. The data are a set of 1,186,766 measurements of 149 questions from 215,032 people.

Each person took between one and ten questions, and each question was given to a long-tailed range of 27 to
95,753 people. As such, I’m feeding in a relatively sparse matrix of 32,039,768 items.
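As a quick sanity check on those figures (back-of-the-envelope arithmetic, not from the original post):

```python
# Check the dataset dimensions described above.
n_people = 215_032
n_questions = 149
n_responses = 1_186_766

# Size of the full (dense) person-by-question matrix:
n_cells = n_people * n_questions
print(n_cells)           # 32039768 -- matches the matrix size quoted above

# Fraction of cells that actually hold a response:
density = n_responses / n_cells
print(f"{density:.1%}")  # roughly 3.7% -- the matrix is ~96% empty
```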

I want to know how well each person did as well as how difficult each question was.

I’m using the second mixture model (model2) from here: example-models/TwentyQuestions_Stan.R at master · stan-dev/example-models · GitHub. While I’ve gotten it running (using rstan), it crashes the R kernel.

My question is: what are the practical computational limitations of Stan? Is it practical to scale resources to accomplish this, or should I give up on this approach?

You should be able to dramatically reduce the dataset size with an appropriately indexed “long” data structure that omits rows for person/question combinations that don’t exist. That is, instead of using the row and column indices to keep track of the person and question, use a 1,186,766-row array with columns for person-ID, question-ID, and the Bernoulli outcome. This is a dataset size that is well within Stan’s wheelhouse, though it might be slow. Also, you should be able to get a substantial efficiency gain compared to the example model by vectorizing the code.

2 Likes

Thank you. I will look for long-data and vectorized examples for Stan. If you have any that would be good for me to work off of, please share them as well :-).

As an example on a smaller scale, let’s say you assessed 10 people with up to 5 questions. The ‘wide’ data would look like:

``````> library(tidyverse)
> set.seed(2021)
>
> dat_wide = structure(sample(c(0,1,NA),50,replace = T),dim=c(10,5))
> dat_wide
      [,1] [,2] [,3] [,4] [,5]
 [1,]   NA    0    1    0   NA
 [2,]    1   NA    1   NA   NA
 [3,]    1   NA    0    1    1
 [4,]    1    0    1    1    0
 [5,]   NA   NA   NA   NA    1
 [6,]    1   NA    1   NA   NA
 [7,]    1    1   NA    1    1
 [8,]   NA   NA    0    1   NA
 [9,]    1    0    1    1    0
[10,]    1   NA    0    1    0
``````

Given that there are several person-question combinations that have no responses, it’s computationally wasteful to be estimating the `p`, `q`, and `theta` parameters for these combinations. So the natural choice is to not do that (although this is only appropriate when these parameters are independent between person-item combinations).
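The saving is easy to quantify. In the toy example above, 18 of the 50 cells are `NA`, so a dense layout would evaluate 50 likelihood terms where only 32 are needed; at the scale of the original dataset the gap is enormous. A quick illustration (plain arithmetic, using the numbers already stated in the thread):

```python
# Likelihood terms evaluated under each layout, toy example first.
n_people, n_questions = 10, 5
n_observed = 32  # responses remaining once the NAs are dropped

dense_terms = n_people * n_questions  # 'wide' layout: one term per cell
long_terms = n_observed               # 'long' layout: one term per response
print(dense_terms, long_terms)        # 50 32

# Same comparison at the scale of the original dataset:
print(215_032 * 149, 1_186_766)       # 32039768 1186766 -- a ~27x reduction
```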

To reduce this to a ‘long’ format, you have one row per response with unused combinations omitted:

``````> dat_long =
+     dat_wide %>%
+     data.frame() %>%
+     mutate(ID = 1:nrow(.)) %>%
+     pivot_longer(X1:X5, values_drop_na=T,
+                  names_to = "Q", values_to="k") %>%
+     mutate(Q = as.integer(factor(Q)))
>
> dat_long
# A tibble: 32 x 3
      ID     Q     k
   <int> <int> <dbl>
 1     1     2     0
 2     1     3     1
 3     1     4     0
 4     2     1     1
 5     2     3     1
 6     3     1     1
 7     3     3     0
 8     3     4     1
 9     3     5     1
10     4     1     1
# … with 22 more rows
``````

The corresponding Stan model for this construction is then:

``````model2 <- "
data {
  int<lower=0> M;     // Total number of responses
  int<lower=0> N_ID;  // Total number of individuals
  int<lower=0> N_Q;   // Total number of questions
  int ID[M];
  int Q[M];
  int k[M];
}
parameters {
  vector<lower=0,upper=1>[N_ID] p;
  vector<lower=0,upper=1>[N_Q] q;
}
model {
  // Priors for people and questions
  p ~ beta(1, 1);
  q ~ beta(1, 1);

  k ~ bernoulli(p[ID] .* q[Q]);
}
"
``````
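To see what the vectorized statement `k ~ bernoulli(p[ID] .* q[Q])` computes, here is a NumPy sketch with made-up toy values (note NumPy indexing is 0-based where Stan's is 1-based; fancy indexing plays the role of Stan's `p[ID]`):

```python
import numpy as np

# Toy values standing in for the model's parameters and data.
p = np.array([0.9, 0.5, 0.7])  # per-person success rates
q = np.array([0.8, 0.3])       # per-question success rates
ID = np.array([0, 0, 1, 2])    # person index of each response (0-based)
Q = np.array([0, 1, 0, 1])     # question index of each response
k = np.array([1, 0, 1, 1])     # observed Bernoulli outcomes

# Vectorized: success probability of every response at once,
# mirroring p[ID] .* q[Q] in the Stan model.
theta = p[ID] * q[Q]
log_lik_vec = np.sum(k * np.log(theta) + (1 - k) * np.log1p(-theta))

# Loop version for comparison -- same result, term by term.
log_lik_loop = sum(
    k[m] * np.log(p[ID[m]] * q[Q[m]])
    + (1 - k[m]) * np.log1p(-p[ID[m]] * q[Q[m]])
    for m in range(len(k))
)
assert np.isclose(log_lik_vec, log_lik_loop)
```

The vectorized form evaluates one expression over the whole response array, which is what lets Stan share work in the autodiff graph instead of building it response by response.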

Which can be called with the above data via:

``````dat_list = list(
  M = nrow(dat_long),
  N_ID = length(unique(dat_long$ID)),
  N_Q = length(unique(dat_long$Q)),
  ID = dat_long$ID,
  Q = dat_long$Q,
  k = dat_long$k)

mod = stan(model_code = model2,
           data = dat_list)
``````
6 Likes

As for the implementation, I would recommend you switch from RStan to cmdstanr: R Interface to CmdStan • cmdstanr

RStan pre-allocates the memory for all iterations in RAM, which can cause crashes with larger models. cmdstanr instead writes the iterations directly to disk, so you only need enough RAM for the operations within each iteration (generally speaking). Additionally, cmdstanr uses a much more recent version of Stan, which includes support for GPU acceleration of the `bernoulli` distribution (as of the upcoming release), which should provide significant speed-ups for your model.
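A rough illustration of why pre-allocating all draws in RAM hurts at this scale (back-of-the-envelope arithmetic, assuming the default 4 chains of 2,000 iterations and 8-byte doubles):

```python
# Rough RAM estimate for holding every draw of this model in memory.
n_params = 215_032 + 149  # one p per person plus one q per question
n_iter = 2_000            # default iterations per chain (warmup + sampling)
n_chains = 4              # default number of chains
bytes_per_value = 8       # double precision

total_bytes = n_params * n_iter * n_chains * bytes_per_value
print(f"{total_bytes / 1e9:.1f} GB")  # 13.8 GB just for the raw draws
```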

6 Likes

Wonderful! Thank you!

1 Like

I ended up having to run on a slightly larger dataset than I’d hoped. 180 hours in, I’m about 1,000 iterations in. Out of curiosity, is it possible to rewrite this to utilize within-chain parallelization? (I’m trying to follow the documentation, but I just lack so much experience with Stan that I’m not following it well.)

Assuming the model is the same that I posted above, it’s very easy to add parallelism:

``````functions {
  real partial_sum_lupmf(int[] slice_k,
                         int start, int end,
                         int[] ID, int[] Q,
                         vector p, vector q) {
    return bernoulli_logit_lupmf(slice_k | p[ID[start:end]] .* q[Q[start:end]]);
  }
}
data {
  int<lower=0> M;     // Total number of responses
  int<lower=0> N_ID;  // Total number of individuals
  int<lower=0> N_Q;   // Total number of questions
  int ID[M];
  int Q[M];
  int k[M];
}
parameters {
  vector<lower=0,upper=1>[N_ID] p;
  vector<lower=0,upper=1>[N_Q] q;
}
model {
  int grainsize = 1;

  // Priors for people and questions
  p ~ beta(1, 1);
  q ~ beta(1, 1);

  target += reduce_sum(partial_sum_lupmf, k, grainsize,
                       ID, Q, p, q);
}
``````
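The mechanics of `reduce_sum` can be mimicked in plain Python: the data are split into slices of roughly `grainsize` terms, a partial sum is computed for each slice (in Stan, potentially on separate threads), and the partial sums are combined. A toy sketch of that divide-and-combine pattern (the function names are illustrative, not Stan's API):

```python
# Toy sketch of reduce_sum's divide-and-combine pattern.
def partial_sum(slice_terms):
    # Stands in for partial_sum_lpmf: the log-likelihood of one slice.
    return sum(slice_terms)

def reduce_sum(terms, grainsize):
    # Split the terms into contiguous slices of up to `grainsize` items.
    slices = [terms[i:i + grainsize] for i in range(0, len(terms), grainsize)]
    # In Stan each slice can run on its own thread; here we map sequentially.
    return sum(partial_sum(s) for s in slices)

terms = [0.5, -1.25, 2.0, 3.0, -0.75]
# Because addition is associative, the sliced sum equals the plain sum.
assert reduce_sum(terms, grainsize=2) == sum(terms)
```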

Obviously completely untested, so make sure you do some tests on smaller subsets of your data to check that the results are sensible (and don’t error).

3 Likes

Thanks for the help. My first shot with the larger dataset errored out after 10 days. (No clue why the one file was missing, though it was disappointing to have lost the whole run rather than just the chain.)

Going to rerun with 5 cores per chain and see if I can get it redone mid-week. (I assume it should be `bernoulli_logit_lpmf` rather than `lupmf`.)

Have you already closed the R session? If not, then see what

``````tempdir()
``````

returns and go check that folder for CSV files.

1 Like

I actually went into that folder that’s listed. It has files for 3 of the chains but not the 4th. No clue why.

Ah, but I see you are suggesting something else. I actually started a new run with the parallelized model, but will give it a shot when/if this run fails.

I assume it should be bernoulli_logit_lpmf rather than lupmf.

No, the `lupmf` was intentional. The `lupmf` suffix drops normalising constants (like `~` sampling statements do) and so can be faster for some distributions.
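A small illustration of what dropping normalising constants means, using the normal density, where the dropped term is easy to see (a sketch of the idea, not Stan's implementation):

```python
import math

def normal_lpdf(y, mu, sigma):
    # Fully normalised log density of Normal(mu, sigma) at y.
    return (-0.5 * math.log(2 * math.pi) - math.log(sigma)
            - 0.5 * ((y - mu) / sigma) ** 2)

def normal_lupdf(y, mu, sigma):
    # Unnormalised: drops -0.5*log(2*pi), a term that never depends on
    # data or parameters, so the posterior is unchanged up to a constant.
    return -math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

# The two differ only by the same constant everywhere, so gradients
# with respect to mu and sigma -- all that MCMC needs -- are identical.
diff1 = normal_lpdf(1.0, 0.0, 1.0) - normal_lupdf(1.0, 0.0, 1.0)
diff2 = normal_lpdf(5.0, 2.0, 3.0) - normal_lupdf(5.0, 2.0, 3.0)
assert math.isclose(diff1, diff2)
assert math.isclose(diff1, -0.5 * math.log(2 * math.pi))
```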

1 Like

Hmm, I got this error when using `lupmf`:
`partial_sum_lupmf is an invalid user-defined function name. User-defined probability mass and density functions must be defined as normalized (function names should end with _lpdf/_lpmf not _lupdf/_lupmf).`

(I also notice now the error was on `partial_sum_lupmf`, not `bernoulli_logit_lupmf` as I indicated above.)

Ah yes, that was a typo on my part; it should have been `partial_sum_lpmf`.

1 Like

Is it possible refreshes don’t work the same way with the parallelized model? I’ve run it for 220 CPU hours per thread (which is about how many CPU hours it took non-parallelized), and while the non-parallelized run updated at the expected refresh rate, I haven’t gotten a refresh from this model (though I can see it’s still going full-speed ahead based on CPU utilization).

``````Chain 4 Iteration: 100 / 2000 [ 5%] (Warmup)
Chain 1 finished unexpectedly!
Chain 2 finished unexpectedly!
Chain 3 finished unexpectedly!
Chain 4 finished unexpectedly!
All chains finished unexpectedly!
Use read_cmdstan_csv() to read the results of the failed chains.
No chains finished successfully. Unable to retrieve the fit.
``````

I suppose parallelizing it didn’t work out. Oh well.

(I suspect this was because the model etc. was on a network drive, which became disconnected for some reason.)