Running a Stan model many times in a for loop

I am running a simple Stan model many times using a for loop (using a different dataset in each iteration of the for loop).

For some reason the code is getting stuck (indefinitely) at random iterations during warmup. I am sure I had this issue before and I think the problem was resolved by asking Stan not to save the .rds output but I cannot remember how to do this. Can someone remind me how I prevent Stan saving the .rds files?

Are there any other suggestions that come to mind on why this is occurring and what can be done?

No, but it’d be a lot easier if you showed us the code you were using to run Stan in a loop. For example, are you running the loop iterations in parallel or sequentially?

Which interface are you using, RStan or CmdStanR? I don’t think either of them automatically saves .rds files.

Are you sure that this is a problem due to looping? I would suggest generating a random seed each iteration and passing that to Stan. Then monitor when you get stuck and see if the game seed causes you to get stuck when used outside of a loop.

The problem is that HMC can find its way to high curvature areas and then it takes a long time to get out again—this is the right behavior to get the correct long-run averages, but on the ground it looks like the chains are hung on a particular value. Usually, the only solution in these cases is to reparameterize.

1 Like

@Bob_Carpenter is right that there aren’t any RDS files saved by default. Is it possible you’re referring to something else?

If you’re going to run many times in a loop I would lean towards using CmdStanR. It’s doable with RStan but it’s possible that keeping everything in memory in R is causing the issue (could be something else entirely though). CmdStanR will save (temporary, unless otherwise specified) CSV files each time you run it, but it won’t read them into memory in R until you use a method that requires them.

If you definitely want to use RStan I recommend not using the stan() function for this and instead using stan_model() to compile the model and then sampling() to run MCMC repeatedly with different datasets. Something like this:

model <- stan_model(file)
fits <- list()
for (n in 1:seq_along(datasets)) {
  fits[[n]] <- sampling(model, data = datasets[[n]]) 
}

Although perhaps you’re already doing that. If the issue is running out of space when using RStan you could try Looping over many fits fills hard drive - #4 by sweenejo.

And, like @Bob_Carpenter said, it could be that in this particular case the issue isn’t the looping but that you’re running into some other issue, e.g. with specific seeds.

1 Like

I don’t think the problem is the loop anymore. The Stan file is shown below

data {
  int<lower=0> N;
  int<lower=0> n;
  int<lower=0> n_min;
  real<lower=0> a;
  real<lower=0> b;
}
parameters{
  real<lower=0,upper=1> lambda;
}
model{
  n ~ binomial(N, lambda);
  target+= - binomial_lccdf(n_min| N, lambda);
  //priors
  lambda ~ beta(a,b);
  print(lambda);
}

I tried running the Stan model with the following data and code:

a0=9; b0=2
set.seed(1)
lambda_init = rbeta(1,a0,b0)

model_trunc = stan_model(file="binomial_trunc.stan")
stan_list_trunc = list(N=25, n= 19, n_min = 17,a=9,b=2)
run_model_trunc = sampling(model_trunc, data = stan_list_trunc,iter =10000, chains = 1, init = list(list(lambda = lambda_init)), seed = 1)
  

The problem seems to arise when the chain gets stuck around 0 and it cannot seem to escape.

Chain 1: 0.744908

Chain 1: 0.963514

Chain 1: 1e-15

Chain 1: 1e-15

Chain 1: 0.999999

Chain 1: 1e-15

Chain 1: 1e-15

Chain 1: 1e-15

Chain 1: 1e-15

But the code runs fine on my laptop but not on the high performance computer that I am trying to run the code on. Strange.

I think your LCCDF needs to be n_min - 1, because the CCDF is the probability that a value is strictly greater, i.e., \text{CCDF} = 1 - \Pr[n \leq n^\text{min}] = \Pr[n > n^\text{min}]. We deal with that finagling for you if you use the truncation notation,

n ~ binomial(N, lambda) T[n_min, ];

For example, for the truncated distribution statement

4 ~ binomial(5, lambda) T[2, ];

here’s the C++ that gets generated on the back end:

        lp_accum__.add(stan::math::binomial_lpmf<propto__>(4, 5, lambda));
        current_statement__ = 3;
        if (stan::math::logical_lt(4, 2)) {
          current_statement__ = 3;
          lp_accum__.add(stan::math::negative_infinity());
        } else {
          current_statement__ = 3;
          lp_accum__.add(-(stan::math::binomial_lccdf((2 - 1), 5, lambda)));
        }

lp_accum__ is the target density accumulator. First, it adds the binomial part. Second it checks consistency. Then it uses 2 - 1 rather than 2 in the binomial LCCDF call.

With this model, I’d suggest constraining n to have <lower=n_min> to make sure the data’s actually consistent with what you have. Also, it’s going to be challenging to fit a model like this with a single data point.

1 Like

Good spot. It should have been n_min - 1.

Just to confirm. My data can take the values [n_min, ..., n]. I.e., n_min is an allowed value. Does the truncation T[n_min,] mean that it will sample n > n_{min} or n \geq n_{min}?

The truncation is inclusive, so T[min, ] for a discrete distribution means the value will be greater than or equal to min. That’s why it does the - 1 in the generated code.