How to run a Stan program recursively?

Hi,

I have the following program in Stan:

data {
int<lower=0> N;
int<lower=0> y[N];
real my;
}
parameters {
real<lower=0> omega;
real<lower=0, upper=1> alpha;
real<lower=0, upper=1> beta;
}
model {
real theta[N];
theta[1] = omega+alphamy+betamy;
y[1] ~ poisson(theta[1]);
for (n in 2:N){
theta[n] = omega+alphay[n-1]+betatheta[n-1];
y[n] ~ poisson(theta[n]);
}
}

which runs just fine in cmdstan, rstan and cmdstanr. I would like to run it from “n in 2:N-100” and save the posterior, then from “n in 2:N-99” and save the posterior, …, and finally from “n in 2:N” and save the posterior, by having everything in one program and calling the same dataset of length N. Is this possible? Currently the only way I manage to do this is to run the above program separately for each “n in 2:N-i”, i=0,1,…,100 and also create separate data files of the correct length for each case i. This if of course time consuming if N is large.

Thank you! Adriana

1 Like

Hi Adriana,
Unless I’m missing something, almost none of the computation is redundant between the different fits that you’re looking for, so we’re not looking for any super-clever speedup tricks in the Stan code itself. The main trick is to come up with a way to format the data so that the dataset changes minimally between fits, and so that the model definition does not change (and therefore does not need recompilation). And you’re pretty much there already. The model file requires no changes at all. All you need to do is iteratively update the data file, and the only change you need to make is in the value of N. Thus, you can do something like:

my_model <- cmdstanr::cmdstan_model(...)
max_N <- ... #whatever the highest value of N is
for(i in 0:100){
  my_data_list$N <- max_N - i
  my_model$sample(data = my_data_list)
}

Cheers
Jacob

Edit: That said, you can also get speedup by vectorizing the loop in your code, regardless of this issue of needing to run the model multiple times.

it would be nice if stan could handle this sort of thing but I don’t think it does. The only way I’ve achieved something like this is by creating an extra parameter, which I convert to an integer and use for indexing in some form, and ignore the gradient/s from that element/s. This only works if you’ve also got your own optimizer / sampler going though, at which point stan is ‘just’ a log prob and gradient calculator.
edit: maybe I’m wrong, I thought you wanted to do it without updating the data file (which then also requires creating a new stan object for fitting).

You could make your parameters into a vector of 100 each and then fit each parameter only on the corresponding data. But this is the worst possible solution.

You can probably save some/much of the time by reusing the state of the sampler after warmup (if the data are identical / subsets).

Thank you Jacob. But it doesn’t work. I get the error message:

mismatch in dimension declared and found in context; processing stage=data initialization; variable name=y1; position=0; dims declared=(352); dims found=(354).

So here max_N=354 and I have (i in 0:2).

Adriana

1 Like

And this is my data with max_N=354:

my_data_list ← list(N=354,
y = c(0,2,3,5,2,1,1,5,3,8,18,10,14,12,12,9,28,31,47,55,59,77,94,144,157,151,178,153,205,228,285,319,316,241,274,202,216,320,338,318,304,299,260,268,325,314,474,373,351,365,418,400,349,563,450,591,414,422,510,489,714,575,620,385,384,351,441,414,458,352,245,297,425,375,405,355,336,259,244,261,388,295,382,267,235,205,254,231,254,274,200,165,177,191,197,206,179,155,121,151,187,173,226,197,247,193,159,189,216,230,257,202,163,116,156,172,156,143,152,149,98,133,112,131,104,107,77,90,120,103,126,97,92,85,98,135,133,119,86,100,97,101,115,100,103,106,111,88,91,143,150,163,140,99,81,85,144,146,157,177,120,104,97,221,185,171,179,161,108,79,126,122,140,165,129,89,93,150,135,119,192,161,170,143,192,263,373,335,302,312,305,524,380,393,461,384,313,297,386,419,544,569,557,559,665,615,714,728,922,801,822,1022,1449,1381,1786,2131,2202,1656,1677,2266,2414,2658,2425,1868,1917,1622,2743,2531,2773,2484,2432,2366,2219,3829,3565,3493,3301,2975,2248,2149,3850,3580,3578,3459,3467,2628,2820,4667,3948,3505,3586,3634,2832,2890,4491,3773,3704,3200,3076,2503,2093,3276,3059,2584,2090,1938,1366,1350,2058,1761,1605,1396,1315,1094,1038,1705,1485,1338,1151,1052,877,864,1403,1344,1338,1219,1233,950,1027,1718,1491,1228,1173,1450,1017,1413,1980,1867,1629,1252,590,1659,1958,2121,3221,2946,2165,1477,2736,2429,3467,2814,2576,2441,2094,1852,1746,2864,2512,2231,1201,1961,1694,1643,2526,2268,2100,1823,1657,1338,1326,1973,1747,1817,1595,1488,1231,1265,1709,1028,1747,1564,1369,1122,1134,1383,1308,1239,1216,1185,900,954,1375,1208),
my = 973.4915254)

Yup, you’re right, I overlooked that the Stan program uses N to declare the dimension of y. To avoid this issue, try the following Stan program and R script:

data {
  int<lower=0> N;
  int<lower=0> n_max;
  int<lower=0> y[n_max];
  real my;
}
parameters {
  real<lower=0> omega;
  real<lower=0, upper=1> alpha;
  real<lower=0, upper=1> beta;
}
model {
  real theta[N];
  theta[1] = omega+alpha*my+beta*my;
  y[1] ~ poisson(theta[1]);
  for (n in 2:N){
    theta[n] = omega+alpha*y[n-1]+beta*theta[n-1];
    y[n] ~ poisson(theta[n]);
  }
}
library(cmdstanr)  
my_model <- cmdstan_model(...)
my_data_list <-list(N=NA,
                    n_max = 354,
                    y = c(0,2,3,5,2,1,1,5,3,8,18,10,14,12,12,9,28,31,47,55,59,77,94,144,157,151,178,153,205,228,285,319,316,241,274,202,216,320,338,318,304,299,260,268,325,314,474,373,351,365,418,400,349,563,450,591,414,422,510,489,714,575,620,385,384,351,441,414,458,352,245,297,425,375,405,355,336,259,244,261,388,295,382,267,235,205,254,231,254,274,200,165,177,191,197,206,179,155,121,151,187,173,226,197,247,193,159,189,216,230,257,202,163,116,156,172,156,143,152,149,98,133,112,131,104,107,77,90,120,103,126,97,92,85,98,135,133,119,86,100,97,101,115,100,103,106,111,88,91,143,150,163,140,99,81,85,144,146,157,177,120,104,97,221,185,171,179,161,108,79,126,122,140,165,129,89,93,150,135,119,192,161,170,143,192,263,373,335,302,312,305,524,380,393,461,384,313,297,386,419,544,569,557,559,665,615,714,728,922,801,822,1022,1449,1381,1786,2131,2202,1656,1677,2266,2414,2658,2425,1868,1917,1622,2743,2531,2773,2484,2432,2366,2219,3829,3565,3493,3301,2975,2248,2149,3850,3580,3578,3459,3467,2628,2820,4667,3948,3505,3586,3634,2832,2890,4491,3773,3704,3200,3076,2503,2093,3276,3059,2584,2090,1938,1366,1350,2058,1761,1605,1396,1315,1094,1038,1705,1485,1338,1151,1052,877,864,1403,1344,1338,1219,1233,950,1027,1718,1491,1228,1173,1450,1017,1413,1980,1867,1629,1252,590,1659,1958,2121,3221,2946,2165,1477,2736,2429,3467,2814,2576,2441,2094,1852,1746,2864,2512,2231,1201,1961,1694,1643,2526,2268,2100,1823,1657,1338,1326,1973,1747,1817,1595,1488,1231,1265,1709,1028,1747,1564,1369,1122,1134,1383,1308,1239,1216,1185,900,954,1375,1208),
                    my = 973.4915254)

outputs <- list()
for(i in 0:100){
  my_data_list$N <- my_data_list$n_max - i
  outputs[[i+1]] <- my_model$sample(data = my_data_list)
}

It works! Thank you Jacob!