Running a model on an HPC cluster and would like to save intermediate outputs

Dear Stan users:

I am using RStan to run a relatively complex model on a cluster. I initially set the wall time to 8 hours, but after 8 hours of waiting my job got killed. Looking into the log file, I saw the job was only 40% complete when it exceeded the wall time limit:

Gradient evaluation took 0.05 seconds
1000 transitions using 10 leapfrog steps per transition would take 500 seconds.
Adjust your expectations accordingly!

Iteration: 1 / 500 [ 0%] (Warmup)
Iteration: 1 / 500 [ 0%] (Warmup)
Iteration: 1 / 500 [ 0%] (Warmup)
Iteration: 50 / 500 [ 10%] (Warmup)
Iteration: 50 / 500 [ 10%] (Warmup)
Iteration: 50 / 500 [ 10%] (Warmup)
Iteration: 50 / 500 [ 10%] (Warmup)
Iteration: 100 / 500 [ 20%] (Warmup)
Iteration: 100 / 500 [ 20%] (Warmup)
Iteration: 100 / 500 [ 20%] (Warmup)
Iteration: 100 / 500 [ 20%] (Warmup)
Iteration: 150 / 500 [ 30%] (Warmup)
Iteration: 150 / 500 [ 30%] (Warmup)
Iteration: 150 / 500 [ 30%] (Warmup)
Iteration: 150 / 500 [ 30%] (Warmup)
Iteration: 200 / 500 [ 40%] (Warmup)
Iteration: 200 / 500 [ 40%] (Warmup)
Iteration: 200 / 500 [ 40%] (Warmup)
Iteration: 200 / 500 [ 40%] (Warmup)
=>> PBS: job killed: walltime 28836 exceeded limit 28800

My model runs 2000 iterations with 500 warmup per chain for 4 chains, and I would like to know whether there are any ways that I can
a) save the first 40% of the simulation output, and
b) perhaps speed up the program?

I am very new to HPC and all of this, so any ideas would be greatly appreciated.

I have attached my R script, Stan code and PBS file to this post.

Thanks in advance,
Jaslene

2_comp_full_classical.r (756 Bytes)
full_state_space_true_classical.stan (5.5 KB)
2_comp_full_classical.txt (274 Bytes)

Note that you can set the sample_file parameter and your samples will be streamed to a file, so even when the computation is interrupted, all or most samples will be stored. You can then use read_stan_csv to read this file back into R.

As for the speed: very often speed problems indicate problems with the model. However, anything is hard to debug at this scale. Have you tested the model with small-scale data that you created by simulating from the exact model you have implemented (e.g. draw the parameters from their priors and then generate data from them)? Did the model recover the parameters you drew for the simulation?
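For instance, a minimal generic sketch of the fake-data idea (a toy model, not your actual one): draw a parameter from its prior, simulate data from it, then fit your real model to y_sim and check that the posterior recovers theta. Run it with algorithm = "Fixed_param" in RStan, since there is nothing to sample:

data {
  int<lower=1> N;
}
generated quantities {
  real theta = normal_rng(0, 1);      // draw the parameter from its prior
  vector[N] y_sim;
  for (n in 1:N)
    y_sim[n] = normal_rng(theta, 1);  // simulate observations given theta
}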

Also, there might be room for vectorization. For example, this loop:

for (t in 1:TD) {
  lambda[t, 1] = inv_logit(eta[t]);  // assuming lambda is declared as matrix[TD, 2]
  lambda[t, 2] = 1 - lambda[t, 1];
  mu2[t] = lambda[t, 1] / lambda[t, 2] * (-mu1[t]);
}

should work (not tested) and be faster when written like this:

lambda[, 1] = inv_logit(eta);  // assuming lambda is declared as matrix[TD, 2]
lambda[, 2] = 1 - lambda[, 1];
mu2 = lambda[, 1] ./ lambda[, 2] .* (-mu1);

Note the ./ and .* operators, which correspond to element-wise division and multiplication (with plain * and /, Stan defaults to matrix/dot products and matrix division for vector and matrix operands).

As noted in the Stan manual, you should NOT vectorize the log_mix part, as that would actually be a different computation. Everything else should be vectorized.


Thank you for your reply, I will give it a try.

For setting the sample_file parameter, I am wondering where I should set it. In the parameters block of the Stan model? Or in the call that creates the model object, i.e. here:

stan(file = 'full_state_space_true_classical.stan',
     data = ll, chains = 4,
     iter = 500, control = list(adapt_delta = 0.9), verbose = FALSE, seed = 123456)

Sorry, I don't quite get how to set sample_file to save the intermediate simulated posterior values.

I hope you could elaborate a little bit more. Thanks!

Hi Martin:

Yes, I have tested a smaller model (modelling only 1 day of data) and it works fine: the model recovers the 'true' parameter values that I used to simulate the data and takes only a moderate amount of time to run. Only when I extend to the full model, modelling 1 week of data, does it get very, very slow, and in R it gives messages like:

1000 transitions using 10 leapfrog steps per transition would take 500 seconds.

Sorry, I was not clear: sample_file is a parameter to the stan() and sampling() functions in R (check the RStan docs for more details).

As for the speed: it is possible that your data is just too big. What is the actual N you work with? If I understand correctly, you are modelling a mixture of two AR models with some spline predictors mixed in?

You say it works with one day's worth of data; how about 2 or 3 days? Where do the problems start? It is also possible that the model has some problems when there are multiple days. Are you sure it is identifiable (has only a single maximum in the posterior density)? If the individual chains explore different parts of the parameter space, that is a strong indication that the model is not identifiable.

Apart from vectorization, you may try moving the data to a sparse/long format. I don't thoroughly understand your code, but it seems like you are using a lot of matrices that are mostly zero; e.g. you could have something like:

data {
  int N;
  vector[N] obs;
  int day_of_the_week[N];
  int hour_of_day[N];
 ...
}

This could simplify a lot of your code (I still don’t completely understand it, so this is a guess).

Note that you can index by an array of integers, so you should be able to do things like beta_dayofwk[day_of_the_week] to get the beta for each observation.
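For example, a minimal sketch of that kind of indexing (the names are made up to mirror the suggestion above, not taken from your model):

data {
  int<lower=1> N;
  vector[N] obs;
  int<lower=1, upper=7> day_of_the_week[N];
}
parameters {
  vector[7] beta_dayofwk;
  real<lower=0> sigma;
}
model {
  beta_dayofwk ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // beta_dayofwk[day_of_the_week] is a vector of length N holding the
  // day-of-week coefficient for each observation, so no loop is needed.
  obs ~ normal(beta_dayofwk[day_of_the_week], sigma);
}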

Where you’re doing:

for (t in 1:TD) {
  for (i in 1:N) {
    target += log_mix(lambda[t, 1],
                      normal_lpdf(obs[i, t] | y_overall_mean[t] + mu1[t], sigma[1]),
                      normal_lpdf(obs[i, t] | y_overall_mean[t] + mu2[t], sigma[2]));
  }
}

you’re computing both y_overall_mean[t]+mu1[t] and y_overall_mean[t]+mu2[t] repeatedly (N times); depending on how big N is, you might get a smidge of a speedup by pre-computing those values in the outer loop:

for(t in 1:TD){
  real mean1 = y_overall_mean[t]+mu1[t] ;
  real mean2 = y_overall_mean[t]+mu2[t] ;
  for(i in 1:N){
    target += log_mix(
      lambda[t,1]
      , normal_lpdf( obs[i,t] | mean1 , sigma[1] )
      ,  normal_lpdf( obs[i,t] | mean2 , sigma[2] ) 
    );
  }
}

And there might be a speedup by computing mean1 & mean2 in one go at the outset from the vectors:

vector[TD] mean1 = y_overall_mean + mu1 ;
vector[TD] mean2 = y_overall_mean + mu2 ;
for(t in 1:TD){
  for(i in 1:N){
    target += log_mix(
      lambda[t,1]
      , normal_lpdf( obs[i,t] | mean1[t] , sigma[1] )
      ,  normal_lpdf( obs[i,t] | mean2[t] , sigma[2] ) 
    );
  }
}

As Martin says, you could probably benefit from sparse representations; I have a demo in the context of splines here.

Oh, I take that back, I see now that you’re only using 6 spline bases, which is well below the 100ish that I found was necessary to see a speedup for the sparse version.

… and I think it is better to save all those N terms in a vector and then add that to target at once.
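For concreteness, a minimal sketch of that pattern, reusing the mean1/mean2 names from the snippet above (untested, just to show the shape):

model {
  vector[N * TD] lp;  // collect one log-mixture term per observation
  int k = 1;
  for (t in 1:TD) {
    for (i in 1:N) {
      lp[k] = log_mix(lambda[t, 1],
                      normal_lpdf(obs[i, t] | mean1[t], sigma[1]),
                      normal_lpdf(obs[i, t] | mean2[t], sigma[2]));
      k += 1;
    }
  }
  target += sum(lp);  // add all the terms to the target in one statement
}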

When doing all those tricks, you can watch the startup message from Stan where it tells you how many seconds 1000 transitions of 10 leapfrog steps would take. Comparing that number before and after gives a rough estimate of the speedup relative to where you started.

Oh yeah, I always forget that trick; would you need to do

target += sum(my_collector_vector);

or would:

target += my_collector_vector;

work just as well?

That should be the same thing at the end of the day.

Hi Martin:

Thanks, I found the sample_file parameter in stan() and saved the samples as CSV files, then ran the job on the KATANA HPC cluster for 8 hours. My current model (which hasn't yet implemented all the suggestions mentioned below) only completed 10% of the whole 2000 iterations for 4 chains, but I did see 4 CSV files in FileZilla, so at least I know I can recover partial samples (though not trustworthy at this stage, as it is still in warmup) even if my job doesn't finish.
I will implement all the suggestions, as hopefully they will speed things up.

Thanks all for your suggestions, truly appreciated.

There are three precision matrices in my model that are very sparse: A_hrofday, A_dayofwk and A_hrofweek.

They are known penalty precision matrices that I set for the cyclic B-spline coefficients. Currently, I import these matrices as data and do some manipulation in the transformed data block.

Later, they are used in the model block:
target += multi_normal_prec_lpdf(beta_daily | zeros, A_hrofday);

target += multi_normal_prec_lpdf(beta_weekly | zeros_1, A_dayofwk);

target += multi_normal_prec_lpdf(beta_int | zeros_2, A_hrofwk);

Their sizes are 77, 77 and the last one is 168 * 168.

For the last matrix, A_hrofwk (168*168), there are actually 2520 zeros, so I am wondering whether there are smarter ways of importing these sparse precision matrices into Stan; I think it would cut down the computational time dramatically.

I had a look at GitHub and found a functional spec for a sparse matrix data type.
I would like to do something like

sparse_matrix[M, N, mm, nn] a;
where the number of non-zero entries is K and the types of the above are:

int<lower=0> M;
int<lower=0> N;
int<lower=1, upper=M> mm[K];
int<lower=1, upper=N> nn[K];

However, I don't think it would work well for my last matrix of size 168*168; it would take me quite a long time to specify the positions of the non-zero entries in the matrix, I think.

Actually, the last large matrix is a Kronecker product of 2 sparse matrices, so I am wondering whether there are any smarter ways of writing/importing these matrices into my model.

Thanks in advance,
Jaslene

You should by all means use a Cholesky decomposition representation. So please do, in the transformed data block, an L = cholesky_decompose(A) call and then use the multi_normal_cholesky_lpdf(…, L) call. (Note that multi_normal_cholesky_lpdf expects the Cholesky factor of the covariance matrix, so if A is a precision matrix you would decompose its inverse.) That should give you some speedup.
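For one of your three terms, a minimal sketch of what that could look like (K_day is a placeholder dimension I made up, and since A_hrofday is a precision matrix the sketch inverts it once in transformed data before taking the Cholesky factor; not tested):

data {
  int<lower=1> K_day;                // number of daily spline coefficients (placeholder)
  matrix[K_day, K_day] A_hrofday;    // known penalty precision matrix
}
transformed data {
  // Precision -> covariance, then its Cholesky factor; computed only once since A is data.
  matrix[K_day, K_day] L_hrofday = cholesky_decompose(inverse_spd(A_hrofday));
  vector[K_day] zeros = rep_vector(0, K_day);
}
parameters {
  vector[K_day] beta_daily;
}
model {
  target += multi_normal_cholesky_lpdf(beta_daily | zeros, L_hrofday);
}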

If you can pull variables out of the matrices which are completely independent of the others, then this could be worthwhile to do. Since the covariance matrix is data (as I understood), these operations are not super-critical, I think (but please do use the Cholesky).

Thanks, I will incorporate this into my model. It now takes about 370 seconds, as opposed to 500 seconds, for 1000 transitions using 10 leapfrog steps, which is a promising sign.

That message will tell you the log density and gradient evaluation time, but how long your program will actually take depends on how many leapfrog steps you need. You can often get away with shorter runs than our defaults for well-behaved models.

If you continue with this and want more feedback, it helps to keep reposting the current form of the model.

Amazing to me still how much help there is available for modeling on this list!

Hi Jaslene - would you mind sharing your syntax with the sample_file parameter? Would like to replicate this. Thanks!

For R, the syntax with the sample_file parameter (source: https://cran.r-project.org/web/packages/rstan/rstan.pdf):

fit <- stan(model_code = stanmodelcode, model_name = "example",
            data = dat, iter = 2020, chains = 4, verbose = TRUE,
            sample_file = file.path(tempdir(), 'norm.csv'))
