Run time

Hi,
I am aware this is a vague question and am only looking for ideas and suggestions.

I am using NUTS for sampling spherical harmonics (SH). I am currently playing around with SH degrees 20-30, resulting in 500-1000 model parameters.
I use the default settings: 4 chains with 1000 warm-up and 1000 post-warm-up iterations and a maximum tree depth of 10 (it has never been necessary to increase it). I always get Rhat = 1.

Here’s the question:
Sometimes a run takes 4 hours and sometimes 24 hours (independent of whether I increase or decrease the number of model parameters).
I am not using the computer for anything else, so nothing should interfere with the available resources. The model and data do not change, and the posterior is unimodal.
Do any of you have any suggestions as to why it can vary so much?

The code is included below, but the data is ~400 MB, so I have not included it.

Thanks in advance
Michael

data {
    int<lower=0> L;         // Length of the data array
    int<lower=0> n;         // SH degree
    int<lower=0> mMax;      // Number of model parameters
    vector[L] d;            // Data vector
    matrix[L, mMax] G;      // Design matrix
    vector[mMax] limits;    // Per-parameter prior scale (see model block)
}

parameters {
    vector[mMax] m;
}

model {
    vector[L] res;

    // Prior distribution
    for (i in 1:mMax) {
        m[i] ~ normal(0, limits[i]/2);
    }

    // Likelihood
    res = d - G * m;
    res ~ normal(0, 1);
}
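
In case it helps others reproduce the setup, here is a minimal PyStan sketch of how this model might be compiled and run with the default settings described above. The synthetic data, the file name 'sh_model.stan', and the seed are placeholders, not part of the original post:

import numpy as np
import pystan

# Synthetic stand-in data, only to make the sketch self-contained;
# in the real problem d, G, and limits come from the ~400 MB data set.
L, mMax = 200, 50
rng = np.random.RandomState(0)
G = rng.randn(L, mMax)
limits = np.ones(mMax)
d = G.dot(limits / 4 * rng.randn(mMax)) + rng.randn(L)

data = dict(L=L, n=20, mMax=mMax, d=d, G=G, limits=limits)

# 'sh_model.stan' is a placeholder file name for the program above.
model = pystan.StanModel(file='sh_model.stan')

# The defaults described in the post, spelled out: 4 chains,
# 1000 warm-up + 1000 post-warm-up iterations per chain, max_treedepth = 10.
fit = model.sampling(data=data, chains=4, iter=2000, warmup=1000,
                     control={'max_treedepth': 10}, seed=12345)
print(fit)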

Which interface are you using? RStan? CmdStan? Other?

PyStan
My apologies, I thought I had remembered that.

You could try running the model in CmdStan, or you could try using the package pycmdstan - https://pypi.org/project/pycmdstan/

pycmdstan is pretty cool, but the workflow isn't exactly where we'd like it to be, and we're working on changing that (project cmdstanpy). The docs are a bit sketchy, but the source code is beautiful and easy to read.

Or you could run via CmdStan, which has the least overhead - e.g., run 4 chains in sequence using a simple bash for loop:

for x in {1..4}
do
  echo $x
  ./examples/bernoulli/bernoulli sample data file=examples/bernoulli/bernoulli.data.R random seed=12345 id=$x output file=output_${x}.csv
done

Cool, I will definitely have a look at pycmdstan.
But why do you think that would change the variation in run time?

Memory leaks somewhere in PyStan? Just a guess.

  1. Is there a way for me to test that?
  2. Normally when sampling, Python takes 100% of my CPU, spread out over 4 Python processes, which I suppose are the 4 chains. Now only 1 process is running and my CPU is at 30%. Is it possible that 3 of the 4 chains finished in 4 hours and one chain is still going?

One thing to keep in mind is that Rhat is a pretty crude diagnostic and you can still get bad fits even if the Rhats look fine. Such extreme variation in computation usually means that the chains are returning inconsistent results, which indicates a bad fit, but you can’t automatically rule out natural process variation.

To better isolate the problem you’ll want to look at the adaptation results. In PyStan you can grab the step size for the first chain with

import stan_utility  # helper module from the case study linked below

model = stan_utility.compile_model('model_name.stan')
fit = model.sampling(data=data, seed=XXXXXX)

fit.get_sampler_params(inc_warmup=False)[0]['stepsize__'][0]

and then for the other three chains by changing the first index. In the same way you can grab the distribution of leapfrog steps for each chain with

fit.get_sampler_params(inc_warmup=False)[0]['n_leapfrog__']

If the step sizes or distribution of leapfrog steps across chains are inconsistent then speed differences are indicative of a fitting problem. If they are the same then the speed differences are due to how the OS executed each process.
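
For example, a minimal sketch of that per-chain comparison, assuming the fit object from the snippet above ('stepsize__' and 'n_leapfrog__' are the standard PyStan sampler-parameter keys):

import numpy as np

# Compare adaptation results across chains: the adapted step size is
# constant after warm-up, while the leapfrog counts form a distribution.
sampler_params = fit.get_sampler_params(inc_warmup=False)
for i, chain in enumerate(sampler_params):
    step_size = chain['stepsize__'][0]
    n_leapfrog = np.asarray(chain['n_leapfrog__'])
    print('chain %d: step size %.4f, leapfrog steps min/median/max = %d/%.0f/%d'
          % (i, step_size, n_leapfrog.min(), np.median(n_leapfrog), n_leapfrog.max()))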

In general I recommend using the suite of diagnostics discussed in https://betanalpha.github.io/assets/case_studies/pystan_workflow.html.


I believe I have already read most of the information in the link, and I am saving the output using "to_dataframe" and "summary".
I cannot speak for the current run (SH degree 27 = 783 dimensions) since it is still going, but a similar thing happened at a lower degree (SH degree 25 = 675 dimensions).
The tree depth is always 5, the acceptance rate is between 60% and 100%, and the step size is 0.11-0.12 for all four chains.
Would the conclusion then be OS (Win10) problems, as was suggested?

Given the same step size and tree depth distributions (unlike the step size the tree depth will vary from iteration to iteration so you want to look at the entire distribution) the performance should be identical. Technically the mass matrix is also a consideration so you’ll want to look at that as well (it should be in the fit object somewhere), but it would be odd if the mass matrices were different but the step sizes were the same.
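
A small sketch of how that check might look, assuming the same fit object; in PyStan 2, get_adaptation_info() should return one free-form text block per chain containing the adapted step size and the diagonal of the inverse mass matrix (if it is not available in your PyStan version, the same information is written to CmdStan's CSV output):

# Print each chain's adaptation block; it lists the adapted step size
# and the diagonal of the inverse mass matrix as plain text.
for i, info in enumerate(fit.get_adaptation_info()):
    print('--- chain %d ---' % i)
    print(info)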

I tried to upload the sample statistics, but they are in a .npy file, which is not an authorized format. From what I can see, n_leapfrog is always 31, which is a bit weird if, as you say, it should vary at least slightly.

Can you save the results as csv?

If you are using algorithm="NUTS", the default, I'd say it is very (extremely?) unlikely that the number of integration steps stays constant along the entire chain. That is how the regular HMC algorithm works, and probably the main motivation for a dynamic criterion for the number of integration steps at each MCMC iteration is that there is not normally a fixed number that works efficiently everywhere in parameter space.
So that is weird, but I think there are others who could give you better advice than I could if that keeps happening.

I haven't sampled spherical harmonics before, but I did have an issue with greatly varying run times in PyStan for a model with many parameters. I found that some of the chains that finished faster were actually converging around a lower value of the posterior (this thread is about a different issue, but the PyStan traces illustrate what I'm talking about). I think that can be attributed to the step size and mass matrix adapted separately for each chain, which were not good enough for some of them.

You said Rhat ~ 1, so that is probably not exactly the case, but it can still be that the traces differ during burn-in, and that could explain the difference in run time - a chain that takes longer to converge may be using very small step sizes and wasting computation. So maybe it's useful to look at the complete traces of 'lp__' and 'stepsize__' and see whether that could be the reason.
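
A sketch of pulling out those complete traces in PyStan, assuming the fit object from earlier and PyStan 2's convention that the unpermuted draws have lp__ appended as the last column:

import matplotlib.pyplot as plt

# Draws per chain including warm-up; shape is (iterations, chains, parameters + 1)
# with lp__ as the last column.
draws = fit.extract(permuted=False, inc_warmup=True)
lp = draws[:, :, -1]

# Sampler parameters per chain, also including warm-up,
# where the step size is still being adapted.
sampler_params = fit.get_sampler_params(inc_warmup=True)

fig, (ax_lp, ax_eps) = plt.subplots(2, 1, sharex=True)
for i in range(lp.shape[1]):
    ax_lp.plot(lp[:, i], label='chain %d' % i)
    ax_eps.plot(sampler_params[i]['stepsize__'])
ax_lp.set_ylabel('lp__')
ax_lp.legend()
ax_eps.set_ylabel('stepsize__')
ax_eps.set_yscale('log')
ax_eps.set_xlabel('iteration (including warm-up)')
plt.show()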

I tried, but it exceeds the limit (68 MB) and zipped files are not allowed. I have attached a OneDrive link from which it can be downloaded.
I believe you know the format, but just to be safe:

Content:
C0: Chain
C1: Chain_idx
C2: Warmup
C3: Divergent
C4: Energy
C5: Treedepth
C6: Acc rate
C7: Step-size
C8: n_leapfrog
C9: Gauss 1
…
C end-1: Gauss n
C end: lp

Yes, I am using algorithm="NUTS".
Thanks for the thread link, I will definitely have a look at it.