Time Limit on Supercomputer; How to Finish from Where Iterations Left Off?

Hi all,

I am using a supercomputer with a 48-hour time limit, and I have four different time points (four jobs), only one of which finishes on time. The others return nothing, which is a sign that the jobs ran too long. I am wondering how I can work around this… maybe set each of the four jobs to run only 1000 iterations, and then restart them from where they left off. This seems much easier than going the parallel-coding route, which looks like it has a steeper learning curve.

I’ve listed the current (second) model first. I created the second model because the posteriors for time point 3 didn’t seem to show that participants perform better as the response window gets longer.

There are ~1000 participants and fourteen conditions for which I am estimating thresholds via psychometric functions. As an aside, I’m not sure why the current model takes longer to run than the first model I created, but it’s really the first question that matters most.

data{  // Everything that must be input by the user
  int<lower=4> n_total;                         // Total number of trials to analyze
  int<lower=2> n_subjects;                      // Number of unique observers
  int<lower=2> n_levels;                        // Number of levels of Factor
  real intensity[n_total];                      // Intensity of the stimulus on each trial
  int<lower=0> subject[n_total];                // Observer on each trial
  int<lower=0> level[n_total];                  // Level of Factor on each trial
  int<lower=0,upper=1> correct[n_total];        // Whether the response was correct (1) on each trial
  real<lower=0,upper=1> chance_performance;     // Chance performance for experiment (e.g., 1/n for n-AFC)
}
transformed data{
  int df_levels_sA;          // Number of degrees of freedom in the interaction
  int n_levels_sA;          // Number of levels in the interaction
  
  df_levels_sA = (n_levels - 1) * (n_subjects - 1);
  n_levels_sA = n_levels * n_subjects;
}
parameters{
  vector<lower=0,upper=1>[n_subjects] lapse;        // Observer's lapse rate
  real mum;
  real<lower=0> muw;
  vector[n_levels-1] fA;
  vector[n_subjects-1] sA;
  vector[n_subjects-1] sB;
  //matrix[n_subjects-1,n_levels-1] fsA;
  vector[(n_subjects - 1) * (n_levels-1)] fsA;
}

transformed parameters {
  real m[n_subjects,n_levels];
  //real w[n_subjects,n_levels];
  real lapse_alpha;
  real lapse_beta;
  vector [n_total] psi;
  {
    real mean_beta;              // Mean of beta prior for individual lapse rate
    real betaEta;                       // Precision parameter of beta prior for individual lapse rate
    real threshold[n_total];
    //real width[n_total];   // unused in this model: the width is muw, shared across subjects and conditions
    vector[n_levels] factor_alpha = append_row(fA, -sum(fA));
    vector[n_subjects] subject_alpha = append_row(sA, -sum(sA));
    vector[n_subjects] subject_beta = append_row(sB, -sum(sB));
    matrix[n_subjects,n_levels] interaction_alpha;


    interaction_alpha = rep_matrix(0, n_subjects, n_levels);
    for (sj in 1:(n_subjects - 1)) {
      for (l in 1:(n_levels - 1)) {
        interaction_alpha[sj, l] = fsA[(sj - 1) * (n_levels - 1) + l];
      }
      interaction_alpha[sj, n_levels] = -1 * sum(interaction_alpha[sj, ]);
    }
    for (l in 1:n_levels) {
      interaction_alpha[n_subjects, l] = -1 * sum(interaction_alpha[, l]);
    }


    for (sj in 1:n_subjects) {
      for (l in 1:n_levels) {
        m[sj, l] = mum + factor_alpha[l] + subject_alpha[sj] + interaction_alpha[sj, l];
        //w[sj, l] = exp(muw + subject_beta[sj]);
      }
    }

    for (tr in 1:n_total) {
      threshold[tr] = m[subject[tr], level[tr]];
      //width[tr] = w[subject[tr], level[tr]];
      psi[tr] = chance_performance
                + (1 - lapse[subject[tr]] - chance_performance)
                * inv_logit((intensity[tr] - threshold[tr]) / muw);
      if (is_nan(psi[tr])) {
        print("lapse = ", lapse[subject[tr]]);
        print("width = ", muw);   // muw is the shared width in this model
        print("threshold = ", threshold[tr]);
      }
    }

    mean_beta  = 0.01;
    betaEta = 100;
    lapse_alpha = mean_beta * betaEta;
    lapse_beta = (1-mean_beta) * betaEta ;
  }
}

model {
  //mean_beta ~ beta(1,60);
  //betaEta ~ gamma(1,0.01);
  fsA ~ normal(0, inv(sqrt(1 - inv(n_levels_sA)))); // Produces a standard normal on interaction_alpha
  fA ~ normal(0, inv(sqrt(1 - inv(n_levels))));     // Produces a standard normal on factor_alpha
  sA ~ normal(0, inv(sqrt(1 - inv(n_subjects))));   // Produces a standard normal on subject_alpha
  sB ~ normal(0, inv(sqrt(1 - inv(n_subjects))));   // Produces a standard normal on subject_beta

  mum ~ gamma(2,4);
  muw ~ gamma(2,4);
  lapse ~ beta(lapse_alpha, lapse_beta);
  correct ~ bernoulli(psi);
}

The first model I created:

data{  // Everything that must be input by the user
  int<lower=4> n_total;                         // Total number of trials to analyze
  int<lower=2> n_subjects;                      // Number of unique observers
  int<lower=2> n_levels;                        // Number of levels of Factor
  real intensity[n_total];                      // Intensity of the stimulus on each trial
  int<lower=0> subject[n_total];                // Observer on each trial
  int<lower=0> level[n_total];                  // Level of Factor on each trial
  int<lower=0,upper=1> correct[n_total];        // Whether the response was correct (1) on each trial
  real<lower=0,upper=1> chance_performance;     // Chance performance for experiment (e.g., 1/n for n-AFC)
}
transformed data{
  int df_levels_sA;          // Number of degrees of freedom in the interaction
  int n_levels_sA;          // Number of levels in the interaction
  
  df_levels_sA = (n_levels - 1) * (n_subjects - 1);
  n_levels_sA = n_levels * n_subjects;
}
parameters{
  vector<lower=0,upper=1>[n_subjects] lapse;        // Observer's lapse rate
  real mum;
  real muw;
  vector[n_levels-1] fA;
  vector[n_subjects-1] sA;
  vector[n_subjects-1] sB;
  //matrix[n_subjects-1,n_levels-1] fsA;
  vector[(n_subjects - 1) * (n_levels-1)] fsA;
}

transformed parameters {
  real m[n_subjects,n_levels];
  real w[n_subjects,n_levels];
  real lapse_alpha;
  real lapse_beta;
  vector [n_total] psi;
  {
    real mean_beta;              // Mean of beta prior for individual lapse rate
    real betaEta;                       // Precision parameter of beta prior for individual lapse rate
    real threshold[n_total];
    real width[n_total];
    vector[n_levels] factor_alpha = append_row(fA, -sum(fA));
    vector[n_subjects] subject_alpha = append_row(sA, -sum(sA));
    vector[n_subjects] subject_beta = append_row(sB, -sum(sB));
    matrix[n_subjects,n_levels] interaction_alpha;


    interaction_alpha = rep_matrix(0, n_subjects, n_levels);
    for (sj in 1:(n_subjects - 1)) {
      for (l in 1:(n_levels - 1)) {
        interaction_alpha[sj, l] = fsA[(sj - 1) * (n_levels - 1) + l];
      }
      interaction_alpha[sj, n_levels] = -1 * sum(interaction_alpha[sj, ]);
    }
    for (l in 1:n_levels) {
      interaction_alpha[n_subjects, l] = -1 * sum(interaction_alpha[, l]);
    }


    for (sj in 1:n_subjects) {
      for (l in 1:n_levels) {
        m[sj, l] = mum + factor_alpha[l] + subject_alpha[sj] + interaction_alpha[sj, l];
        w[sj, l] = exp(muw + subject_beta[sj]);
      }
    }

    for (tr in 1:n_total) {
      threshold[tr] = m[subject[tr], level[tr]];
      width[tr] = w[subject[tr], level[tr]];
      psi[tr] = chance_performance
                + (1 - lapse[subject[tr]] - chance_performance)
                * inv_logit((intensity[tr] - threshold[tr]) / width[tr]);
      if (is_nan(psi[tr])) {
        print("lapse = ", lapse[subject[tr]]);
        print("width = ", width[tr]);
        print("threshold = ", threshold[tr]);
      }
    }

    mean_beta  = 0.01;
    betaEta = 100;
    lapse_alpha = mean_beta * betaEta;
    lapse_beta = (1-mean_beta) * betaEta ;
  }
}

model {
  //mean_beta ~ beta(1,60);
  //betaEta ~ gamma(1,0.01);
  fsA ~ normal(0, inv(sqrt(1 - inv(n_levels_sA)))); // Produces a standard normal on interaction_alpha
  fA ~ normal(0, inv(sqrt(1 - inv(n_levels))));     // Produces a standard normal on factor_alpha
  sA ~ normal(0, inv(sqrt(1 - inv(n_subjects))));   // Produces a standard normal on subject_alpha
  sB ~ normal(0, inv(sqrt(1 - inv(n_subjects))));   // Produces a standard normal on subject_beta

  mum ~ gamma(2,4);
  muw ~ gamma(2,4);
  lapse ~ beta(lapse_alpha, lapse_beta);
  correct ~ bernoulli(psi);
}

I had commonly taken the final sample as the initialisation for the next run of the model. My old code used the extract() function and pulled the last row, creating an initialisation list recording where each parameter was at the last sample, then feeding that back in as the starting point for the next run via the init argument. With a little more tweaking you can even set the NUTS calibration parameters and remove the need for a warmup.
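Roughly, that approach looks like this (a minimal sketch, assuming an rstan fit object named fit and a data list named stan_data; the object names are placeholders):

library(rstan)

# Posterior draws as a named list, one element per parameter
draws <- rstan::extract(fit)
draws$lp__ <- NULL   # drop the log-posterior

# Keep only the final saved draw of each parameter, preserving its shape
last_draw <- lapply(draws, function(d) {
  if (is.null(dim(d)) || length(dim(d)) == 1) {
    d[length(d)]
  } else if (length(dim(d)) == 2) {
    d[nrow(d), ]
  } else {
    d[dim(d)[1], , ]
  }
})

# Reuse the compiled model and start the next run from that draw
refit <- stan(fit = fit, data = stan_data,
              init = list(last_draw),   # one init list per chain
              chains = 1, iter = 1000)

One caveat: extract() permutes draws across chains, so the “last row” is not necessarily the chronologically final state of any one chain; in practice any high-probability draw still works as a warm start.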

BTW,
I’m not sure how this would work on your supercomputer (is it multicore or multinode?), but note that you can materially improve performance by using the reduce_sum or map_rect functions to parallelise your model block.
On a supercomputer this might require CmdStan and MPI compilation. I have a Xeon CPU with 10 cores, so I can speed up my code considerably by using reduce_sum or map_rect with a single chain but 10 threads. My code runs almost 10 times faster, and I am just using basic cmdstanr.
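For reference, the cmdstanr side of that is only a couple of lines (a sketch; the file name, data list, and thread count are placeholders, and the Stan program itself still needs its likelihood wrapped in reduce_sum):

library(cmdstanr)

# Compile with threading support
mod <- cmdstan_model("psychometric_model.stan",
                     cpp_options = list(stan_threads = TRUE))

# One chain, ten threads, as described above
fit <- mod$sample(data = stan_data,
                  chains = 1,
                  threads_per_chain = 10)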

I am also surprised there is no error thrown by “vector [n_total]”, as there should be no space after vector as far as I am aware.

Hmmm… you’re talking about the transformed parameters block? Yeah, I didn’t receive any error about that.

The supercomputer is multinode and multicore (so it’s simple to run the four chains on separate nodes).

I thought reduce_sum and map_rect were a bit more involved and required a decent lift to parallelize code? I figured it would just be simpler to have the computer run a little longer (probably not the best solution, but an easy short-term fix). But if it’s not that difficult, I can have the two packages installed on the supercomputer. Is there any good info on adjusting code for cmdstan?

No,

I meant in your model block. Most of your log-likelihood statements can be accelerated easily with reduce_sum. It is well documented. Here is a good example:

I think you will be much happier if you use the multithreading capability. It is already built into CmdStan, and once compiled into an executable it will automatically use multithreading, i.e. there are no extra packages to install.

Using MPI is another story. That would parallelise across the full power of the supercomputer. The model would need to be compiled differently, so there would be work to do here, but the code would likely run more than 10 times faster.

I would also highly recommend looking at brms implementations of your model. I suspect that, with some tinkering, your model could be written in lme4-style syntax, since it is a Bernoulli model at its core. brms would dramatically reduce the amount of coding you need to do, as it writes the Stan code for you and will automatically include code for multithreading.

something like:

model <- brm(
  bf(correct ~ chance_performance
       + (1 - lapse - chance_performance)
       * inv_logit((intensity - threshold) / muw),
     lapse ~ subject,
     threshold ~ mum + factor_alpha + subject_alpha + interaction_alpha,  # this needs work
     muw ~ 1,
     nl = TRUE),
  family = bernoulli(link = "identity"),
  data = yourdataframe,   # re-express your data in tabular format
  prior = yourpriors,     # set your priors here
  backend = "cmdstanr"
)

I hope this helps

My understanding (and experience) is that passing model parameter values is insufficient. You also need to pass the sampler parameters (step size and inverse metric). See Continue Warmup / Sampling after Interruption, specifically the response by @mitzimorris.
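A hedged sketch of what that looks like with cmdstanr (building on the init idea above; check the argument names against the linked thread): save the adapted step size and inverse metric from the first run, then start the next run with adaptation switched off.

library(cmdstanr)

# After the first (time-limited) run:
step_sizes  <- fit$metadata()$step_size_adaptation   # adapted step size per chain
inv_metrics <- fit$inv_metric(matrix = FALSE)        # diagonal inverse metric per chain

# Resume sampling without warmup, reusing chain 1's adaptation
# (last_draw_inits is a placeholder for per-chain init lists built from the final draws)
fit2 <- mod$sample(data = stan_data,
                   init = last_draw_inits,
                   step_size = step_sizes[1],
                   inv_metric = inv_metrics[[1]],
                   adapt_engaged = FALSE,
                   iter_warmup = 0,
                   iter_sampling = 1000)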


I found this example by @ahartikainen to be very useful.


Lol… I first began with brms and ultimately switched to Stan because I wasn’t sure how to pull out the threshold parameters for all 14 conditions for the ~1000 participants. The other issue was that I knew how I’d code the model in something like lme4 (below), but I wasn’t sure how to code the condition * norm interaction (where norm is the normalized response window) in brms.

glmer(response ~ condition * norm + (norm | pid/condition), ...)

This was the model I was working with in brms.

fit.form <- bf(
  response ~ gamma + (1 - lambda - gamma) * Phi((norm - threshold)/spread),
  threshold ~ 1 + (norm|p|pid) + (norm|condition),
  logitgamma ~ 1 + (1|p|pid),
  nlf(gamma ~ inv_logit(logitgamma)),
  logitlambda ~ 1 + (1|p|pid),
  nlf(lambda ~ inv_logit(logitlambda)),
  spread ~ 1 + (1|p|pid) + (1|condition),
  nl = TRUE)

prior <-
  prior(beta(9, 3), class = "b", nlpar = "threshold", lb = 0, ub = 1) +
  prior(beta(1.4, 1.4), class = "b", nlpar = "spread", lb = .005, ub = .5) +
  prior(beta(.5, 8), nlpar = "logitlambda", lb = 0, ub = .1)+
  prior(beta(1, 5), nlpar = "logitgamma", lb = 0, ub = .1)

fit_form <- brm(
  formula = fit.form,
  data = ace.threshold.t1.samp,
  family = bernoulli(link = "identity"),
  prior = prior,
  control = list(adapt_delta = .85, max_treedepth = 15),
  inits = 0,
  chains = 1,
  cores = 16
)

If the condition * norm effect is additive, as it is in your glmer formula, and condition is discrete, then (apologies if my understanding is limited) you can add it to your response formula like this:

response ~ gamma + (1 - lambda - gamma) * Phi((norm - threshold)/spread) + w*norm,
w ~ condition,

so norm is multiplied by a coefficient that differs by condition and is added to the expected response. This is the non-linear formula method for adding a linear term whose gradient varies with condition.

Or have I missed something?

BTW, if your issue was pulling out parameters because they are a function of several other things, I suggest writing the model as a custom brms family in which the parameters you want are specified as parameters of the family. That way they can be pulled out with the add_fitted_draws function and dpar = TRUE, which automatically adds the parameters defined in the family definition.
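As a small illustration of the extraction side (tidybayes assumed; fit and newdata are placeholders):

library(tidybayes)

# With a custom brms family, dpar = TRUE adds a column of draws for every
# distributional parameter defined in the family, not just the mean
fitted_draws <- add_fitted_draws(newdata, fit, dpar = TRUE)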

True enough. That’s what I meant by “a little more tweaking”, but I have had success in my case with just the parameters set as the initialisation. My model was very slow to converge, so I set it up with the last iteration as the initialisation to continue the warmup. I am sure my problems would have abated faster had I known how to pass the sampler parameters as well.

After reading the post, it looks like the cmdstanr backend for brms would be the way to go. Is there sample code out there for a brms implementation with the cmdstanr backend that sets the inv_metric? This seems to be the hard part. See Continue Warmup / Sampling after Interruption.

Note that there seems to be a bug in achieving this; see discussion here.

Thanks, this is super helpful! At this point, I have someone helping me with Stan who has deep knowledge of psychometric thresholds, and I think I’m pretty close to being able to figure everything out. But I do think brms is a powerful and easier alternative, so I’ll definitely play around with your suggestions and see if I can get them to work.

The person who has been giving assistance is a researcher at another university, and he’s pretty busy, so I’m definitely trying to take on the bulk of figuring out the loose ends… getting everything to work on the supercomputer within the two-day limit.

I think the way forward at this point is using cmdstanr and reduce_sum. I looked through the links you provided and couldn’t quite figure it out… and asked the question here (which has gone unanswered): Using "Reduce Sum" to Parallelize Code in Multi-level Psychometric Model. Could you lend some advice on where I might be going wrong?

Sorry your Q has gone unanswered, I’ll respond there…

When you set up the cmdstanr backend for brms, the generated code will show how to implement reduce_sum for each case. So I think you should start with some models in brms with the cmdstanr backend and look at the generated code using the make_stancode function. This has always been very instructive for me.
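For example, something along these lines (formula and data names are placeholders; threading() asks brms to emit the reduce_sum partial-sum code) prints the generated Stan program so you can see how the likelihood gets sliced:

library(brms)

make_stancode(correct ~ condition * norm + (norm | pid),
              data = yourdataframe,
              family = bernoulli(),
              threads = threading(4))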