Slow warmup for spline with less informative prior information

I am fitting something closely analogous to a hierarchical logistic regression model with a spline function of one x variable. My question concerns warmup iterations. Typically I have more informative constraints on the spline parameters (obtained from a preliminary ML fit). See typical chain plot below.

When I allow a very uninformative prior on the spline, these parameters converge somewhat more slowly.

My general question concerns speeding up the convergence of similar runs of this model. Are there parameters of the warmup phase that I could do a better job tuning after fitting this model once that would make future runs quicker?


See stan code below:

data {
  int<lower=1> N;  // total number of observations
  int<lower=1> NC; // total number of observations for controls
  int<lower=0> Y[N];
  int Y_control[NC];
  int trials[N];  // response variable
  int trials_control[NC]; // Trials for controls  
  matrix[N, 9] X; // X matrix
  //matrix[9, 9] x_cov;  
  //vector[9] beta_mu;
  int<lower=1> N_seasonXfield;
  int<lower=1> seasonXfield_ID[N];
  int<lower=1> N_fieldXdate; 
  int fieldXdate_ID[N]; 
  int fieldXdate_control_ID[NC];
  int<lower=1> N_can;  // number of grouping levels
  int<lower=1> can_ID[N];  // grouping indicator per observation
  int<lower=1> N_shoot;  // number of grouping levels
  int<lower=1> N_shoot_control;  // number of grouping levels
  int<lower=1> shoot_ID[N];  // grouping indicator per observation
  int<lower=1> shoot_control_ID[NC];  // grouping indicator per observation
  int<lower=1> N_bud;  // number of grouping levels
  int<lower=1> N_bud_control;  // number of grouping levels
  int<lower=1> bud_ID[N];  // grouping indicator per observation
  int<lower=1> bud_control_ID[NC];  // grouping indicator per observation

  int prior_only;

parameters {
  vector[9] beta;
  real <lower=0> S_seasonXfield; 
  vector[N_seasonXfield] z_seasonXfield;
  vector[N_fieldXdate] z_fieldXdate;
  vector[N_fieldXdate] z_fieldXdate_fi; // direct prior on these
  real <lower=0> S_fieldXdate;
  real <lower=0> S_fieldXdate_fi;
  real <lower=0> S_can;
  vector[N_can] z_can;  
  real <lower=0> S_Shoot; 
  vector[N_shoot] z_Shoot;
  //real <lower=0> S_Shoot_fi; 
  //vector[N_shoot] z_Shoot_fi;
  vector[N_shoot_control] z_Shoot_control;  
  real <lower=0> S_bud;
  vector[N_bud] z_bud;
  //real <lower=0> S_bud_fi;
  //vector[N_bud] z_bud_fi;
  vector[N_bud_control] z_bud_control;  

transformed parameters {
  vector[N_seasonXfield] r_seasonXfield;
  vector[N_fieldXdate] r_fieldXdate;
  vector[N_fieldXdate] r_fieldXdate_fi;
  vector[N_can] r_can; 
  vector[N_shoot] r_Shoot;
  //vector[N_shoot] r_Shoot_fi;
  vector[N_shoot_control] r_Shoot_control;  
  vector[N_bud] r_bud; 
  //vector[N_bud] r_bud_fi;
  vector[N_bud_control] r_bud_control;  

  r_seasonXfield = S_seasonXfield * z_seasonXfield;
  r_fieldXdate = z_fieldXdate*S_fieldXdate;
  r_fieldXdate_fi = z_fieldXdate_fi*S_fieldXdate_fi;
  r_can = S_can * z_can; 
  r_Shoot = S_Shoot * z_Shoot;
  //r_Shoot_fi = S_Shoot_fi * z_Shoot_fi;
  r_Shoot_control = S_Shoot * z_Shoot_control;
  r_bud = S_bud * z_bud;
  //r_bud_fi = S_bud_fi * z_bud_fi;
  r_bud_control = S_bud * z_bud_control;


model {
  if(!prior_only) {
    vector[N] lo;
    vector[NC] loc;  
    lo = X * beta + r_seasonXfield[seasonXfield_ID] + r_fieldXdate[fieldXdate_ID] + r_can[can_ID]  + r_Shoot[shoot_ID] + r_bud[bud_ID];
    loc = r_fieldXdate_fi[fieldXdate_control_ID] + r_Shoot_control[shoot_control_ID] + r_bud[bud_control_ID];

    target += binomial_lpmf(Y | trials, inv_logit(lo) .* inv_logit(r_fieldXdate_fi[fieldXdate_ID]+r_Shoot[shoot_ID])); // r_Shoot_fi[shoot_ID] doubles the time required, having to compute this
    //target += binomial_logit_lpmf(Y | trials, lo);
    Y_control ~ binomial_logit(trials_control, loc);

  //to_vector(beta) ~ multi_normal(beta_mu, x_cov);
  beta ~ normal(0, 6);
  z_seasonXfield ~ normal(0, 1);
  S_seasonXfield ~ inv_gamma(3, 0.5);
  z_fieldXdate ~ normal(0, 1);
  z_fieldXdate_fi ~ normal(0, 1);
  S_fieldXdate ~ inv_gamma(3,3);//normal(3, 6); // Change to a invgamma
  S_fieldXdate_fi ~ inv_gamma(3,3);//normal(3, 6); // Change to a invgamma

  z_can ~ normal(0, 1);
  S_can ~ inv_gamma(3, 3);
  z_Shoot ~ normal(0, 1);
  //z_Shoot_fi ~ normal(0, 1);
  z_Shoot_control ~ normal(0, 1);
  S_Shoot ~ inv_gamma(3, 3); //3,3
  //S_Shoot_fi ~ inv_gamma(3,3);//inv_gamma(4,1);//normal(1,0.15);//inv_gamma(3,3);//normal(1,0.15);//inv_gamma(5, 3);//normal(1,0.15);//;//;//;
  z_bud ~ normal(0, 1);
  z_bud_control ~ normal(0, 1);
  S_bud ~ inv_gamma(3, 3);
  //S_bud_fi ~ normal(1,0.15);

first is the x-axis on your graph iter? Then mixing after iter = 100 seems fast enough.
Re slow mixing in spline and use “information from preliminary ML fit”: Perhaps a good idea is to obtain MLE (or Type-2 MAP if you want) estimate of your spline coefficient first, which I believe is what you mean by preliminary ML fit, then start your sampler near these points (you can use user-specified initial values in Stan).


You are right, 100 iterations isn’t that long (though I probably need a bit more to be safe). I know I’m going to be running this 50 more times and so the increase in time needed for the mixing is going to likely going to translate into more lag time on the modeling front. I am already starting the initial conditions by random simulation from the covariance matrix obtained from the ML estimate, but that is only the spline coefficients. I’ll try initializing the variance components to reasonable values also.

Wondering what you think about using prior distribution obtained from the ML estimates and if that is bad practice. I was setting the prior covariance matrix for the spline directly as 100 x ML_Cov, and this makes the mixing faster. Is this enough to make it likely that the prior distribution will not have a major effect on the solution, but will at least keep the spline in a reasonable range. If so, how to argue this. If not good practice, would it at least be good workflow to set prior covariance from ML estimate until all the other kinks are worked out (given that I’ve run both with ML covariance prior and less informative prior and posterior seems pretty similar)?


I believe that stan by default uses the first 150 iterations for warmup. In your case, warmup = 50 seems not really a high number for that purpose. That being said, I think it is generally a good practice to use an informative prior to stabilize your computation. I do not know about ``100 x ML_Cov’': it depends on your sample size n, if n is large then the estimated ML_Cov will be rather small. Maybe that is why you obtain normal(1,0.15)?

Thanks. I think this brings up the question, how would you obtain an informative prior for a spline (in general). Seems like the solution might be to think of the prior distribution as being set by assuming that the slope of the spline on any given segment is unlikely to be more than a certain value. Not sure if this seems like a reasonable way to constrain the spline coefficients. I think it would not result in a simple set of priors, however, as the slope at any given x depends on multiple spline parameters.

I would try it first with a softer constraint rather than a hard interval constraint.

Here’s the relevant doc: 15.2 HMC algorithm parameters | Stan Reference Manual

By default it’s 1000 warmup iterations, with 75 in Phase I, 50 in Phase III, and the remaining 875 in Phase II.

Bob - I agree with you to do a softer constraint. Thanks!