What happens when the iteration progress stops/freezes?

I was running a Stan model (on AWS). It got up to 400/2000 iterations and then stopped there for 30+ hours. What could cause this?

Thanks!

Most likely, because this is during adaptation, the stepsize was adapted down to something << 1e-6, so the sampler is taking minuscule steps and only looks frozen. The best way to avoid this is to re-parameterize the model, but there's no generic re-parameterization advice that always leads to success.

You can check this by writing the adaptation output to a file (the save_warmup=1 argument in CmdStan; rstan has something similar) and loading the .csv file that gets saved. It should have a stepsize__ column.
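If it helps, here's a rough sketch of that check from Python (the CSV name is a placeholder and the 1e-6 cutoff is just illustrative):

```
import pandas as pd

# CmdStan prefixes metadata lines with '#'; comment="#" skips them
draws = pd.read_csv("output.csv", comment="#")

stepsize = draws["stepsize__"]
print("most recent stepsize:", stepsize.iloc[-1])
if (stepsize < 1e-6).any():
    print("stepsize collapsed below 1e-6; the chain is crawling, not frozen")
```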

Thanks! I should mention, when I ran the model the first time I used the defaults and got this warning:
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
so I increased it to max_treedepth = 15.
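(For reference, setting that option and saving warmup draws looks something like this sketch, shown with cmdstanpy from Python and placeholder file names; in rstan the equivalent is control = list(max_treedepth = 15).)

```
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="icar_model.stan")  # placeholder file name
fit = model.sample(
    data="model_data.json",  # placeholder data file
    max_treedepth=15,        # default is 10
    save_warmup=True,        # keep warmup draws so stepsize__ can be inspected
)
```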

The model I’m running is a slight modification of the one in this vignette:
http://mc-stan.org/users/documentation/case-studies/icar_stan.html

could you share the modified code? there are a few models in that vignette…

Sure!

We want to estimate rates of y = 1 at the level of HSA (hospital service area). HSAs are nested within HRRs (hospital referral regions).

Thanks so much!

p.s. what’s the best way to format code in these message boards so it isn’t awful to look at? Mitzi answered!

```
data {
  // control flags
  int<lower=0, upper=1> t_prior; // 0=normal, 1=t
  int<lower=0, upper=1> has_spatial; // 0=false, 1=true
  int<lower=0, upper=1> has_covar; // 0=false, 1=true
  
  // array sizes
  int N_prac;
  int N_hsa;
  int N_hrr;
  int N_covar;
  int N_edges_hrr;
  int N_edges_hsa;
  
  // index information for group membership
  int i_hsa[N_prac];
  int i_hrr[N_prac];
  
  // data
  matrix[N_prac * has_covar, N_covar] X;
  int y[N_prac];
  
  // spatial/adjacency
  int node_hsa1[N_edges_hsa * has_spatial];
  int node_hsa2[N_edges_hsa * has_spatial];
  int node_hrr1[N_edges_hrr * has_spatial];
  int node_hrr2[N_edges_hrr * has_spatial];
}
transformed data {
  matrix[N_prac * has_covar, N_covar] Q_ast;
  matrix[N_covar, N_covar] R_ast;
  matrix[N_covar, N_covar] R_ast_inverse;
  if (has_covar) {
    // thin and scale the QR decomposition
    Q_ast = qr_Q(X)[, 1:N_covar] * sqrt(N_prac - 1);
    R_ast = qr_R(X)[1:N_covar, ] / sqrt(N_prac - 1);
    R_ast_inverse = inverse(R_ast);
  }
}
parameters {
  real alpha; // intercept
  vector[N_covar * has_covar] theta; // slopes of Q_ast (QR decomposition)
  vector[N_hsa] a_j_raw; // random hsa effects
  vector[N_hrr] b_k_raw; // random hrr effects
  vector[N_hsa * has_spatial] phi_j_std; // spatial hsa effects
  // vector[N_hrr * has_spatial] psi_k_std; // spatial hrr effects
  real<lower=0> tau_a; // precision of random hsa effects
  real<lower=0> tau_b; // precision of random hrr effects
  real<lower=0> tau_phi; // precision of spatial hsa effects
  //real<lower=0> tau_psi; // precision of spatial hrr effects
}
transformed parameters {
  vector[N_prac] eta_i; // linear predictor
  vector[N_hsa] a_j; 
  vector[N_hrr] b_k;  
  vector[N_hsa * has_spatial] phi_j;
  // vector[N_hrr * has_spatial] psi_k;
  real<lower=0> sigma_a;
  real<lower=0> sigma_b;
  real<lower=0> sigma_phi;
  // real<lower=0> sigma_psi;
  
  // convert precisions to sigmas
  sigma_a = inv(sqrt(tau_a));
  sigma_b = inv(sqrt(tau_b));
  sigma_phi = inv(sqrt(tau_phi));
  // sigma_psi = inv(sqrt(tau_psi));
  
  // non-centered parameterization ("Matt trick") for the random effects
  a_j = a_j_raw * sigma_a;
  b_k = b_k_raw * sigma_b;
  // non-centered parameterization for the spatial effects
  phi_j = phi_j_std * sigma_phi;
  // psi_k = psi_k_std * sigma_psi;
    
  // linear predictor
  eta_i = alpha + a_j[i_hsa] + b_k[i_hrr];
  if (has_covar)
    eta_i += Q_ast * theta;
  if (has_spatial)  
    eta_i += phi_j[i_hsa]; // += psi_k[i_hrr] + phi_j[i_hsa];
}
model {
  y ~ bernoulli_logit(eta_i); // vectorized Bernoulli likelihood on the logit scale
  
  // weakly-informative priors
  alpha ~ normal(0,1);
  theta ~ normal(0,1);
  
  // raw for random effects
  if (t_prior == 0) {
    a_j_raw ~ normal(0, 1);
    b_k_raw ~ normal(0, 1);
  } else if (t_prior == 1) {
    a_j_raw ~ student_t(4, 0, 1);
    b_k_raw ~ student_t(4, 0, 1);
  }
  
  // Carlin WinBUGS priors on spatial precision hyperparameters
  tau_phi ~ gamma(1, 1);
  // tau_psi ~ gamma(1, 1);
  
  // spatial effects
  if (has_spatial) {
    // ICAR prior
    target += -0.5 * dot_self(phi_j_std[node_hsa1] - phi_j_std[node_hsa2]);
    // target += -0.5 * dot_self(psi_k_std[node_hrr1] - psi_k_std[node_hrr2]);
    // soft sum-to-zero constraint
    sum(phi_j_std) ~ normal(0, 0.001 * N_hsa);
    // sum(psi_k_std) ~ normal(0, 0.001 * N_hrr);
    
    // Carlin WinBUGS priors on random effects precision hyperparameters
    tau_a ~ gamma(3.2761, 1.81);
    tau_b ~ gamma(3.2761, 1.81);
  } else {
    tau_a ~ normal(0,1);
    tau_b ~ normal(0,1);
  }
}
generated quantities {
  vector[N_covar] beta;
  if (has_covar) {
    beta = R_ast_inverse * theta; // coefficients on X
  }
}
```

have you tried using different priors? the Carlin WinBUGS priors help Gibbs samplers (?) but aren't appropriate for HMC.

set off the code by putting a line consisting of three backticks “```” before and after it.

working through your model, hope to have further comments soon.

I haven’t yet, no! What kind of priors do you recommend for HMC?

Thanks so much!

Ah, I realized that we haven’t yet tried the BYM2 parameterization (again, see Mitzi’s awesome write-up: http://mc-stan.org/users/documentation/case-studies/icar_stan.html). Mitzi, do you think that would help a lot here?

That’s a good point. I wonder how hard it would be to change the code to enforce a minimum stepsize that adaptation can’t go below. That way, instead of this “freezing” or slowing-to-a-crawl behavior, you’d just get divergences or lower accept rates. I think this would “fail faster” and be a more informative clue to the user that they should re-parameterize.

I should look into how hard that would be to do. I think @Bob_Carpenter mentioned that one of the upcoming releases of Stan will allow you to specify the stepsize manually.
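(For what it’s worth, you can get something close today by disabling adaptation and fixing the stepsize yourself; here’s a sketch with cmdstanpy, with an arbitrary stepsize value and placeholder file names. The trade-off is that you give up mass-matrix adaptation entirely.)

```
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="icar_model.stan")  # placeholder file name
fit = model.sample(
    data="model_data.json",
    adapt_engaged=False,  # turn warmup adaptation off entirely...
    step_size=0.05,       # ...so this (arbitrary) stepsize stays fixed
)
```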

My preference is to deal with the output so that we can make online monitoring easier. I have an abstract due, so I didn’t get a chance to work on it this weekend like I wanted to … :( It would be pretty easy to fire a warning if the stepsize drops, but I’ve also had models recover (because of mass-matrix scaling adaptation, I think). Even currently you can load a partial .csv file, and plotting log10(stepsize__) is a solid way of spotting the problem… we just don’t have tooling in place to make it easy for users.
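Something like this polling sketch is what I have in mind (the path, polling interval, and cutoff are all placeholders):

```
import time
import pandas as pd

CSV_PATH, CUTOFF = "output.csv", 1e-6  # placeholders

for _ in range(360):  # poll every 10 s for up to an hour
    try:
        # on_bad_lines="skip" tolerates a partially written final row
        draws = pd.read_csv(CSV_PATH, comment="#", on_bad_lines="skip")
        # stepsize__ varies during warmup; after warmup it is constant
        stepsize = draws["stepsize__"].dropna()
    except (FileNotFoundError, pd.errors.EmptyDataError, KeyError):
        stepsize = pd.Series(dtype=float)
    if len(stepsize) > 0 and stepsize.iloc[-1] < CUTOFF:
        print(f"warning: stepsize is {stepsize.iloc[-1]:.2e}; "
              "consider halting and re-parameterizing")
        break
    time.sleep(10)
```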

Yeah, a warning is probably better, because you don’t always want to halt sampling; like you said, the mass-matrix adaptation can let you recover from a too-small stepsize. I guess with a warning the user can decide whether to halt sampling or try their luck and see if the adaptation fixes it.

Yeah, that’s a good idea. I think ShinyStan might be able to plot the stepsize used during each warmup sample, but you have to wait until after sampling has completed.

yes, most definitely the BYM2 parameterization is what you want to handle combined heterogeneous and spatial random effects.

sorry I didn’t see this message sooner.