I was running a Stan model (on AWS). It got up to 400/2000 iterations and then stopped there for 30+ hours. What could cause this?
Thanks!
Most likely, because this happens during adaptation, the stepsize was adjusted to be << 1e-6 and it just looks like the model froze. The best way to avoid this is to re-parameterize the model, but there’s no generic advice for re-parameterizing that always leads to success.
You can check this by writing adaptation output to a file (the `save_warmup=1` parameter in CmdStan; something similar exists in RStan) and loading the .csv file that gets saved. It should have a `stepsize__` column.
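As a minimal sketch of that check (the filename `output.csv` and the pure-stdlib parsing are my assumptions; CmdStan writes `#`-prefixed comment lines followed by a header row that includes `stepsize__`):

```python
import csv

def read_stepsize(path):
    """Read the stepsize__ column from a CmdStan-style CSV.

    Lines starting with '#' are comments; the first remaining
    line is the header naming columns such as stepsize__.
    """
    stepsizes = []
    with open(path) as f:
        reader = csv.reader(line for line in f if not line.startswith("#"))
        header = next(reader)
        idx = header.index("stepsize__")
        for row in reader:
            if row:  # skip blank lines
                stepsizes.append(float(row[idx]))
    return stepsizes

# If the sampler looks frozen, a stepsize far below ~1e-6 is the tell:
# sizes = read_stepsize("output.csv")
# print(min(sizes))
```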
Thanks! I should mention, when I ran the model the first time I used the defaults and got this warning:
http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
so I increased to max_treedepth = 15.
The model I’m running is a slight modification of the one in this vignette:
http://mc-stan.org/users/documentation/case-studies/icar_stan.html
could you share the modified code? there are a few models in that vignette…
Sure!
We want to estimate rates of y = 1 at the level of HSA (hospital service area). HSAs are nested into HRRs (hospital referral regions).
Thanks so much!
p.s. what’s the best way to format code in these message boards so it isn’t awful to look at? Mitzi answered!
```
data {
  // control flags
  int<lower=0, upper=1> t_prior;     // 0=normal, 1=t
  int<lower=0, upper=1> has_spatial; // 0=false, 1=true
  int<lower=0, upper=1> has_covar;   // 0=false, 1=true
  // array sizes
  int N_prac;
  int N_hsa;
  int N_hrr;
  int N_covar;
  int N_edges_hrr;
  int N_edges_hsa;
  // index information for group membership
  int i_hsa[N_prac];
  int i_hrr[N_prac];
  // data
  matrix[N_prac * has_covar, N_covar] X;
  int y[N_prac];
  // spatial/adjacency
  int node_hsa1[N_edges_hsa * has_spatial];
  int node_hsa2[N_edges_hsa * has_spatial];
  int node_hrr1[N_edges_hrr * has_spatial];
  int node_hrr2[N_edges_hrr * has_spatial];
}
transformed data {
  matrix[N_prac * has_covar, N_covar] Q_ast;
  matrix[N_covar, N_covar] R_ast;
  matrix[N_covar, N_covar] R_ast_inverse;
  if (has_covar) {
    // thin and scale the QR decomposition
    Q_ast = qr_Q(X)[, 1:N_covar] * sqrt(N_prac - 1);
    R_ast = qr_R(X)[1:N_covar, ] / sqrt(N_prac - 1);
    R_ast_inverse = inverse(R_ast);
  }
}
parameters {
  real alpha;                            // intercept
  vector[N_covar * has_covar] theta;     // slopes of Q_ast (QR decomposition)
  vector[N_hsa] a_j_raw;                 // random hsa effects
  vector[N_hrr] b_k_raw;                 // random hrr effects
  vector[N_hsa * has_spatial] phi_j_std; // spatial hsa effects
  // vector[N_hrr * has_spatial] psi_k_std; // spatial hrr effects
  real<lower=0> tau_a;   // precision of random hsa effects
  real<lower=0> tau_b;   // precision of random hrr effects
  real<lower=0> tau_phi; // precision of spatial hsa effects
  // real<lower=0> tau_psi; // precision of spatial hrr effects
}
transformed parameters {
  vector[N_prac] eta_i; // linear predictor
  vector[N_hsa] a_j;
  vector[N_hrr] b_k;
  vector[N_hsa * has_spatial] phi_j;
  // vector[N_hrr * has_spatial] psi_k;
  real<lower=0> sigma_a;
  real<lower=0> sigma_b;
  real<lower=0> sigma_phi;
  // real<lower=0> sigma_psi;
  // convert precisions to sigmas
  sigma_a = inv(sqrt(tau_a));
  sigma_b = inv(sqrt(tau_b));
  sigma_phi = inv(sqrt(tau_phi));
  // sigma_psi = inv(sqrt(tau_psi));
  // matt trick random effects
  a_j = a_j_raw * sigma_a;
  b_k = b_k_raw * sigma_b;
  // matt trick spatial effects
  phi_j = phi_j_std * sigma_phi;
  // psi_k = psi_k_std * sigma_psi;
  // linear predictor
  eta_i = alpha + a_j[i_hsa] + b_k[i_hrr];
  if (has_covar)
    eta_i += Q_ast * theta;
  if (has_spatial)
    eta_i += phi_j[i_hsa]; // += psi_k[i_hrr] + phi_j[i_hsa];
}
model {
  y ~ bernoulli_logit(eta_i); // vectorized
  // weakly-informative priors
  alpha ~ normal(0, 1);
  theta ~ normal(0, 1);
  // raw for random effects
  if (t_prior == 0) {
    a_j_raw ~ normal(0, 1);
    b_k_raw ~ normal(0, 1);
  } else if (t_prior == 1) {
    a_j_raw ~ student_t(4, 0, 1);
    b_k_raw ~ student_t(4, 0, 1);
  }
  // Carlin WinBUGS priors on spatial precision hyperparameters
  tau_phi ~ gamma(1, 1);
  // tau_psi ~ gamma(1, 1);
  // spatial effects
  if (has_spatial) {
    // ICAR prior
    target += -0.5 * dot_self(phi_j_std[node_hsa1] - phi_j_std[node_hsa2]);
    // target += -0.5 * dot_self(psi_k_std[node_hrr1] - psi_k_std[node_hrr2]);
    // soft sum-to-zero constraint
    sum(phi_j_std) ~ normal(0, 0.001 * N_hsa);
    // sum(psi_k_std) ~ normal(0, 0.001 * N_hrr);
    // Carlin WinBUGS priors on random effects precision hyperparameters
    tau_a ~ gamma(3.2761, 1.81);
    tau_b ~ gamma(3.2761, 1.81);
  } else {
    tau_a ~ normal(0, 1);
    tau_b ~ normal(0, 1);
  }
}
generated quantities {
  vector[N_covar] beta;
  if (has_covar) {
    beta = R_ast_inverse * theta; // coefficients on X
  }
}
```
have you tried using different priors? the Carlin WinBUGS priors help Gibbs samplers (I think?) but aren’t appropriate for HMC.
set off the code by putting a line consisting of three backticks (```) before and after your code.
working through your model, hope to have further comments soon.
I haven’t yet, no! What kind of priors do you recommend for HMC?
Thanks so much!
Ah, I realized that we haven’t yet tried the BYM2 parameterization (again, see Mitzi’s awesome write-up: http://mc-stan.org/users/documentation/case-studies/icar_stan.html). Mitzi, do you think that would help a lot here?
That’s a good point. I wonder how hard it would be to change the code to have a minimum stepsize that you can’t go below. That way, instead of this “freezing” or slowing-to-a-crawl behavior, you just get divergences or lower accept rates. I think this would “fail faster” and be a more informative clue to the user that they should re-parameterize.
I should look into how hard that would be to do. I think @Bob_Carpenter mentioned that one of the upcoming releases of Stan will allow you to specify the stepsize manually.
My preference is to deal with the output so that we can make online monitoring easier. I have an abstract due, so I didn’t get a chance to work on it this weekend like I wanted to… :( It would be pretty easy to fire a warning if the stepsize drops, but I’ve also had models recover (because of mass-matrix scaling adaptation, I think). Even currently you can load a partial .csv file, and plotting log10(stepsize__) is a solid way of spotting the problem… we just don’t have tooling in place to make it easy for users.
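As a sketch of that kind of tooling (the name `stepsize_warning` and the 1e-6 floor are my own illustration, not anything shipped with Stan): a small function that scans a possibly partial CmdStan CSV and flags a collapsed stepsize, returning the log10 trace so a caller can plot it or print a warning while the chain is still running:

```python
import csv
import math

def stepsize_warning(path, floor=1e-6):
    """Scan a (possibly partial) CmdStan CSV for a collapsed stepsize.

    Returns (log10_steps, collapsed): the log10(stepsize__) trace and a
    flag that is True if the stepsize ever dropped below `floor`.
    """
    log10_steps = []
    with open(path) as f:
        reader = csv.reader(
            line for line in f if line.strip() and not line.startswith("#"))
        header = next(reader)
        idx = header.index("stepsize__")
        for row in reader:
            if len(row) == len(header):  # a partial file may end mid-row
                log10_steps.append(math.log10(float(row[idx])))
    collapsed = bool(log10_steps) and min(log10_steps) < math.log10(floor)
    return log10_steps, collapsed
```

Polling this against the in-progress .csv every few minutes would give roughly the early warning discussed above, without changing the sampler itself.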
Yeah, a warning is probably better, because you don’t always want to halt sampling, since like you said, the mass matrix adaptation can let you recover from too tiny of a stepsize. I guess with a warning the user can decide whether they want to halt sampling or try their luck and see if the adaptation will fix itself.
Yeah, that’s a good idea. I think ShinyStan might be able to plot the stepsize used during each warmup sample, but you have to wait until after sampling has completed.
yes, most definitely the BYM2 parameterization is what you want to handle combined heterogeneous and spatial random effects.
sorry I didn’t see this message sooner.