I’m new to Stan and could use a bit of feedback on this model. Is there anything big I’m overlooking? Does anything stand out to you?
data {
  int<lower=1> J;          // number of apps
  int<lower=3> L;          // number of years
  int<lower=1> N;          // number of observations (rows)
  int<lower=1> K;          // number of features (columns)
  int<lower=0> y[N];       // dependent variable, indexed by row
  matrix[N, K] X;          // trend covariates
  matrix[N, K] X2;         // action covariates
  matrix[N, L] X3;         // year covariates
  int IDs[N];              // apps
}
parameters {
  real mu_alpha;
  real<lower=0> tau_alpha;          // hyperparameter for alpha variance
  // I split out the intercept from other parameters in the design matrix
  vector[J] alpha;                  // intercepts indexed by J apps
  vector[K] beta;                   // feature parameters, K in length
  vector[L] year;                   // year parameters, L in length
  real mu_year;
  real<lower=0> tau_year;
  vector<lower=0>[K] action_slope;  // action parameters, K in length
  vector<lower=0>[J] theta;
  real<lower=0> mu_theta;
  real<lower=0> tau_theta;
  real mu_action_slope;
  real<lower=0> tau_action_slope;
  real mu_beta;
  real<lower=0> tau_beta;
}
model {
  mu_year ~ normal(0.3, 0.5);
  tau_year ~ normal(0.8, 0.2);
  year ~ normal(mu_year, tau_year);

  mu_theta ~ normal(1, 5);
  tau_theta ~ normal(3.6, 0.25);
  theta ~ normal(mu_theta, tau_theta);

  mu_alpha ~ normal(1, 2);
  tau_alpha ~ normal(2.3, 0.5);
  alpha ~ normal(mu_alpha, tau_alpha);

  mu_beta ~ normal(0.07, 0.03);
  tau_beta ~ normal(0.07, 0.04);
  for (k in 1:K) {
    beta[k] ~ normal(mu_beta, tau_beta);
  }

  mu_action_slope ~ normal(0, 0.1);
  tau_action_slope ~ normal(0.12, 0.8);
  for (k in 1:K) {
    action_slope[k] ~ normal(mu_action_slope, tau_action_slope);
  }

  y ~ neg_binomial_2_log(alpha[IDs] + X * beta + X2 * action_slope + X3 * year,
                         theta[IDs]);
}
generated quantities {
  real predicted_tasks[N];
  real marginal_tasks[N];
  matrix[N, K] X2_extra_actions;
  for (n in 1:N) {
    for (k in 1:K) {
      if (X2[n, k] > 0) {
        X2_extra_actions[n, k] = X2[n, k] + 1;
      } else {
        X2_extra_actions[n, k] = 0;
      }
    }
    predicted_tasks[n] = exp(alpha[IDs[n]] + X[n] * beta + X2[n] * action_slope + X3[n] * year);
    marginal_tasks[n] = exp(alpha[IDs[n]] + X[n] * beta + X2_extra_actions[n] * action_slope + X3[n] * year);
  }
}
A few things I’ve learned so far:
Shorten iter to 1200 and set chains = 1 when prototyping
Divergences can be caused by nonsensical priors
Divergences can occur when you need a hyperparameter but have instead hard-coded a parameter
Hyperparameters can be applied to the elements of a vector with a for loop
Use use_cache = FALSE together with pars to avoid really slow waits when printing via summary
If there are divergences, look for the parameters with low n_eff and Rhat > 1, then focus on their priors; also consider whether a hyperparameter is needed.
bayesplot::ppc_ecdf_overlay is also a good way to find gross model errors that likely lead to divergences.
One of the most important things to do is “center” or “scale” your data: do what you can, centering- and scaling-wise, so that the posterior is as close to normal(0, 1) as possible. You want the posterior to be easy to explore, and you don’t want to be estimating parameters on extreme scales, like beta(1, 1e6) for example.
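For what it’s worth, the centering/scaling can also live inside Stan itself, in a transformed data block, so the data file stays raw. A minimal sketch against a matrix like X above (X_std is a new name I’m introducing):

```stan
transformed data {
  matrix[N, K] X_std;  // standardized copy of the trend covariates
  for (k in 1:K) {
    real mu_k = mean(col(X, k));
    real sd_k = sd(col(X, k));
    for (n in 1:N)
      X_std[n, k] = (X[n, k] - mu_k) / sd_k;
  }
}
```

If you go this route, remember the coefficients are then on the standardized scale and need to be transformed back for interpretation.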
If I’ve got any of the above wrong, corrections would be appreciated. Thank you.
Those are both instances of model misspecification.
It’s the parameters that should be as close to standard normal as possible in the posterior. You can sometimes achieve that by standardizing predictors and outcomes in a regression. You are not doing that in your model, which contains non-unit-scale statements like this:
mu_beta ~ normal(0.07, 0.03);
To center and scale that, you’d write
parameters {
  real mu_beta_raw;
  ...
transformed parameters {
  real mu_beta = 0.07 + 0.03 * mu_beta_raw;
  ...
model {
  mu_beta_raw ~ normal(0, 1);
  ...
You also want to vectorize those sampling statements, e.g.,
beta ~ normal(mu_beta, tau_beta);
without the loop.
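Putting the non-centered transform and the vectorization together for beta, the per-coefficient loop might become something like this (a sketch; beta_raw is a new name):

```stan
parameters {
  vector[K] beta_raw;
  ...
transformed parameters {
  vector[K] beta = mu_beta + tau_beta * beta_raw;
  ...
model {
  beta_raw ~ normal(0, 1);  // implies beta ~ normal(mu_beta, tau_beta)
  ...
```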
I take it alpha[id] is an exposure term. If you have data evolving over years, it can be more useful to use a time-series prior than a regression on the year, since the regression assumes the time effect is monotonic.
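For example, a random-walk prior on the year effects (a sketch, assuming the columns of X3 are in chronological order) would replace the independent draws from mu_year:

```stan
year[1] ~ normal(mu_year, tau_year);
year[2:L] ~ normal(year[1:(L - 1)], tau_year);  // each year shrinks toward the previous one
```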
Welcome on board! Great start with your model. I agree with Bob’s tips.
On the question of scaling or standardizing, I prefer the former since scaling by a constant doesn’t assume the mean and sd are known a priori.
In rstanarm the default is to do some rescaling of the prior standard deviations for regression coefficients based on observed standard deviations, but that’s just so we can make the default priors not terrible.
It’s basically an approximation to making the model hierarchical and learning the values from the data instead of assuming them. That can work well sometimes, but just be aware of the assumptions.
Okay, so my understanding is that my original, unscaled model (centered parameterization), given no divergences, would yield appropriate inference as well, the main difference being computational time?
I ask because when I switched to the scaled, non-centered version I got a cryptic communication error with the sockets. I plan to debug next week; it smells like I blew out memory or something (need to check). If I could trust the unscaled version, I’d have a nice fallback.
I was able to verify that the serialization error above is due to being out of memory. I used the free command to log memory usage, and sure enough I ran out:
nohup watch -n 1 'free -mh' > mem.log &
I noticed in another thread that either Bob or Ben said rstan allocates memory up front while CmdStan does not (it streams?). I think I’ll give that a shot. I also might give the parallel::mclapply and sampling() route a shot: https://github.com/stan-dev/rstan/issues/243#issuecomment-164560369
I don’t think that’s true. I tried the sample_file argument to stan() but memory still ballooned.
Since standardizing in transformed parameters adds a significant amount of overhead, and most of my priors are centered between 0 and 5 with standard deviations of about 2, does standardizing buy me much?
Could I move these back to the model block and specify non-centered priors with practically no loss, given my scales aren’t too far out of whack here? I know that if I do this I lose the nice interpretability of the standardized parameters.
I think if I can safely do that it could let me fit the model on the full larger dataset.
As an aside, I tried out an AR(1) process. I specified a hierarchical rho parameter bounded between -1 and 1. I was surprised that when the model returned, n_eff had dropped significantly for the rest of the covariates. I’m not yet sure how to diagnose this type of issue, but it’s one I’m going to read up on.
Do it in the model block to have less overhead. You can do the same thing in transformed parameters or in the model block; the only difference is whether the values are stored or not.
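Concretely, the non-centered transform as a local variable in the model block looks like this (a sketch, reusing the _raw naming from above):

```stan
model {
  // local variable: recomputed each iteration, never written to output
  vector[K] beta = mu_beta + tau_beta * beta_raw;
  beta_raw ~ normal(0, 1);
  ...
}
```

One caveat: locals declared in the model block aren’t visible in generated quantities, so anything needed there has to be recomputed.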
When you don’t have the constraint, is there any probability mass past the boundaries? If so, the hard boundary will likely introduce difficult geometry that drives down the overall n_eff per iteration: a hard boundary pushes the unconstrained lower or upper bound to plus or minus infinity, which is more difficult to adapt the step size for. It will also bias the fit compared to the unconstrained parameter, because all the mass that would be beyond the boundary is now piled up at the boundary, so any expectation will be further away from the boundary.
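If the posterior for rho does stay well inside (-1, 1), one option along these lines (a sketch) is to drop the hard bounds and let a proper prior keep it in a sensible range:

```stan
parameters {
  real rho;  // no <lower=-1, upper=1> constraint
}
model {
  rho ~ normal(0, 0.5);  // soft regularization toward the stationary region
}
```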