Looking for a Stan program critique


Hi there! :wave:

I’m new to Stan and could use a bit of feedback on this model. Is there anything big I’m overlooking? Does anything stand out to you?

data {
  int<lower=1> J; // number of apps
  int<lower=3> L; // number of years
  int<lower=1> N; // number of observations (rows)
  int<lower=1> K; // number of features (columns)
  int<lower=0> y[N]; // dependent variable, indexed by row
  matrix[N, K] X; // trend covariates
  matrix[N, K] X2; // action covariates
  matrix[N, L] X3; // year covariates
  int<lower=1, upper=J> IDs[N]; // app index for each observation
}
parameters {
  real mu_alpha;
  real<lower=0> tau_alpha; // hyper parameter for alpha variance
  // I split out the intercept from the other parameters in the design matrix
  vector[J] alpha; // the intercepts indexed by J apps
  vector[K] beta; // a feature parameter K in length
  vector[L] year; // a feature parameter L years in length
  real mu_year;
  real<lower=0> tau_year;
  vector<lower=0>[K] action_slope; // a feature parameter K in length
  vector<lower=0>[J] theta;
  real<lower=0> mu_theta;
  real<lower=0> tau_theta;
  real mu_action_slope;
  real<lower=0> tau_action_slope;
  real mu_beta;
  real<lower=0> tau_beta;
}
model {
  mu_year ~ normal(0.3, 0.5);
  tau_year ~ normal(0.8, 0.2);
  year ~ normal(mu_year, tau_year);

  mu_theta ~ normal(1, 5);
  tau_theta ~ normal(3.6, 0.25);
  theta ~ normal(mu_theta, tau_theta);

  mu_alpha ~ normal(1, 2);
  tau_alpha ~ normal(2.3, 0.5);
  alpha ~ normal(mu_alpha, tau_alpha);

  mu_beta ~ normal(0.07, 0.03);
  tau_beta ~ normal(0.07, 0.04);
  for(k in 1:K) {
    beta[k] ~ normal(mu_beta, tau_beta);
  }

  mu_action_slope ~ normal(0, 0.1);
  tau_action_slope ~ normal(0.12, 0.8);
  for(k in 1:K) {
    action_slope[k] ~ normal(mu_action_slope, tau_action_slope);
  }

  y ~ neg_binomial_2_log(alpha[IDs] + X * beta + X2 * action_slope + X3 * year, theta[IDs]);
}
generated quantities {
  real predicted_tasks[N]; // expected counts given the observed actions
  real marginal_tasks[N]; // expected counts with one extra action where any occurred
  matrix[N, K] X2_extra_actions; // action covariates incremented by one where nonzero
  for(n in 1:N) {
    for(k in 1:K) {
      if(X2[n, k] > 0) {
        X2_extra_actions[n, k] = X2[n, k] + 1;
      } else {
        X2_extra_actions[n, k] = 0;
      }
    }
    // exp of the linear predictor gives the negative binomial mean
    predicted_tasks[n] = exp(alpha[IDs[n]] + X[n] * beta + X2[n] * action_slope + X3[n] * year);
    marginal_tasks[n] = exp(alpha[IDs[n]] + X[n] * beta + X2_extra_actions[n] * action_slope + X3[n] * year);
  }
}

A few things I’ve learned so far:

  • Shorten iter to 1200 and set chains = 1 when prototyping
  • Divergences can be caused by nonsensical priors
  • Divergences can also occur when you need a hyperparameter but have hard-coded a parameter instead
  • Priors with shared hyperparameters can be written in a for loop over a vector’s elements
  • Use use_cache = FALSE together with pars to avoid really slow waits when printing via summary
  • If there are divergences, look for the parameters with low n_eff and Rhat > 1, then focus on their priors; also consider whether a hyperparameter is needed.
  • bayesplot::ppc_ecdf_overlay is also a good way to find gross model errors that likely lead to divergences.
  • One of the most important things to do is “center” or “scale” your data. Basically, center and scale wherever you can so that the posterior is as close to normal(0, 1) as possible. You want the posterior to be easy to explore, and you don’t want to be estimating parameters on extreme scales, like beta(1, 1e6). (A sketch of in-program standardization follows this list.)
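
For the last point, one way to do the standardization inside the program itself is a transformed data block along these lines (just a sketch, not part of my model above; it assumes the columns of X are continuous):

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
}
transformed data {
  matrix[N, K] X_std; // column-standardized copy of X
  for (k in 1:K) {
    X_std[, k] = (col(X, k) - mean(col(X, k))) / sd(col(X, k));
  }
}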

If I’ve got any of the above wrong corrections would be appreciated. Thank you.

1 Like

Those two divergence causes you list are both instances of model misspecification.

It’s the parameters that should be as close to standard normal as possible in the posterior. You can sometimes achieve that by standardizing predictors and outcomes in a regression. You are not doing that in your model, which contains non-unit-scale statements like this

mu_beta ~ normal(0.07, 0.03);

To center and scale that, you’d write

parameters {
  real mu_beta_raw;
  ...
transformed parameters {
  real mu_beta = 0.07 + 0.03 * mu_beta_raw;
  ...
model {
  mu_beta_raw ~ normal(0, 1);
  ...

You also want to vectorize those sampling statements, e.g.,

beta ~ normal(mu_beta, tau_beta);

without the loop.

I take it alpha[id] is an exposure term. If you have data evolving over years, it can be more useful to put a time-series prior on the year effects than to regress on year; the regression assumes the time effect is monotonic.
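
For example, a first-order random-walk prior on the year effects might look something like this (just a sketch; sigma_year is a name I’m introducing here, not something in your model):

data {
  int<lower=2> L; // number of years
}
parameters {
  vector[L] year; // year effects
  real<lower=0> sigma_year; // innovation scale for the random walk
}
model {
  year[1] ~ normal(0, 1);
  // each year's effect is centered on the previous year's
  year[2:L] ~ normal(year[1:(L - 1)], sigma_year);
  sigma_year ~ normal(0, 0.5); // half-normal given the lower-bound constraint
}

That lets neighboring years be similar without forcing the trend to move in one direction.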

1 Like

Thank you Bob. These are all very helpful clues and give me new avenues to learn / improve the model.

Welcome on board! Great start with your model. I agree with Bob’s tips.

On the question of scaling or standardizing, I prefer the former since scaling by a constant doesn’t assume the mean and sd are known a priori.

In rstanarm the default is to do some rescaling of the prior standard deviations for regression coefficients based on observed standard deviations, but that’s just so we can make the default priors not terrible.

It’s basically an approximation to making the model hierarchical and learning the values from the data instead of assuming them. That can work well sometimes, but just be aware of the assumptions.

Hope that helps!

2 Likes

Neat—hadn’t thought of it that way before.

2 Likes

In case it’s of interest, here’s a thread that has some other views on related topics: Standardizing predictors and outputs in a hierarchical model

2 Likes

Thank you both! I’d say that, so far, parameter centering and choosing priors have been the hardest parts for me to figure out.

This final comment is really interesting: Standardizing predictors and outputs in a hierarchical model

Okay, so my understanding is that my original model with the non-centered parameters, given no divergences, would yield appropriate inference as well, with the main difference being computational time?

I ask because when I switched to the centered version I got a cryptic communication error with the sockets. I plan to debug next week. It smells like I blew out memory or something (need to check). If I could trust the unscaled version, I’d have a nice fallback.

So strange that I’m hitting this error when I scale up the dataset using the transformed-parameters centering trick (it’s still fairly small, N ~ 15k),

On Ubuntu 14.04, I get Error in unserialize(socklist[[n]]) : error reading from connection, which sounds related to this thread: Unserialize error

I wasn’t running into the same issue before centering the parameters.

I was able to verify that the serialization error above is due to running out of memory. I used the free command to log memory usage, and sure enough I ran out,

nohup watch -n 1 'free -mh' > mem.log &

I noticed in another thread that either Bob or Ben said rstan allocates memory up front while CmdStan does not (it streams?). I think I’ll give that a shot. I also might give the parallel::mclapply and sampling() route a shot: https://github.com/stan-dev/rstan/issues/243#issuecomment-164560369

You can also stream out in R. What I’ve never understood is whether you can turn off saving in memory.

You can also do fewer draws (rarely does anyone need an effective sample size of more than a couple hundred) or save fewer parameters.

I don’t think that’s true. I tried the sample_file arg to stan() but memory still ballooned.

Since standardizing in transformed parameters adds a significant amount of overhead, and most of my priors are centered between 0 and 5 with standard deviations around 2, does standardizing buy me much?

Could I move these back to the model block and specify non-centered priors with practically no loss, given that my scales aren’t too far out of whack here? I know that if I do this I lose the nice interpretability of the standardized parameters.

I think if I can safely do that it could let me fit the model on the full larger dataset.

As an aside, I tried out an AR(1) process. I specified a hierarchical rho parameter bounded between -1 and 1. I was surprised that when the model returned, n_eff had dropped significantly for the rest of the covariates. I’m not yet sure how to diagnose this type of issue, but it’s one I’m going to read up on.
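
For reference, the AR(1) piece was along these lines (a simplified sketch with illustrative names, not my exact code):

data {
  int<lower=2> L; // number of years
}
parameters {
  vector[L] year; // year effects
  real<lower=-1, upper=1> rho; // AR(1) coefficient, hard-bounded
  real<lower=0> sigma_year; // innovation scale
}
model {
  // rho gets an implicit uniform(-1, 1) prior from its bounds
  year[1] ~ normal(0, sigma_year / sqrt(1 - square(rho))); // stationary marginal
  year[2:L] ~ normal(rho * year[1:(L - 1)], sigma_year);
  sigma_year ~ normal(0, 1);
}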

What if you use sample_file in combination with pars=""?

1 Like

Do it in the model block to have less overhead. You can do the same thing in transformed parameters or in the model block; the only difference is whether the values are stored or not.
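
Concretely, reusing Bob’s mu_beta example from earlier, the model-block version would be along these lines (a sketch, with the surrounding declarations elided as before):

parameters {
  real mu_beta_raw;
  ...
model {
  // local variable: recomputed while sampling but never stored in the output
  real mu_beta = 0.07 + 0.03 * mu_beta_raw;
  mu_beta_raw ~ normal(0, 1);
  beta ~ normal(mu_beta, tau_beta);
  ...

The sampling behavior is identical either way; transformed parameters just writes the value into the saved draws.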

1 Like

When you don’t have the constraint, is there any probability mass past the boundaries? If so, the hard boundary will likely introduce difficult geometry that drives down the overall n_eff per iteration. Stan samples a bounded parameter on an unconstrained scale (for example, a <lower=0> parameter is sampled as its log), so mass piling up near the boundary pushes the unconstrained value out toward plus or minus infinity, which is more difficult computationally to adapt the step size for. The constraint will also bias the fit compared to the unconstrained parameter: all the mass that would be beyond the boundary is now piled up at the boundary, so any expectation will be further away from the boundary.

1 Like

@jonah Using sample_file and pars = "", I get the error message,

no parameter ; sampling not done.

@avehtari’s advice was a big help! I moved the centering process to the model block, and it cut down the memory footprint a good bit, as expected.
