'Pedantic Mode' is ready for your testing and feedback

I’ve built a prototype of ‘Pedantic Mode’, a tool for finding issues with Stan programs. It’s based on some of the ideas from a previous discussion here. It’s an early version, but it’s ready for your testing and feedback (and bug reports)!

What is Pedantic Mode?

When you compile a Stan program with Pedantic Mode turned on, it will search through your program for potential issues and point them out to you.

For example, if you compile the following program with Pedantic Mode:

data {
  int N;
  real x[N];
}
parameters {
  real sigma;
}
model {
  real mu;
  x ~ normal(mu, sigma);
}

It will spit out:

Warning:
  The parameter sigma has 0 priors.
Warning at line 10, column 13 to column 15:
  The variable mu may not have been initialized before its use.
Warning at line 10, column 17 to column 22:
  A normal distribution is given parameter sigma as a scale parameter
  (argument 2), but sigma was not constrained to be strictly positive.

Programmers might recognize this as a linter or -Wall. Pedantic mode aims to be a linter for statistical as well as programming issues.

Here are the kinds of issues that Pedantic Mode can currently look for:

  • Distribution arguments don’t match the distribution specification
  • Some distribution is used in an inadvisable way (e.g. uniform distributions)
  • Very large or very small constants are used as distribution arguments
  • Branching control flow (like if/else) depends on a parameter value, potentially introducing discontinuity
  • A parameter is defined but doesn’t contribute to target
  • A parameter is on the left-hand side of multiple twiddles
  • A parameter has more than one prior distribution
  • A parameter is assigned questionable bounds
  • A variable is used before being assigned a value
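
For example, the following contrived sketch (made up purely for illustration) would be expected to trip several of these checks at once, including the questionable-bounds, multiple-twiddle, multiple-prior, and large-constant checks:

data {
  int<lower=0> N;
  real y[N];
}
parameters {
  real<lower=0, upper=1> theta;  // hard interval bounds on a parameter
  real<lower=0> sigma;
}
model {
  theta ~ normal(0.5, 0.5);
  theta ~ normal(0, 10000);      // second twiddle on theta, with a very large constant
  y ~ normal(theta, sigma);      // sigma never gets a prior
}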

Here are some known limitations:

  • Indexed variables are not handled intelligently, so they’re treated conservatively (erring toward no warnings)
  • Data variables used as distribution arguments or variates are not currently checked against distribution specifications
  • Sometimes it’s impossible to know a variable’s value, like a distribution argument, before the program is run

There’s some more detailed information here (I’m working on better docs!)

How can you get it?

Pedantic mode is not currently included in the main version of the compiler, but it isn’t hard to get the test version thanks to @rok_cesnovar, @serban-nicusor and @mitzimorris.

Option 1: CmdStanR (easiest)

You can install the test version of CmdStan using CmdStanR with this command:
install_cmdstan(repo_clone = TRUE, repo_branch = "pedantic_mode", cores = 4)
This might take a few minutes.

Alternatively, you could install the Pedantic Mode branch of CmdStan manually and use it with CmdStanR. See Option 2, then use set_cmdstan_path() within CmdStanR.

To confirm that it’s installed correctly, try running:

library("cmdstanr")

#use the example model with an unused sigma
file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan")

mod <- cmdstan_model(file, force_recompile = TRUE)

Since the “bernoulli.stan” test file has an artificial issue in it, you should see:

Compiling Stan program...
Warning:
  The parameter sigma was declared but does not participate in the model.

Option 2: CmdStan

You can download the Pedantic Mode enabled CmdStan from GitHub on the pedantic_mode branch, here.
Pedantic Mode should be enabled by default with this version.

Option 3: Stanc3

You can find the development version of Stanc3 with Pedantic Mode here.
In order to build Stanc3, you will have to set up the appropriate OCaml environment, so this option is not as easy.
Use the --warn-pedantic flag to enable Pedantic Mode.

How can you help?

This tool will certainly have issues and bugs, so all eyes on it will help! You could run your existing models with pedantic mode turned on, or see if it helps you when you develop new ones.

I would appreciate all feedback and bug reports, such as:

  • How could the warning messages be more helpful?
  • Is it warning you about something that’s actually fine?
  • Is there something else you think it should warn you about?
  • Is it saying something totally wrong?

Please write your comments here or on the GitHub pull request here.

Let me know what issues you run into. Thank you!

@andrewgelman @bbbales2 @Bob_Carpenter @Matthijs
Thanks again to @rok_cesnovar, @serban-nicusor and @mitzimorris for help setting up the CmdStan branch


Cool! I’ll try it once Jonah etc can get cmdstanR working on my machine…

Was just working on a model and did:

real<lower = 0.0, upper = 0.0> f;

(lower and upper were the same thing). I think this should throw a warning.

Also:

real<lower = 0.0, upper = -1.0> f;

Should probably just be an error regardless of this.


The parameter f was declared but does not participate in the model.

“The parameter f was declared but is not used in the model.”


The parameter sigma has 0 priors.

“The parameter sigma has no prior.” But this also works if you give sigma two priors, so I dunno, maybe just keeping it simple is better.


Warning:
  The parameter sigma has 2 priors.
Warning at '/tmp/RtmpMOiqOH/model-3e237eb0f24e.stan', line 11, column 2 to column 23:
  The parameter sigma is on the left-hand side of more than one twiddle
  statement.

Do these need to be separate?


For the code:

functions {
  real func(real b) {
    if(b > 0.0) {
      return(1.0);
    } else {
      return(0.0);
    }
  }
}
data {
  int N;
  real x[N];
}
parameters {
  real<lower = 0.0> sigma;
  real<lower = 0.0> f;
}
model {
  real mu;
  x ~ normal(mu, func(sigma));
}

I don’t get a warning about the if statement on b (this comes up in ODE stuff a lot).


The model here has a few things going on.

Here’s a copy-paste model:

data {
  int<lower=0> N;      // number of data items
  int<lower=0> K;      // number of predictors
  int<lower=0> J;      // number  of classrooms
  int<lower=0> S;      // number  of schools
  vector[N] y;         // outcome vector
  vector[K] x[N];      // predictor matrix
  int classroom[N];    // numeric classroom identifier
  int school[N];       // numeric school identifier
}

parameters {
  real alpha;                   // intercept
  vector[K] beta;               // coefficients for predictors
  real<lower=0> sigma;          // error sd
  vector[J] b_raw;              // classroom random effects
  real<lower=0> tau;            // classroom sd
  vector[S] c_raw;              // school random effects
  real<lower=0> nu;             // school sd
}

transformed parameters {
  vector[J] b = b_raw * tau;
  vector[S] c = c_raw * nu;
}

model {
  real yHat[N];
  for(i in 1:N){
    yHat[i] = alpha + dot_product(x[i], beta) + b[classroom[i]] + c[school[i]];
    tau ~ normal(0,1);
    nu ~ normal(0,1);
    sigma ~ normal(0,1);
  }
  y ~ normal(yHat, sigma);       // likelihood
  b_raw ~ normal(0, 1);
  c_raw ~ normal(0, 1);
  beta ~ normal(0,1); 
}

Warnings are:

Compiling Stan program...
Warning:
  The parameter alpha has 0 priors.
Warning:
  The parameter b has 0 priors.
Warning:
  The parameter c has 0 priors.
Warning:
  The parameter nu has 0 priors.
Warning:
  The parameter sigma has 0 priors.
Warning:
  The parameter tau has 0 priors.
Warning at '/tmp/RtmpMOiqOH/model-3e233ba4a84f.stan', line 37, column 13 to column 17:
  The variable yHat may not have been initialized before its use.

tau, nu, and sigma all have priors if N > 1. I guess we’d need to assume the loops run?

b and c are transformed parameters and so by default I don’t think we’d assume they need priors.


It is suggested to replace lkj_corr with lkj_corr_cholesky, the Cholesky
  factor variant. lkj_corr tends to run slower, consume more memory, and has
  higher risk of numerical errors.

I think the first sentence should be “Reparameterizing model to use lkj_corr_cholesky instead of lkj_corr can help the model run faster with less risk of numerical errors”. Somehow it should emphasize that the model might change.
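
For readers unfamiliar with the suggestion, the reparameterization being referred to looks roughly like the following sketch (variable names made up; K is the matrix dimension). Putting the lkj_corr_cholesky prior on the Cholesky factor implies the same LKJ prior on the correlation matrix:

data {
  int<lower=2> K;
}
parameters {
  cholesky_factor_corr[K] L_Omega;   // instead of corr_matrix[K] Omega
}
model {
  L_Omega ~ lkj_corr_cholesky(2);    // instead of Omega ~ lkj_corr(2)
}
generated quantities {
  // rebuild the full correlation matrix only if it is actually needed
  corr_matrix[K] Omega = multiply_lower_tri_self_transpose(L_Omega);
}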


A distribution argument 0.001 is less than 0.1 or more than 10 in
  magnitude. This suggests that you might have parameters in your model that
  have not been scale to roughly order 1. We suggest rescaling using a
  multiplier; see section 22.12 of the manual for an example.

I’d make the small/large thresholds smaller/larger at least. Maybe at least 0.01 and 100.


Looking at the model here (with some stuff commented out to avoid syntax errors):

functions {
  real smvn_lpdf(vector as, vector mus, matrix covmats, vector omega, real delta, 
                 matrix all_Xs, matrix all_X2, vector all_P) { 
    
    // Define additional local variables
    vector[256] IMP; 
    real IV; 
    real logK; 
    real logW;  
    real lprob;  
    
    IMP = 1/(2*delta)*all_X2*omega - 1/delta*all_P;
    IV = log_sum_exp(1/delta*all_Xs*as + IMP);
    logK = log_sum_exp(IMP + 1/delta*all_Xs*mus 
                       + 1/(2*delta^2)*diagonal(all_Xs*covmats*all_Xs));
    logW = IV - logK;
    
    // calculate the log likelihood value
    lprob = logW + multi_normal_lpdf(as | mus, covmats);
    
    // return the log likelihood value
    return lprob;
  }
}  

data {
  int<lower=1> J; 
  int<lower=1> R; 
  int<lower=1> S[R]; 
  int<lower=3> C; 
  int<lower=1> W; 
  int<lower=1> N; 
  vector<lower=1,upper=C>[N] Y;  
  matrix[N*3, J-1] Xs;  
  matrix[N*3, W] X2; 
  vector[N*3] P;  
  matrix[256, J-1] all_Xs; 
  matrix[256, W] all_X2; 
  vector[256] all_P; 
}

parameters {
  vector[J-1] mus;
  vector<lower=0>[J-1] sigmas;
  corr_matrix[J-1] Thetas;
  vector[W-1] omegas;
  real<lower = 0> delta;
}

transformed parameters {
  vector[J] omega = append_row(omegas, -sum(omegas));
  cov_matrix[J-1] covmats = quad_form_diag(Thetas, sigmas);  
}

model {
  int yr;
  int xr;
  yr = 1;
  xr = 1;
  
  // hyperpriors
  mus ~ normal(0, 1000);
  sigmas ~ cauchy(0, 2.5); 
  Thetas ~ lkj_corr(2);
  
  // priors
  omegas ~ normal(0, 1000);
  delta ~ gamma(0.001, 1000);
  
  // likelihood
  for (r in 1:R) { // for each individual
    int ypos;
    int xpos;
    vector[J-1] as;
    vector[J-1] alphas;

    ypos = yr;
    xpos = xr;
    as ~ multi_normal(mus, covmats);
    //target += smvn_lpdf(alphas | mus, covmats, omega, delta, all_Xs, all_X2, all_P);
    
    for (s in 1:S[r]) { // for each choice task
    //Y[ypos] ~ categorical_logit((block(Xs, xpos, 1, 2, 6)*alphas+block(X2, xpos, 1, 2, 27)*omega-1/delta*segment(P, xpos, C))'); 
    ypos = ypos + 1;
    xpos = xpos + C;
    }
    yr = yr + S[r];
    xr = xr + S[r]*C;
  }
}

For the snippet in the model block:

vector[J-1] as;
...
as ~ multi_normal(mus, covmats);

The warning is:

Warning at '/tmp/RtmpMOiqOH/model-3e231c50f7f3.stan', line 80, column 24 to column 30:
  The variable alphas may not have been initialized before its use.

I like that this is a warning (wasn’t caught in the original thread!) but maybe the message can be more specific?

There are also these warnings which I think aren’t right:

Warning:
  The parameter Thetas has 0 priors.
Warning:
  The parameter covmats has 0 priors.
Warning:
  The parameter sigmas has 0 priors.

Thetas and sigmas have priors, and covmats (a transformed parameter) shouldn’t have one.


Enough for now. Will probably try some more later. This is very handy! I found at least one bug in a model while searching the forums for examples (an unconstrained standard deviation).

I wonder about the aggressive warnings on variables initialized in loops though.

There were a lot of things where code like this:

real a[N];
for(i in 1:N) {
  a[i] = whatever(i);
}

threw warnings that a might not be initialized.

Edit: And there were lots of these. Maybe these warnings should be less aggressive?


Thanks Ben this is great!

These actually do show a warning for me; are you sure they don’t show up for you?

Fixed.

Fixed, for the special case of 0.

I am not sure if we need both, but they are in theory looking for two distinct things. The twiddle warning is just checking whether a parameter is in the variate position of multiple built-in distributions. The priors warning is counting the factors (more general than the LHS of twiddles) that are acting as priors. So, you could have either one without the other.
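
A contrived sketch of the difference (made up for illustration): theta below would trip the twiddle warning because it sits in the variate position twice, while sigma’s only prior arrives through a target += factor, which the twiddle check never sees but the prior count does:

parameters {
  real theta;
  real<lower=0> sigma;
}
model {
  theta ~ normal(0, 1);
  theta ~ normal(0, 1);                  // theta: two twiddles (and two prior factors)
  target += normal_lpdf(sigma | 0, 1);   // sigma: one prior factor, zero twiddles
}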

Yeah, interesting, I am not checking this right now, since b isn’t known to be a parameter until the function call. I think I could catch these cases with some more effort.

Yeah, nice example. We can often prove that loops run at least once, I should put some effort into that. Although strictly speaking, since N has lower=0, the loop really might not run, right?

This is something I hadn’t fully thought about. You’re right that parameters are really what needs priors, but I also need to check if priors are being provided via a transformed parameter.

“It is suggested to reparameterize your model to replace lkj_corr with lkj_corr_cholesky…”?

Honestly I have no opinion on this, happy to put in whatever folks agree on. The 0.1 and 10 came from Andrew’s wiki.

What other information do you think would be helpful here?

That seems to be a straight-up bug with Thetas and sigmas; I’ll check it out.

This should be handled correctly when I can identify these loops, but in the meantime I will probably just assume that loops run at least once for this particular analysis.

Thanks again!

Well I get:

Your Stan program has a parameter f with hard constraints in its
  declaration. Hard constraints are not recommended, for two reasons: (a)
  Except when there are logical or physical constraints, it is very unusual
  for you to be sure that a parameter will fall inside a specified range, and
  (b) The infinite gradient induced by a hard constraint can cause
  difficulties for Stan's sampling algorithm. As a consequence, we recommend
  soft constraints rather than hard constraints; for example, instead of
  constraining an elasticity parameter to fall between 0, and 1, leave it
  unconstrained and give it a normal(0.5,0.5) prior distribution.

But with:

real<lower = 0.0, upper = 0.0> f;

it should at least tell me that this implies f is the constant 0.0, and with:

real<lower = 0.0, upper = -1.0> f;

it should tell me the bounds don’t make sense (and it’d be fine if it told me the bounds don’t make sense for <lower = 0.0, upper = 0.0> too).

If I see ifs in functions in ODE land I go looking for this stuff. I think just for that it’d be safe to go aggressive here (and assume function arguments are parameters).

We’re being pedantic after all :D.

Yeah, I dunno how much you can analyze this statically or not. But if N = 0 then yHat is size zero too, so then it’s okay.

When someone writes:

vector[J-1] as;
as ~ multi_normal(mus, covmats);

in the model block I assume they’re thinking that multi_normal is going to randomly draw from a multivariate normal (instead of incrementing the log density). So the error message would be something like, “as ~ multi_normal(mus, covmats) does not draw a multivariate normal random variable and assign it to as. It evaluates the log likelihood of a multi_normal distribution at as” or something like that.
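
For reference, the sampling statement really is just shorthand for incrementing the target log density (dropping constant terms); neither form draws a value for as:

// inside the model block
as ~ multi_normal(mus, covmats);
// has the same effect, up to an additive constant, as:
// target += multi_normal_lpdf(as | mus, covmats);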

Hi Andrew,

you should be able to test pedantic mode using CmdStanR by running the following command in R:

library(cmdstanr)
install_cmdstan(cores=4, overwrite=TRUE, repo_clone=TRUE, repo_branch="pedantic_mode")

good luck!

(note: the original instructions neglected to include library(cmdstanr). Mea culpa.)


BenB:

If you define a parameter like this:
parameters {
  real<lower = 0.0, upper = 0.0> f;
}
will it then reclassify it as a constant during compilation? That’s kinda weird; I agree that pedantic mode should flag this: not as a “hard constraint” but rather as a mistake where the upper and lower bounds are identical.

When the lower bound is higher than the upper bound, I’m surprised that Stan doesn’t already return an error! How could a Stan model even run if the lower bound is higher than the upper bound?

OK. I’ll have to play around with this pedantic mode myself now . . .

@andrewgelman wrote in 2 problems with cmdstanr: one model gives the wrong answer, the other gives an error in compilation - #6 by rok_cesnovar
Posting here so it doesn’t get lost.

data {
  int<lower=0> J;         // number of schools 
  real y[J];              // estimated treatment effects
  real<lower=0> sigma[J]; // standard error of effect estimates 
}
parameters {
  real mu;                // population treatment effect
  real<lower=0> tau;      // standard deviation in treatment effects
  vector[J] eta;          // unscaled deviation from mu by school
}
transformed parameters {
  vector[J] theta = mu + tau * eta;        // school treatment effects
}
model {
  target += normal_lpdf(eta | 0, 1);       // prior log-density
  target += normal_lpdf(y | theta, sigma); // log-likelihood
} 

Thanks @andrewgelman and @rok_cesnovar. Ben also pointed this out, I have a solution but haven’t pushed my implementation yet.


I’ve pushed some improvements. If you run install_cmdstan(repo_clone = TRUE, repo_branch = "pedantic_mode", cores = 4, overwrite = TRUE), it should pull the latest version.

Things I’ve updated:

  • Transformed parameters are no longer flagged for not having priors, but priors can still be applied to parameters via transformed parameters.
  • Control flow within function definitions is checked for parameter dependence.
    Ben’s example above now produces:
    Warning at 'fundef_cf2.stan', line 3, column 4 to line 7, column 5:
    A control flow statement inside function func depends on argument b. At
    'fundef_cf2.stan', line 20, column 22 to column 27, the value of b depends on parameter(s): sigma.
    This is currently only one level deep - it won’t (yet) find cases within two or more levels of function calls. This one had a pretty involved implementation.
  • Added a separate warning for bounds that don’t make sense (upper <= lower).
    This could probably be an error instead.
  • Loops are assumed to run at least once for the purpose of finding uninitialized variables.
  • Data variables with int type are ignored for the purpose of identifying priors.
    There were a number of issues where priors were marked as data-dependent because they depended on some data size variable N. Ideally, these size variables would be annotated somehow, but I think this workaround will work for most cases.
  • Updated wording on a number of warning messages.

Thanks for your feedback @bbbales2 and @andrewgelman! Looking forward to hearing more.


I had a model where I typed:

mu[j] ~ normal(alpha + building_data * zeta, sigma_mu);

And mu[j] was a real, but alpha + building_data * zeta here is a vector.

The correct model was:

mu[j] ~ normal(alpha + building_data[j] * zeta, sigma_mu);

Is this detectable, perhaps? It would be unusual to do the first, I think.


As part of working on footnotes for the User’s Guide, I went and built a line-based heuristic linter with regular expressions. I am implementing the style guide from the User’s Guide.

I don’t know if you want to deal with a linter’s level of processing, you are more focused on semantics, but if you see any value in attending to white space placement etc… then feel free to raid it.

Repo at: https://github.com/breckbaldwin/stan_linter

Breck


@breckbaldwin This is cool, I would be in favor of adding this to the compiler, maybe in its own --lint flag? As you mentioned, I do think of pedantic mode as being more on the semantic level than lexical level.

@bbbales2 Sorry I missed this -

I’m actually surprised this isn’t a type error, maybe I’m not understanding correctly. Shouldn’t there not be a Stan Math signature for the first example?

Well, it’s supported in the C++ because we support all appropriately sized combos of vectors/arrays/whatever.

So the case here is:

normal_lpdf(real | vector, real)

It’s a weird one. I don’t know of a use for it but it’s how we’ve defined the interface so far. It’s possible there’s a historical reason @Bob_Carpenter might know about.

We originally allowed this because the math is no different if the y and mu were swapped. There was no reason to restrict the language because we couldn’t see an immediate use for the model to be structured in a particular way. I’m glad we took that approach because it’s resulted in a lot of creative models that were enabled due to the flexibility of the language.

For this particular example, it would come up if you had a mixture model and each mu was the parameter of a mixture component.

Wouldn’t that lpdf return $\prod_i\left[N(y \mid \mu_i,\sigma)\right]$, whereas a mixture model needs $\sum_i\left[\phi_i N(y \mid \mu_i,\sigma)\right]$?

:). Good catch. Yeah, definitely not for a mixture then.

I’d have to think harder for a good use other than maybe trying to do model selection in a vectorized way (which may also have problems).

No matter what, these were allowed so the user could be flexible in how they specified their model through the language.
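
For contrast with the vectorized lpdf above, a K-component normal mixture is usually written by log-sum-exping the per-component log densities plus log weights. A standard sketch (following the mixture pattern in the User’s Guide, with made-up variable names):

data {
  int<lower=1> N;
  int<lower=2> K;
  real y[N];
}
parameters {
  simplex[K] phi;        // mixture weights
  ordered[K] mu;         // component locations
  real<lower=0> sigma;   // shared scale
}
model {
  mu ~ normal(0, 5);
  sigma ~ normal(0, 1);
  for (n in 1:N) {
    vector[K] lps = log(phi);
    for (k in 1:K)
      lps[k] += normal_lpdf(y[n] | mu[k], sigma);
    target += log_sum_exp(lps);   // log of sum_k phi[k] * N(y[n] | mu[k], sigma)
  }
}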

Yeah on this one I don’t see a use, but I lean leaving it. I would like pedantic mode to throw warnings though.

This looks really great. Some suggestions on phrasing.

  1. Not used in the model → Not used in the program

    We want to be careful to talk about Stan programs, each of which implements a model. There’s more than one program for any model.

  2. “columns a–b” would be easier to read than “column a to column b.”

  3. “alpha has 0 priors” → “alpha does not have a prior”

    How likely are false positives on this if we transform alpha and give it a prior?

  4. “A distribution argument 0.001 is less than 0.1 or more than 10 in
    magnitude. This suggests that you might have parameters in your model that
    have not been scale to roughly order 1. We suggest rescaling using a
    multiplier; see section 22.12 of the manual for an example.”
    → Argument 0.001 suggests there may be parameters that are not unit scale;
    consider rescaling with a multiplier (see the sketch after this list)

    The bounds of 0.1 and 10 are suggestive of being outside unit scale, which is why they were chosen.

    Now the problem here is that only unconstrained parameters may be rescaled with a multiplier. And if those arguments are to distributions on data, then they may not be saying anything about parameter scales.

  5. “The variable alphas may not have been initialized before its use.” → “Use of variable before it is defined: alphas”

    I don’t like “initialized” because that’s not the only way to set a variable. Plus, they will be initialized to NaN in most cases. The correct term is “defined” (as opposed to “declared”). This may confuse users, though, who are going to think declaring a variable defines it. Also, is there a reason this says “may”? Is it trying to cover for conditionals not being executed?

  6. “a parameter f with hard constraints” → “a parameter f with a lower or upper bound”

    I don’t think “hard constraint” is informative enough. I also don’t see where an infinite gradient comes from—we either log or inverse-logit transform. Also, elasticity (change in demand / change in price) in econ is usually negative so it’s not a good example for a (0, 1) constraint.

    How are false positives handled here? For example, correlation parameters are usually constrained in (-1, 1) and many parameters will be constrained with lower bounds of 0.
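
Regarding the multiplier suggestion in point 4, the rescaling refers to Stan’s affine-transform syntax, which is only available on otherwise unconstrained parameters; a minimal sketch with a made-up parameter:

parameters {
  // theta is expected to be on the order of 100; the multiplier keeps the
  // unconstrained value the sampler actually works with at roughly unit scale
  real<multiplier=100> theta;
}
model {
  theta ~ normal(0, 100);
}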

tau, nu, and sigma all have priors if N > 1. I guess we’d need to assume the loops run?

Yup. In general, this problem is undecidable.

“Reparameterizing model to use lkj_corr_cholesky instead of lkj_corr can help the model run faster with less risk of numerical errors”. Somehow it should emphasize that the model might change.

The model won’t change if you give both of them the same LKJ prior.

It’ll throw a runtime error. It won’t catch it during compile time.

If we’re talking about things like spacing, indentation, etc., maybe we could write a pretty printer. That’s been super useful in our C++ development.

It’s just for symmetry. Any argument can be a scalar or container under the constraint that every container has to be the same size. Anything else would be a nightmare to document. The real problem we get is this:

parameters {
  vector[N] alpha;
}
...
model {
  for (n in 1:N)
    alpha ~ normal(mu, sigma);  // OOPS!  Should be: alpha[n]
...

It’s really just for symmetry. We also allow normal_lpdf(real | real, vector) where there’s no symmetry.

It’s so much easier to doc a general pattern than to have each distribution configured for expected uses. This is also why we have three one-dimensional container types, real[], vector, and row_vector. We really don’t need real[] for anything, but it’s much easier to just say “for any type T, T[] is the type of arrays of T.”
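
For instance, the three one-dimensional container types mentioned here are declared like this (a trivial sketch):

data {
  int<lower=0> N;
  real a[N];         // array of reals
  vector[N] v;       // (column) vector
  row_vector[N] r;   // row vector
}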
