Will stanc3 be able to identify mismatches between data constraints and distribution?

jonah · September 24, 2019, 4:50pm

@seantalts @Matthijs @enetsee @rybern (not sure who to tag specifically!) With the current/old stanc the following Stan program won’t have any issues (until runtime) even though a human can see that the bounds on y are incompatible with the bernoulli distribution used in the model block:

data {
  int N;
  int<lower=2,upper=3> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  y ~ bernoulli(theta);
}

At runtime we get:

Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: bernoulli_lpmf: n[1] is 2, but must be in the interval [0, 1]

I’m curious if it would be possible for stanc3 to identify that the constraints on y in the data block (lower=2,upper=3) make it incompatible with the bernoulli distribution (which requires 0/1 values) before runtime? This isn’t the biggest deal, but would be nice. (Sorry if this had already been discussed somewhere I missed)

Edit: I meant to say that this occurred to me because of the problem reported by @erfpak7 in Initialization failed in Pystan

rybern · September 24, 2019, 6:04pm

It looks like we’re not currently doing this, but I don’t think it would be very difficult. It would add a little complexity though.
Disclaimer: I didn’t write the semantic checking code (which would be the part of the compiler responsible for something like this), so I could be wrong about how it works.

We would need to:

Add optional bound information to the Stan math library signatures, e.g. Bernoulli here:
https://github.com/stan-dev/stanc3/blob/e0ee22ebe693435e4805982c2ce024f1f2636f31/src/middle/Stan_math_signatures.ml#L135
Keep track of declared variable bounds to the symbol table, here:
https://github.com/stan-dev/stanc3/blob/master/src/frontend/Semantic_check.ml#L1385
Retrieve the bounds when semantic-checking a variable, here:
https://github.com/stan-dev/stanc3/blob/800ee7d5adc98803ec37515ccb91c9a221fe42b5/src/frontend/Semantic_check.ml#L407
Check that the bounds are consistent during argument checking, here:
https://github.com/stan-dev/stanc3/blob/3c2bbc5aa8969f1104af7a807c05d5f521418990/src/middle/Type_conversion.ml#L9

Curious to hear from folks more familiar with the checker.

Matthijs · September 25, 2019, 8:09am

In a way, the problem here is just that the type for the bernoulli distribution is wrong. It should be a distribution over Booleans. Similarly, there should probably be types for finite enumerations in case of categorical distributions. At least, that’d be one solution.

Another solution would be to do something like what Ryan is suggesting here. However, that would be hard to make work in general. In principle, the bounds that are specified might depend on runtime information, so might not be able to be evaluated at compile time.

Bob_Carpenter · September 25, 2019, 1:15pm

+1 to that. I’m not sure how to do it, though. As is, I just lazily followed the C approach of defining false to be 0 and true to be anything else.

That would help in some situations with not having to code everything as 1:K. But what people really want is something like names rather than numbers.

There are bunches of subtypes of numbers we could consider:

bool : int<lower = 0, upper = 1>
count: int<lower = 0>
non-negative: real<lower = 0>

As is, everything just compiles down to simple types. So even if a variable is declared as real<lower = 0>, at runtime it’s just a double (or stan::math::var).

Not in general because Stan’s function language is Turing complete, so it’s technically an undecidable problem. Nevertheless, that shouldn’t be much of an obstacle in practice. As @rybern points out, we can go a long way with heuristics.

It might be possible heuristically like the rest of the “pedantic mode” features that @andrewgelman has requested. That is, we can check if the data bounds are resolvable statically. Compare these three cases:

Statically resolvable

Here, the bounds are constants defined in the program, so they can be evaluated at compile time.

data {
   int<lower = 0> N;
   int<lower = 1, upper = 2> y[N];

Run-time resolvable

But if we have this,

data {
  int<lower = 0> N;
  int<lower = 0> a;
  int<lower = a> b;
  int<lower = a, upper = b> y[N];

then we can’t say anyting until we get the data.

Undecidable

The potentially undecidably hard case is:

data {
  int<lower = 0> N;
  int<lower = foo(1.7), upper = 1> y[N];

Now the lower bound only involves data, but we can’t in general even determine if foo(1.7) will terminate, much less when it’ll terminate and what the value will be. But we won’t be able to do static evaluation on the program as that’s what’s undecidable—we have to use the heuristic of running with a time limit.

jonah · September 25, 2019, 3:09pm

Thanks for all those details!

enetsee · September 26, 2019, 10:42am

I did this in another PPL (written as a library for F#, a functional language for .NET) by defining bounded discrete distributions as a push-forward of a distribution on integers through an arbitrary function. To do this in Stan we would have to introduce enumerations and parametric polymorphism, I think.

ariddell · December 15, 2019, 4:23pm

Just wondering this myself. Almost exactly the same situation. I had forgotten to declare the bounds on p, used in p ~ beta(alpha, beta). The resulting error was unhelpful. I think it was a domain error.

Bob_Carpenter · December 19, 2019, 11:46pm

That should cause a domain error in beta_lpdf whenever alpha or beta are negative. What were you expecting to happen?

Any of this kind of thing is semantic and will thus go into what’s known as a linter. @andrewgelman has dubbed the Stan linter “pedantic mode”; the link is to a wish-list.

ariddell · January 28, 2020, 11:40am

I was expecting a compile-time error that told me I forgot to put bounds on p.

mcol · January 28, 2020, 12:14pm

Thanks to @rybern there’s a PR open that implements some of that already: https://github.com/stan-dev/stanc3/pull/434

Topic		Replies	Views
Fitting a basic uniform distribution -- what am I missing? Modeling fitting-issues	11	868	December 7, 2020
Initialization failed for simple multivariate binomial model Modeling	2	322	August 29, 2019
Pystan runtime error: initialization failed Modeling pystan , initialization	2	1398	May 26, 2022
Debug help Modeling	2	794	November 29, 2017
Issues in my stan code Modeling rstan , fitting-issues	8	555	January 26, 2021

Will stanc3 be able to identify mismatches between data constraints and distribution?

Statically resolvable

Run-time resolvable

Undecidable

Related topics