Will stanc3 be able to identify mismatches between data constraints and distribution?

@seantalts @Matthijs @enetsee @rybern (not sure who to tag specifically!) With the current/old stanc the following Stan program won’t have any issues (until runtime) even though a human can see that the bounds on y are incompatible with the bernoulli distribution used in the model block:

data {
  int N;
  int<lower=2,upper=3> y[N];
}
parameters {
  real<lower=0,upper=1> theta;
}
model {
  y ~ bernoulli(theta);
}

At runtime we get:

Chain 1: Rejecting initial value:
Chain 1:   Error evaluating the log probability at the initial value.
Chain 1: Exception: bernoulli_lpmf: n[1] is 2, but must be in the interval [0, 1]

I’m curious if it would be possible for stanc3 to identify that the constraints on y in the data block (lower=2,upper=3) make it incompatible with the bernoulli distribution (which requires 0/1 values) before runtime? This isn’t the biggest deal, but would be nice. (Sorry if this had already been discussed somewhere I missed)

Edit: I meant to say that this occurred to me because of the problem reported by @erfpak7 in Initialization failed in Pystan

1 Like

It looks like we’re not currently doing this, but I don’t think it would be very difficult. It would add a little complexity though.
Disclaimer: I didn’t write the semantic checking code (which would be the part of the compiler responsible for something like this), so I could be wrong about how it works.

We would need to:

Curious to hear from folks more familiar with the checker.

2 Likes

In a way, the problem here is just that the type for the bernoulli distribution is wrong. It should be a distribution over Booleans. Similarly, there should probably be types for finite enumerations in case of categorical distributions. At least, that’d be one solution.

Another solution would be to do something like what Ryan is suggesting here. However, that would be hard to make work in general. In principle, the bounds that are specified might depend on runtime information, so might not be able to be evaluated at compile time.

1 Like

+1 to that. I’m not sure how to do it, though. As is, I just lazily followed the C approach of defining false to be 0 and true to be anything else.

That would help in some situations with not having to code everything as 1:K. But what people really want is something like names rather than numbers.

There are bunches of subtypes of numbers we could consider:

  • bool : int<lower = 0, upper = 1>
  • count: int<lower = 0>
  • non-negative: real<lower = 0>

As is, everything just compiles down to simple types. So even if a variable is declared as real<lower = 0>, at runtime it’s just a double (or stan::math::var).

Not in general because Stan’s function language is Turing complete, so it’s technically an undecidable problem. Nevertheless, that shouldn’t be much of an obstacle in practice. As @rybern points out, we can go a long way with heuristics.

It might be possible heuristically like the rest of the “pedantic mode” features that @andrewgelman has requested. That is, we can check if the data bounds are resolvable statically. Compare these three cases:

Statically resolvable

Here, the bounds are constants defined in the program, so they can be evaluated at compile time.

data {
   int<lower = 0> N;
   int<lower = 1, upper = 2> y[N];

Run-time resolvable

But if we have this,

data {
  int<lower = 0> N;
  int<lower = 0> a;
  int<lower = a> b;
  int<lower = a, upper = b> y[N];

then we can’t say anyting until we get the data.

Undecidable

The potentially undecidably hard case is:

data {
  int<lower = 0> N;
  int<lower = foo(1.7), upper = 1> y[N];

Now the lower bound only involves data, but we can’t in general even determine if foo(1.7) will terminate, much less when it’ll terminate and what the value will be. But we won’t be able to do static evaluation on the program as that’s what’s undecidable—we have to use the heuristic of running with a time limit.

Thanks for all those details!

I did this in another PPL (written as a library for F#, a functional language for .NET) by defining bounded discrete distributions as a push-forward of a distribution on integers through an arbitrary function. To do this in Stan we would have to introduce enumerations and parametric polymorphism, I think.