Stanc3 type system overhaul

I’d like to start a refactor of the type system and its architecture in stanc3. Right now we’re essentially just using a flattened type-environment hash map that we copied from the Stan 2 compiler. @enetsee has brought up the idea of refactoring this so that distributions are a first-class concept in a new type system, and based on those conversations and a bunch of further thought about what we’re adding to the language, I think that’s an important step to take soon: it could make current and impending projects much easier. I’d like to take those ideas a step further and hopefully also deal with bijectors in the same way, as well as come up with a system for something like “computational contravariance” that would let us deal with GPU, sparse, and possibly “parallel” matrices and collections in a smart and well-founded way.

I know a lot of the issues I’d like to solve, but the details of how to solve them are very much up in the air. I’d love to have a meeting to talk more about this. I think the only required attendee is @enetsee, but I think @Matthijs and @rybern would be really great to have along if they’re not busy.

@enetsee as for timing do you want to use the discussion section of one of the stanc3 meetings this week for this topic? I feel like it’ll kind of be an architectural planning meeting with some drawings, haha.


I’m free to talk starting later in the day on Thu or otherwise on Friday.

The afternoons of Thursday / Friday would work for me too.

Thursday works for me, possibly Friday too. I’d be happy to sit in at least.

Thursday the Stan meeting could end as late as 1PM EDT, which is ~6 hours later in Europe. @enetsee said that was fine for him, would 1pm or later work for you @Matthijs?

Good point. Shall we plan a time on Fri, then?

I now have two meetings Friday morning as well, ending at noon. Sorry for the scheduling nightmare! Michael also mentioned that afternoons are somewhat more predictable / better for him since he’s got work during the day. I could also do 9am Friday for 45 minutes…

How does Monday or Tuesday next week look instead? I’m currently unbooked all day both days other than the normal stanc3 standup.

OK, how about we use the normal stanc3 meeting on Monday? Sorry to have been absent so much. It’s just been a lot: getting started with teaching, writing a grant proposal, and getting the AD paper finished before a conference deadline.

No worries. That works for me, not sure if @enetsee can take that much time off work on Monday or not?

Okay, we ended up having that meeting. To attempt to summarize some of our discussion:

  1. I personally want to see stanc3’s internal representation of distributions and bijectors become a little more formalized and structured; that has its own benefits and might shed light on what working with them would look like in Stan.
  2. People had no immediate objections to thinking about a distribution as a function that returns a record containing the relevant functions (rng, lpdf, cdf, ccdf). Example: normal(mu, sigma).lpdf or lpdf(normal(mu, sigma)). The first version assumes we end up with record-access syntax like that in Stan; the second could be done with tuples.
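To make the "distribution as a record of functions" idea concrete, here is a minimal sketch in Python (not Stan; the record shape and field names are just illustrations of the idea, not a proposed API):

```python
import math
import random

def normal(mu, sigma):
    """Sketch: a 'distribution' is a function from parameters to a
    record (here a dict) of the functions it provides, each closing
    over mu and sigma."""
    return {
        "lpdf": lambda y: (-0.5 * ((y - mu) / sigma) ** 2
                           - math.log(sigma) - 0.5 * math.log(2 * math.pi)),
        "cdf": lambda y: 0.5 * (1 + math.erf((y - mu) / (sigma * math.sqrt(2)))),
        "rng": lambda rng: rng.gauss(mu, sigma),
    }

d = normal(0.0, 1.0)
d["lpdf"](0.0)  # standard-normal log density at 0, i.e. -0.5 * log(2*pi)
d["rng"](random.Random(0))  # draw one sample
```

Under this view, `normal(mu, sigma).lpdf` is just field access on the returned record, and `lpdf(normal(mu, sigma))` would be a projection function applied to it.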

I want to take this a step further and think about what user-defined gradients and user-defined transformations/bijectors would look like as well, since we’d like to re-use patterns as much as possible. Transformations can follow the distribution path pretty seamlessly: to define a transformation, define a function that returns a record with forward, inverse, and log_det_jacobian (or some similar name) functions for the Jacobian adjustment. This requires closures, hopefully soon to come.
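As a sketch of the same record pattern applied to transforms, here is the standard lower-bound transform in Python (field names are placeholders, and `lb + exp(x)` is the usual lower-bound change of variables):

```python
import math

def lower_bound(lb):
    """Sketch: a transform as a function from its parameter (the bound)
    to a record of forward / inverse / log_det_jacobian closures."""
    return {
        "forward": lambda x: lb + math.exp(x),   # unconstrained -> constrained
        "inverse": lambda y: math.log(y - lb),   # constrained -> unconstrained
        "log_det_jacobian": lambda x: x,         # log|d/dx (lb + exp(x))| = x
    }

t = lower_bound(0.0)
t["inverse"](t["forward"](1.3))  # round-trips to 1.3
```

The closures are what carry `lb` around, which is why this pattern leans on closure support in the language.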

User-defined gradients COULD follow a similar path: a function that returns a record containing value and derivative functions. @bbbales2 what are your thoughts about this? Then, to construct a user-defined distribution that has custom gradients, you do something like this:

def normal(real mu, real sigma) {
  auto value = (real y) { 1/sqrt(2*pi... };
  auto derivative = (real y) { { (y - mu) / sigma, (mu - y) / sigma, ... } };
  return {lpdf = {value = value, derivative = derivative}};
}

I think this is kinda bloated syntactically. We’re in a DSL, after all. And we’ve already made a decision to use string suffixes to group things before, namely _lpdf, _rng, _cdf. We could pull that back in Stan 3, or we could lean in, adding something like _grad for user-defined gradients and _forward, _inverse, and _log_det_jacobian for transformations. And then we can represent this stuff internally in the MIR as nice nested records or something.

I’m going to post a separate thread about syntax for user defined gradients and user defined transformations…

I didn’t even realize people were discussing real issues in the meetings category.

Those two things are unrelated, right? Other than that you need to transform a random variable to match the constraint of a distribution.

I take it the full syntax would now be normal(mu, sigma).lpdf(y) or lpdf(normal(mu, sigma))(y)? I couldn’t follow how the latter related to tuples/structs.

Will that be in addition to writing normal_lpdf(y | mu, sigma), or are you thinking of deprecating the latter?

That’d work for something like our log densities with univariate output. But it won’t be performant enough in the multivariate case because the multivariate Jacobian is too heavy to compute and store in the forward pass and also too heavy to apply naively in the backward pass. See the whole thread on developers related to this where the adjoint-Jacobian pattern’s being kicked around.
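To illustrate why the adjoint-Jacobian pattern matters, here is a toy Python example (names are illustrative, not Stan Math API). For f(x) = cumulative sum, the Jacobian is an n-by-n lower-triangular matrix of ones, but the reverse-pass product adjᵀJ can be computed in O(n) without ever materializing J:

```python
def cumsum(x):
    # Forward pass: running sum.
    out, s = [], 0.0
    for xi in x:
        s += xi
        out.append(s)
    return out

def cumsum_adjoint_jacobian(adj):
    # Reverse pass: (adj^T J)_i = sum_{j >= i} adj_j, since
    # J[j][i] = 1 whenever i <= j. A reverse cumulative sum,
    # O(n) time and memory, no n x n Jacobian needed.
    out, s = [], 0.0
    for a in reversed(adj):
        s += a
        out.append(s)
    return list(reversed(out))

cumsum_adjoint_jacobian([1.0, 1.0, 1.0])  # [3.0, 2.0, 1.0]
```

A multivariate log_det_jacobian interface would want the same shape: accept the incoming adjoint and return the product directly, rather than return a full Jacobian.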

I wanted to include user-defined transformations in the transformed parameters section all along but I believe @betanalpha and @andrewgelman were strongly opposed. As is, there’s no way to define a constrained variable yourself in Stan that behaves like the built-in constraints, because there’s no way to control whether the Jacobian adjustment gets turned on/off during the log density eval (it’s off for MLE optimization, on for MCMC, and should be on for MAP optimization, but isn’t).

On another topic, @avehtari and others have been asking for a way to get vectorized output from log densities. That is, something like normal_lpdf_vec(y | mu, sigma) that would return a vector of log densities for each element (assuming one or more of y, mu and sigma is a collection) instead of their sum.


They are unrelated but may use similar representations, as they are conceptually both groups of functions. If I were a category theorist I would know the words for this; basically a distribution and a bijector are each functions from some number of arguments to a collection of functions. A distribution like normal takes mu and sigma and returns an lpdf from y to a real log density, an rng that takes no arguments and returns a real, etc. A transformation or bijector, lower=0 for example, takes a lower bound parameter and returns a forward function R->R, an inverse, and a Jacobian (for each? we usually only care about the forward’s Jacobian, I think).
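On the "(for each?)" aside: for an invertible transform the inverse’s log-Jacobian is determined by the forward’s, via log|g′(y)| = -log|f′(g(y))| where g = f⁻¹, so storing the forward’s is arguably enough. A quick numeric check for lower=0 (functions here are just for illustration):

```python
import math

# lower=0: forward f(x) = exp(x), inverse g(y) = log(y).
def forward(x): return math.exp(x)
def inverse(y): return math.log(y)
def log_det_forward(x): return x              # log|f'(x)| = x
def log_det_inverse(y): return -math.log(y)   # log|g'(y)| = -log(y)

y = 2.5
log_det_inverse(y) == -log_det_forward(inverse(y))  # True
```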

Yeah, those are two ways we could change Stan syntax to reflect this. I’m not really sure how I feel about adding this or replacing the _lpdf, I mostly just wanted to brainstorm here. Do you have a sense for how users would feel between those options? Perhaps a focus group or survey is appropriate. I think this is a Stan 3 thing - I don’t think I’d propose adding a new way of getting at normal’s lpdf without removing the old one.

Yes! This becomes more clear in the paradigm where a distribution returns a collection of functions. We could add another to this collection (maybe something like vec_lpdf? naming is hard. @avehtari do you have a proposed name for this somewhere?). In our current scheme, that would mean adding *_vec_lpdf functions for each of our distributions. We could actually do that in the compiler and just generate a for loop that returns the right values.
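A sketch of what such a vectorized entry could compute, in Python (the name `normal_vec_lpdf` and the minimal broadcasting helper are hypothetical, just to show the elementwise-vs-summed contract):

```python
import math

def broadcast(v, n):
    # Minimal broadcasting for the sketch: scalars repeat, lists pass through.
    return v if isinstance(v, list) else [v] * n

def normal_vec_lpdf(y, mu, sigma):
    # Elementwise log densities instead of their sum.
    n = len(y)
    mu, sigma = broadcast(mu, n), broadcast(sigma, n)
    return [(-0.5 * ((yi - mi) / si) ** 2
             - math.log(si) - 0.5 * math.log(2 * math.pi))
            for yi, mi, si in zip(y, mu, sigma)]

# sum(normal_vec_lpdf(y, mu, sigma)) recovers the usual summed lpdf,
# which is essentially the for loop the compiler could generate.
```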

Interesting - I’m not immediately seeing what would go wrong if we did the following:

  1. allow a user to define three functions in the functions block, give them some mechanism for grouping them and letting us know that they are to be collectively considered a transform. I guess if we wanted to add data variables, we’d need a fourth function to check that a variable adheres to the transform.
  2. allow users to annotate variables with the name of the transformation, e.g. real<lower=0> would refer to a transform called lower.

Then, those act the same way that our existing transforms act, warts and all. Is that also objectionable?
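One way to picture the grouping mechanism from steps 1 and 2 is a registry keyed by transform name, which `real<lower=0>` could then resolve against. This is a Python sketch of the idea only; the registry, function names, and check hook are all hypothetical, not stanc3 internals:

```python
import math

TRANSFORMS = {}

def register_transform(name, forward, inverse, log_det_jacobian, check=None):
    # Group the user's functions under one name; 'check' is the optional
    # fourth function for validating data variables against the transform.
    TRANSFORMS[name] = {
        "forward": forward,
        "inverse": inverse,
        "log_det_jacobian": log_det_jacobian,
        "check": check,
    }

register_transform(
    "lower",
    forward=lambda lb, x: lb + math.exp(x),
    inverse=lambda lb, y: math.log(y - lb),
    log_det_jacobian=lambda lb, x: x,
    check=lambda lb, y: y > lb,
)

# real<lower=0> would resolve to TRANSFORMS["lower"] with lb = 0.
TRANSFORMS["lower"]["forward"](0.0, 0.0)  # == 1.0
```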

There were some discussions a long time ago, including whether it would go in the name or as an option, but no clear winner. I’m fine with _vec_lpdf or _lpdf_vec, and I have also survived with a for loop.