Ragged array functional specification (proposal)

This discussion is deprecated. For the latest discussion, see:

Now that tuples are well under way, ragged arrays will be next. To start, I created a simple functional specification for ragged arrays (with some input from Mitzi and residual text from the last time I tried to write out the spec):

Comments welcome (which is why I put it here rather than the developer-post-only section).

I am a new user of Stan. I was looking at your ragged array spec tutorials on github. However, I was running into a lot of troubles when implementing your code, even the simplest example like

int<lower = 0> N;
int<lower = 0>[N] M;
int[M] y;

R gave me a syntax error like below:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:

error in ‘model174444bd7dc11_raggedArray_example’ at line 4, column 6

 2:   int<lower = 0> N;
 3:   int<lower = 0> M [N];
 4:   int[M] y; 
         ^
 5: }

PARSER EXPECTED:
Error in stanc(file = file, model_code = model_code, model_name = model_name, :
failed to parse Stan model ‘raggedArray_example’ due to the above error.

Do you know why? Thanks!

It’s just a proposal that hasn’t been implemented yet! Maybe in a year or two we’ll have proper ragged arrays. Until then, the manual contains instructions on how to code ragged arrays.

It occurs to me that if you didn’t have to specify the sizes of data arrays, vectors, and matrices in advance, then any data array declaration could be treated as a possibly ragged array declaration. For example, an array of vectors of varying length might be declared as, say, vector foo[];. This would have the added advantage of mirroring the syntax already used for specifying array types in function declarations. Of course, this would mean that the various Stan interfaces would have to infer the intended dimensions of these data arrays, but judging from the “dimensional mismatch” error messages that I’ve seen, it looks like this is already being done anyway.

That said, you’d still need to specify the dimensions of parameter arrays, vectors, and matrices in advance, so you’d still need a ragged array specification for that.

Because it is not implemented yet.

That’s right. It would be a radical change in terms of how variables are defined. It’d be much easier to do for local variables as @bgoodri has often pointed out—the problem with parameters is that we need rectangular output for HMC, R-hat, etc.

At run time, every variable with a value is sized and it can never be reassigned to a variable of a different size.

Right. There are currently three different type syntaxes:

  • block variables: +constraints, +sizes
  • local variables: -constraints, +sizes
  • function arguments: -constraints, -sizes

We need the constraints on transformed parameters for semantics and we need the sizes on data for the way we do I/O and we need consistent numbers/locations of parameters for all of our posteriors. So we probably don’t want to change block variable declarations.

Function arguments already do what you want.

So let’s look at local variables. What would the size be of a vector declared like this:

for (n in 1:10) {
  vector x; 
  int n = rows(x); // ???

Anything other than zero would be totally arbitrary. So let’s assume they start at size zero. So now what happens when we do this:

vector x;
x[1] = 3;  // error!  x is of size 0

It seems like we’d either need to change our semantics to be like R and resize whenever you get an index out of bound or we’d have to have a way of setting the size. Maybe we have a x.set_rows(int) method that works like C++ std::vector equivalent and trims if it’s longer or expands with default values if it’s shorter. That’s really inefficient (see R), but perhaps doable. The efficiency hit wouldn’t be too bad as we do bounds checking on indexes anyway. We don’t have anything else object-oriented like this or mutable like this for functions, but it might be possible.

Or we could just treat the above as an error. If it’s declared without a size, the only legal thing to do is assign. So then we’d need something like partial size declarations,

int[N, ] x;

for a ragged array of size N with potentially different numbers of elements in each row.

variable name conflict - also, is this about local variables, and the for loop here is just to set up the local variable - it’s not about changing the size of the vector w/ each look iteration or anything like that is it?

Sorry for the confusing example—the variable duplication was a mistake and the loop’s a red herring.

The issue would be what to do with this:

vector x;
int k = rows(x);

I think n has to be zero here—that is, x has to start out size zero. But then how do we add elements to it?

The bigger question’s whether we can get rid of sizing on local variables.

Or whether we want to do size checks on assignment or just allow things to get ragged or change sizes. I believe this is what @bgoodri was suggesting in the special case that the container was initially sized zero.