Brainstorming named tuples/structs

@avehtari and I were discussing today that there are many cases where tuples can be used but are sub-optimal because you need to track that t.1 is one thing and t.2 is another, without naming those.
The natural extension of this is a named tuple/struct/record type, where which would enable usages like qr_result = qr(M); print(qr_result.Q);

The implementation difficulty of these types was mostly solved while developing tuples.

But there is a lot of language-level questions we’d want to answer before adding them to Stan, which is the point of this post to ask for feedback on (basically the only thing I think people do already agree on is that variable.name would be how you access a piece of one)

Question 1: Syntax for types

How do you write down a type of these? Lets say I want a type with an element named Q and an element named R.

tuple(Q : matrix[N,N], R : matrix[N, N])?
struct { matrix[N,N] Q; matrix[N,N] R; };?
…?

Presumably we would also like to give names to these types. Should we have a types {} block before functions {}? How would that look?

types {
 type my_struct = ...;
}
types {
struct my_struct { ... } ;
}

Note that we’d really need these to be parameterized over sizes, so something like type[N,M] my_struct = tuple(vector[N], vector[M]); would be possible

Do we want to allow inline usages, or should we force you to give them all names?

Question 2: Syntax for expressions

Essentially the same question as above: how do we want it to look to create a value of one of these types.

If they have names, we could use ‘constructor style’: my_struct(Q, R)
If they don’t, we could introduce some new syntax. In ML family of languages, they use {Q=...; R=...}.

Question 3: Does order matter?

Is struct { matrix[N,N] Q; matrix[N,N] R; }; a different type than struct { matrix[N,N] R; matrix[N,N] Q; };?

This influences the syntax question above. C family languages usually say yes; ML family and dynamic languages like Javascript would say no.

Question 4: Is subtyping possible?

Related to the ordering question, if I have a function that expects a { real x; real y; } struct, should I be able to pass { int x; int y; } to it? Should I be able to pass { real x; real y; real z; }?

1 Like

Yeah named tuples would be great. Language design and implementation is really not my wheelhouse, but would it be possible to start by keeping it pretty narrow? Like just labeled fields and dot access? If that’s possible would that give us most of the benefit with much less type system complexity?

1 Like

As an average user of Stan, with a background in python, I prefer struct to tuple or namedtuple to tuple to avoid confusion.

I also think, as a completely subjective opinion, that the struct syntax feels more consistent with the C-like style of Stan. So this just fits in better:

This feature would be useful for me, at least, if that helps with motivation. I have struggled with tuples in the past because I am not smart enough to remember what something like “var.1” is supposed to mean if the context is at all complicated.

1 Like

Can you share Stan code where you now use tuples or where you would like to use named tuples? Seeing real examples would be helpful

1 Like

Sure. I often deal with ragged data like this:

data {
  ...
 // ========= dilution graph table ===========
  
  vector<lower=0>[graph_dim] target_blank_volumes;
  vector<lower=0>[graph_dim] target_transfers;
  array[graph_dim] int parents;
  // note that this parents vector must be topologically
  // sorted, so that parents[i]<i:w
  
  // link to dilution volume transfer system
  array[graph_dim] int transfer_vts;
  array[num_nonzero_blanks] int blank_vts;
  ...
}

The details of the full model may not be relevant, but for context, I work with microbiological data produced by many different people and devices. This is part of a model to give robust estimates for the concentration of virions or cells in samples using data across various contexts with different error characteristics.

Long story short, there are many tables like the above, and to help group the conceptually related data, I tried using something like this:

data {
  ...
 // ========= dilution graph table ===========
  
  tuple(
    vector[graph_dim], 
    vector[graph_dim], 
    array[graph_dim] int, 
    array[graph_dim] int, 
    array[num_nonzero_blanks]
  ) dilution_graph_table;
  ...
}

I’m not sure if this syntax is quite right, and it’s been awhile since I tried it, but hopefully the idea is getting across.

Unfortunately, later code became extremely difficult to read and understand, because dilution_graph_table.1 loses too much meaning compared to target_blank_volumes. If I could do something like dilution_graph_table.target_blank_volumes, I would find it clear and “clean” for lack of a better term.

I have also used tuples in functions which, for efficiency, compute two or three related outputs in a single call and return them together. In that context, I typically have a step where I unpack the returned tuple to give the outputs meaningful names. Structs/namedtuples would also make this easier, in my opinion.

2 Likes

Would you use named tuples as parameters or generated quantities?

Yeah, I could see that being useful, more for generated quantities than parameters. In the GQ block, I often have a bunch of code related to multiple purposes (i.e. LOO, PPC, counter-factual stuff), and I often find myself having a bunch of variables with the same prefix. Those are prime locations to consider a namedtuple. Maybe I am just unnaturally fixated on grouping things together. Here’s an annoyingly verbose example:

generated quantities {
  vector[compute_log_lik == 1 ? ND : 0] log_lik_stool_count = rep_vector(0.0, compute_log_lik == 1 ? ND : 0);
  vector[compute_log_lik == 1 ? M : 0] log_lik_grade = rep_vector(0.0, compute_log_lik == 1 ? M : 0);
  vector[compute_log_lik == 1 ? M : 0] log_lik_dysentery = rep_vector(0.0, compute_log_lik == 1 ? M : 0);
  vector[compute_log_lik == 1 ? N_weight_day : 0] log_lik_weight = rep_vector(0.0, compute_log_lik == 1 ? N_weight_day : 0);
  vector[compute_log_lik == 1 ? S : 0] log_lik_cfu_direct = rep_vector(0.0, compute_log_lik == 1 ? S : 0);
  vector[compute_log_lik == 1 ? S : 0] log_lik_cfu_enrich = rep_vector(0.0, compute_log_lik == 1 ? S : 0);
  vector[compute_log_lik == 1 ? P_ss : 0] log_lik_ss_counts = rep_vector(0.0, compute_log_lik == 1 ? P_ss : 0);
  vector[compute_log_lik == 1 ? P_mac : 0] log_lik_mac_counts = rep_vector(0.0, compute_log_lik == 1 ? P_mac : 0);
...
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_loose_stools_control_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_loose_stools_treatment_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_loose_stools_contrast_rep;

  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_loose_stools_control_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_loose_stools_treatment_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_loose_stools_contrast_rep;

  array[compute_gcomp == 1 ? gcomp_reps : 0] vector[3] gcomp_grade_probs_control_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] vector[3] gcomp_grade_probs_treatment_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] vector[3] gcomp_grade_probs_contrast_rep;

  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_dysenteric_stools_control_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_dysenteric_stools_treatment_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_dysenteric_stools_contrast_rep;

  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_total_stool_weight_control_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_total_stool_weight_treatment_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_total_stool_weight_contrast_rep;

  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_stool_weight_control_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_stool_weight_treatment_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_stool_weight_contrast_rep;

  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_mean_log_cfu_control_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_mean_log_cfu_treatment_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_mean_log_cfu_contrast_rep;

  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_log_cfu_control_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_log_cfu_treatment_rep;
  array[compute_gcomp == 1 ? gcomp_reps : 0] real gcomp_peak_log_cfu_contrast_rep;

}

A better programmer could probably find a way to do this type of thing more concisely, but it is a real example of where I’ve settled given the available features.

2 Likes

I think you still need answers to all these questions, but you can certainly pick the ‘simplest’ answers.

For question 3 (ordering), the simplest answer is “yes, ordering matters”. For question 4 (subtyping), the simplest answer is “no, they must match exactly”.
One nice benefit of these is that they’re (mostly) backwards compatible with loosening those restrictions in the future, since that would just allow some currently-not-compiling code to be accepted down the line.

(re “mostly”: these choices in tandem correspond to something that language designers call ‘nominal types’: two types can be compared by their names alone. The alternative is ‘structural typing’, which is more along the lines of ‘a type is what it contains/what shape it has’. Nominal typing allows you to write overloaded functions that accept data of the same underlying structure but differing in names, while structural typing allows you to write functions that are generic over certain things like ordering, or extra fields that get ignored. So, someone could write functions that are overloaded based only on the ordering of the struct members, and a future change to allow arbitrary ordering would break this unless we additionally had some nominal feature on top of that, like ‘brands’ in Typescript)

But any version of this at all will need good answers to all of Qs 1&2 as a starting point.

1 Like

It certainly does, and your examples in the following posts are also incredibly helpful. Thank you!

I like this syntax for types:

We might also want to have typedefs for function arguments, which don’t have sizes.

For values, I like this or the version that replaces = with : so they look like Python dictionaries or R lists.

No.

It would certainly be nice to support.

I don’t really like passing a larger structure, but I could be convinced. That is, I don’t like passing

when the signature is

This does bring up the issue of what happens if someone only specifies some of the values for a struct. There should probably be defaults that match the default inits elsewhere and then we should be able to set pieces, e.g.,

struct { int x; real y; } a = {x: 1.2};
a[y] = -3.2;

It feels weird to not quote the names!

I would actually say this kind of partial declaration should not be allowed, because it would be equivalent to a partial assignment like

struct { int x; real y; } a;
a = { x: 1 };

Allowing this to type check would be, in my opinion, a mistake. Currently in Stan, assignments are only valid if the types are equal or if the right hand side is a subtype of the left hand side, but saying struct {int x} is a subtype of struct {int x; ... } is wrong under most any definition of subtype.

It should certainly be possible to leave off any assignment during declaration, which would have all fields be their default values in Stan (i.e. nan or MIN_INT). It occurs to me that this would actually allow us to punt on question #2 entirely, if we just left out any expression-level syntax for these types in the initial version. That is, the only way to give a value to a here could (at least in version 1) be

struct {int x, real y } a;
a.x = 1;
a.y = -3.2;

This would reduce the number of questions we need to answer by one (and because { and } are already used for blocks and array literals, it also removes what might be a tricky parsing conflict to solve)

I agree about sub typing here.

But I don’t think we need strict sub typing for assignment. For example, I wouldn’t mind if we allowed arrays to be assigned to vectors without strictly making them subtypes.

I’m also OK with the null allocation and then dynamic assignment. That feels like what we do elsewhere.