Is Ragged Array allowed in Stan?

The Stan manual says that ragged arrays are not allowed, but I also found this github tutorial on ragged arrays spec (https://github.com/stan-dev/stan/wiki/Ragged-array-spec). The github tutorial seems to be more latest than the manual; does this mean that ragged array is now allowed in Stan?

No

The Stan 2.18. user guide states in section 9:

Stan does not directly support either sparse or ragged data structures, though both can be accommodated with some programming effort.

Check under assets here:

That’s not a tutorial. That’s a proposal for a possible future feature of Stan.

I see. No wonder I ran into a lot of syntax errors when trying to implementing those codes.

OTOH there are very good ways to implement ragged arrays in Stan. For example:

int n_total_entries;
vector[n_total_entries] flat_array;

int n_l1_entries;
int starts[n_l1_entries];
int stops[n_l1_entries];
int lengths[n_l1_entries];

Then to get the k'th vector:

vector[lengths[k]] v = flat_array[starts[k]:stops[k]];

@sakrejda Thank you! Do you have suggestions for a ragged array of matrices, where each matrix has different size? I am very new to Stan programming; any suggestion or reference would be appreciated!

What you want to do with them in Stan and what are the constraints on their sizes? There’s no general answer but there are really good specific answers.

@sakrejda I want to input into Stan an array of matrices of size [num_subjects, num_basis, num_visits]. In other words, there is a corresponding matrix for each subject, and each matrix has number of rows = number of basis,
and number of columns = number of visits. Note that the number of basis is the same for all subjects, but the number of visits is different. I record the number of visits for each subject in a vector called V. I am wondering whether I could extract the value from the vector V when loading the array of matrices as well. After loading such data, then I would perform MCMC for each subject based on their matrices of basis information. Is this clear enough?

Yes that’s clear. That’s only ragged in one dimension so in Stan I would stack subjects and visits:

int num_subjects;
int num_visits[num_subjects];
int subject_starts[n_subjects];  // row of x where each subject starts
int subject_stops[n_subjects];  // row of x where each subject ends

int num_basis;
matrix[sum(num_visits), num_basis] X;

The rest depends on how you use X. If it’s a model matrix then you have a pretty
standard (G)LM setup with X as the model matrix.

My preference by now is to not have a starts and stops vector, but rather have a single one which has n+1 elements whenever you have n items. In each entry i you have the starting position of the ith element and the i+1th element is one past the end of the ith one (which happens to be the start of the i+1th one). This way you keep things tidy with a single index vector instead of two… which makes it less cluttered for me.

1 Like

I have a similar problem in which I need to access a distance matrix \mathbf{D}_j for each family to define a correlation structure via Gaussian Process Regression: \mathbf{K}_j=\exp\left[-\rho\mathbf{D}_j\right] (element-wise). Any ideas as to how to go about passing/building each \mathbf{D}_j in Stan using clever indexing? I’m sorry, but I don’t follow the examples given above.

I can build \mathbf{K}_j “on demand” within the model or transformed parameters block, but I don’t want to have to continuously rebuild each \mathbf{D}_j - as I have ~10,000 families this will likely be very slow.

Need more info here. Are the distance matrices fixed and so \rho is the only unknown? Do you mean that you have an array of J distance matrices, like matrix[T, T] D[J];?

Yes, my distance matrices are fixed - I am trying to learn the unknown tuning parameter \rho. If I understand your question correctly, then my T would essentially an array of ints giving the dimensionality of \mathbf{D}_j, e.g. T=c(1, 1, 2, 5, 7). To put it another way, I have (in R) a list D in which D[[j]] accesses \mathbf{D}_j with size n_j \times n_j.

OK, I understand now. That is going to be annoying currently. I think the least annoying option would be to compute the maximum value of T and declare the thing in data like matrix[max(T), max(T)] D[J]; and then in R put all the distances for the j-th family in the top left of the j-th matrix, leaving any unused cells as \infty.

Then in your model block, do something like

model {
  for (j in 1:J) {
    matrix[T[j], T[j]] K = exp(-rho * block(D[j], 1, 1, T[j], T[j]));
    // likelihood contribution of the j-th family
  }
}

Makes sense - thanks! Since the block command just selects the appropriate entries of the “padded” D[j] I could just as easily define the unused cells as zero, say?

You could, but if you make a mistake, it will show up more destructively if you make the unused cells infinity.

Actually negative infinity would be better due to the exp(-rho * ...) thing.

Great idea, thanks. I assume that block also works on arrays of integers as well?

No, but you can use 1:T[j].

Thanks for your help on this, Ben. Just wanted to let you know that the model compiles and runs correctly.

1 Like