(possibly bad) Idea for (debatably) easier ragged arrays

While the workaround prescribed in the SUG for ragged arrays is probably best for performance, I’m working on a scenario (a 3-level nested hierarchy with unequal numbers of observations at each nesting stage) where it’s going to be a headache to actually implement.

It occurs to me that one trick might be to simply treat the missing elements as parameters just as we do with regular ol’ missing data. This expands the number of parameters pretty substantially, but where hierarchical models already reflect very high dimensional parameter spaces, it shouldn’t be too impactful on performance. Or does anyone think my intuition is way off here?
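For the sake of argument, here’s a minimal sketch of what I mean, with two levels instead of three for brevity; the names (is_obs, y_obs, y_miss) and the padding scheme are made up purely for illustration:

data {
  int<lower=1> G;                       // groups
  int<lower=1> J;                       // padded (maximum) group size
  int<lower=0, upper=1> is_obs[G, J];   // 1 = real observation, 0 = padding
  real y_obs[G, J];                     // padded data; padding cells are ignored
}
parameters {
  real mu[G];
  real y_miss[G, J];                    // padding cells promoted to parameters
}
model {
  mu ~ normal(0, 1);
  for (g in 1:G) {
    for (j in 1:J) {
      if (is_obs[g, j] == 1)
        y_obs[g, j] ~ normal(mu[g], 1);
      else
        y_miss[g, j] ~ normal(mu[g], 1); // sampled, but never tied to any data
    }
  }
}

The obvious waste is that y_miss is declared at the full G x J size even though only the padding cells do anything, which is part of why the parameter count balloons.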

I’d be nervous about the performance, or about these things leaking into the rest of the inference and influencing it in some weird way.

Is it the nesting that makes this a headache or what? Asking because I don’t do much modeling and am curious.

I encountered the difficulty of ragged arrays for the first time just recently, despite working with Stan for several years now, so I guess it’s not super common. But on further consideration I realized that this is actually what happens automatically in any hierarchical model where there’s a design matrix X and one or more of the “random effect” levels is missing data for one or more cells in the design matrix.

For example, a typical hierarchical data scenario in the cognitive psychology world could consist of a bunch of human participants who are each observed on some outcome on multiple trials in each of multiple experimental conditions. If for whatever reason a given participant fails to have any data in one of the conditions, and the standard hierarchical model (as in SUG 11.3) is employed, the model will still generate an estimate for that participant’s coefficient for that condition; it just won’t show up in the subsequent likelihood computation. (Well, depending on the number of conditions and contrasts used in the design matrix; certainly this holds for treatment contrasts with any number of conditions, but it doesn’t apply to sum contrasts with two conditions.)
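To make that concrete, here’s a stripped-down sketch of the kind of model I mean, loosely following the SUG 11.3 setup (all the names are made up): if participant p has no trials in condition c, beta[p, c] below still gets sampled from its hierarchical prior, it just never appears in the likelihood loop.

data {
  int<lower=1> N;                       // trials
  int<lower=1> P;                       // participants
  int<lower=1> C;                       // conditions
  int<lower=1, upper=P> pid[N];
  int<lower=1, upper=C> cond[N];
  real y[N];
}
parameters {
  vector[C] beta_mean;
  vector<lower=0>[C] beta_sd;
  matrix[P, C] beta;                    // one coefficient per cell, observed or not
}
model {
  beta_mean ~ normal(0, 1);
  beta_sd ~ normal(0, 1);
  for (c in 1:C)
    col(beta, c) ~ normal(beta_mean[c], beta_sd[c]);
  for (n in 1:N)
    y[n] ~ normal(beta[pid[n], cond[n]], 1); // missing cells never show up here
}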

1 Like

The lack of ragged arrays, coupled with a lack of some basic indexing capabilities, does make nested models with missing data quite a headache. I’m extremely skeptical that this approach is a good solution in anything but a tiny subset of cases though :)

1 Like

In your particular case, you can just ‘flatten’ the hierarchy. This is roughly how brms does it (and probably rstanarm). If you don’t have the same number of observations per group, that is totally fine; just create unique labels for each nested group, and make an array of parameters with as many entries as there are unique labels. I.e., you don’t need to say that there are always “K” nested groups within “G” higher-order groups. You can instead just say there are “G” groups, and all groups within “G” are uniquely labeled (so that group ‘A’ within group G1 is not the same as group ‘A’ within group G2).

Flattening like that is usually easier than dealing with ragged arrays, if the model is amenable to it.
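A minimal sketch of what that could look like; the index arrays (flat_id, g_of_k) are hypothetical names for things you’d construct in R or Python before passing the data in:

data {
  int<lower=1> N;
  int<lower=1> G;                          // higher-order groups
  int<lower=1> K_total;                    // total unique nested groups across all of G
  int<lower=1, upper=G> g_of_k[K_total];   // parent group of each nested group
  int<lower=1, upper=K_total> flat_id[N];  // unique nested-group label per observation
  real y[N];
}
parameters {
  vector[G] mu_g;                          // higher-level means
  vector[K_total] mu_k;                    // one parameter per unique nested group
  real<lower=0> sigma_k;
}
model {
  mu_g ~ normal(0, 1);
  sigma_k ~ normal(0, 1);
  mu_k ~ normal(mu_g[g_of_k], sigma_k);    // nested groups shrink toward their parent
  y ~ normal(mu_k[flat_id], 1);
}

Nothing here assumes every higher-order group contains the same number of nested groups; K_total is just however many unique labels exist.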

3 Likes

Since we’re talking about it I want to work out a model case for this in my head.

So let’s start with a basic hierarchical regression. If I have G groups and a regression like:

y ~ (1 | group)

We’d write this in Stan like:

data {
  int N;
  int G;                             // number of groups
  int<lower=1, upper=G> groups[N];
  real y[N];
}
parameters {
  vector[G] mu;
}
model {
  mu ~ normal(0, 1);                 // some prior on the group means
  y ~ normal(mu[groups], 1.0);
}

And then in some datasets we don’t use all G groups, so some of the mu values are unused.

If we’re fitting a term like (1 | group1:group2) we might think about our hierarchical parameters like a matrix:

// the data block now also carries G1, G2 and per-observation indices group1, group2
parameters {
  matrix[G1, G2] mu;
}
model {
  for (i in 1:N) {
    y[i] ~ normal(mu[group1[i], group2[i]], 1.0);
  }
}

but not all of the entries in this matrix are used so this is a problem. Like if ‘x’ means ‘used’ and ‘o’ means ‘not used’ then our matrix might be like:

[ x x o o,
  x o x x,
  ...]

This matrix is also how we would think of some sort of nested thing, because in a nested regression our responses vary along multiple group indices; we’ve just organized the regressions in our heads in a nested way because it’s nicer to think about.

Right now, to work with these matrices, they have to be flattened into vectors. This is annoying to do since the flattening is a function of the data, and could be different even between the model block, where you’re fitting some data, and generated quantities, where you want to predict something else.

Is that all on the right path?

1 Like

Not sure I follow the flattening thing. If you just mean they are accessed via indices that vary per subject, then yeah. I would find the whole thing a bit more manageable with functionality like R: (edit: fixed ==)

real NA = 99999.0;                     // sentinel value standing in for missing data
int yindex[n1, n2];
int nobs[n1];
for (i in 1:n1) {
  // which() here is R-style pseudocode, not actual Stan
  nobs[i] = num_elements(which(y[i, ] != NA));
  yindex[i, 1:nobs[i]] = which(y[i, ] != NA);
  y[i, yindex[i, 1:nobs[i]]] = some operation;
}

But obviously ragged would be much easier ;)

1 Like

I looked at the design doc (https://github.com/stan-dev/design-docs/pull/7) and it wasn’t obvious to me how it fixed the whole problem so lemme keep asking.

With ragged arrays we turn the matrix:

[[ Y1 NA Y3 NA,
   Y1 Y2 NA Y4,
   ... ]]

into:

[[ Y1 Y3,
   Y1 Y2 Y4,
   ... ]]

And I think that’s what your code is doing too (should the == NA be != NA though?)

What bothers me about this is that there’s still a data-dependent reordering from the n1 x n2 matrix to this ragged representation, even though the matrix seems to be the natural way to organize this information (since we’re taking the time to transform from it and are apparently thinking of it in that format).

I could be having a slow day but I’m mostly confused by that last post… you don’t see how it fixes the problem, then you see that my code is like the ragged plan, then you’re worried that it’s still data dependent? From my perspective, if I don’t need anything approximating that pseudostancode I posted, it’s fixed, but I think you have other problems in mind.

Nono, I think the code you wrote improves the situation a lot, but let me clarify my other question about it.

So we have a parameter u_{i, j}.

Some elements of u_{i, j} are not used, so we want to not sample them in our Stan programs.

One thing we can do is flatten this matrix to a vector, v_{k(i, j)} = u_{i, j} where k(i, j) indexes all the non-NA values of u_{i, j} (so v only has one index).

Another thing we can do is ragged arrays r_{i, k_i(j)} = u_{i,j} where k_i(j) is a per-row mapping of the K_i non-NA indices j to a sequence of numbers 1, ..., K_i (take all the non-NAs in a row and squish them into a list).

So I see how the second remapping (ragged arrays) is simpler than the first (flattening to a vector).

It still seems like the more convenient thing, though, would be if you could just define u_{i, j} in a way that turned off all the elements that weren’t being used. That way there is no remapping at all. If you indexed a value of u_{i, j} that was turned off, maybe there could be an error.

This is, I think, where this post started: the idea there was to just put normal(0, 1) priors on the unused parameters and ignore them.

Does that make sense? The ragged remapping is simpler cause you’re just flattening one index, but if it were possible to just index the original, big, multidimensional thing with some elements turned off that would be nice too.
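For concreteness, here’s roughly what that first remapping looks like in current Stan; the scaffolding names (obs, flat_idx, K) are made up, and it’s exactly this data-dependent bookkeeping that ‘turned off’ elements would make unnecessary:

data {
  int<lower=1> n1;
  int<lower=1> n2;
  int<lower=0, upper=1> obs[n1, n2];    // 1 where u_{i, j} is actually used
  int<lower=0> K;                       // total number of used cells
}
transformed data {
  int flat_idx[n1, n2] = rep_array(0, n1, n2);  // k(i, j); 0 means "turned off"
  {
    int k = 0;
    for (i in 1:n1)
      for (j in 1:n2)
        if (obs[i, j] == 1) {
          k += 1;
          flat_idx[i, j] = k;
        }
  }
}
parameters {
  vector[K] v;                          // v[flat_idx[i, j]] plays the role of u_{i, j}
}
model {
  v ~ normal(0, 1);
}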

1 Like

Oh. Sure. Like NA’s in R? I guess that would be popular :)

1 Like