Dealing with data subsetting in Stan

Hey all,

@andrewgelman brought up a proposal for data subsetting in Stan in another thread and I wanted to ask him to give more details about what the proposed syntax would look like. Here’s what he already wrote:

@andrewgelman, what would passing the index entries into the Stan program look like? Is this not solved by indexing by an array of indices, e.g.

int indices[2] = {1, 3};
int y[3] = {21, 25, 30}; 
// y[indices] == {21, 30}

Sean:

Perhaps the issue is the transition between R (or Python) and Stan. Suppose in the above example that I have a total of N data points, with N some big number such as 15000. My data structure in R will look like this:
vector[N] y;
int<lower=1, upper=J> patient_index[J];
vector[N] x;

Here, J is the number of patients, which is 200 in this case.

And my model is something like
y ~ normal(a[patient_index] + b*x, sigma);

When I do data subsetting, I’ll have some partition of patients into M (which in this example is 10) subsets, thus:
int<lower=1, upper=M> subset_index[J];

Then I’ll want to fit the model M times, each with the subset of patients corresponding to subset_index==m.

To do this in Stan, I guess that subsetting the data is not such a problem; I can do something like y[subset_index==m,] (that’s R syntax but I assume that something similar is possible in Stan). The difficulty is subsetting the parameters, as I’ll need to define a new vector “a” of length equal to length(subset_index==m) and then keep track of the index numbers.

2 Likes

This would be GroupBy with index over data and parameters?

I have also run into this issue (not surprising given Andrew and I work on similar things). My understanding is that it is not easy to do y[subset_index==m,], but if there is, that would be awesome! :)

Lauren

2 Likes

I don’t think we support y[subset_index==m,] yet - so that seems like it could help here, but it sounds like there’s still an issue even if we had that feature. Just to make sure I understand I’m going to try to flesh out Andrew’s example a little more. So in that example we’d also have

parameters {
  real a[J];
}

Did you also mean to type
int<lower=1, upper=J> patient_index[N];
instead of
int<lower=1, upper=J> patient_index[J];?
And J here is the number of patients, N is the number of data points, so we’re grouping data by patient.

So the issue comes when you want to operate over a subset of the patients. And right now you have to do this in R such that when you pass in all of the data there is now a new N and a new J, and that would work with your existing model but then the data mangling happens in R, is that right? What would it look like in Stan? To me it sounds like you’d just add a transformed data block that does subsetting and creates new N1 and J1 variables for the rest of the parameters. Something like the following if you pretend Stan has the vectorized equality mentioned previously:

data {
  int M;
  int<lower=1, upper=M> subset_index[J];
  int m;
}
transformed data {
  int J1 = length(subset_index==m);
  int N1 = length(y[subset_index==m]);
}
parameters {
    real a[J1];
    real b;
}

And now if you pass in M=1; subset_index = rep(1, J) you get the full model and otherwise you construct the subset_index as you indicated.

I think that’s the situation now modulo that subset_index==m vectorized equality functionality that we’re missing. @andrewgelman what would you like on top of that in Stan syntax?

Yes, I meant int<lower=1, upper=J> patient_index[N];

I guess the transformed data idea would work. Then after running the model we’d just have to map the subset parameter vector “a” with length J1 on to the appropriate elements of the longer parameter vector of length J. I guess that would be done in R or Python.

Yeah - sorry, just to be clear I was just trying to lay out more details about the problem, not shut down your feature request! Did you have ideas for new syntax or functions that would facilitate this pattern?

I’m not sure. It could be worth talking with Aki about this, because it’s related to subsetting issues that arise with EP.

Well, you don’t have that yet, but you can get there with these two little functions which I use all the time:

// count number times elem appears in test set
int count_elem(int[] test, int elem) {
  int count;
  count = 0;
  for(i in 1:num_elements(test))
    if(test[i] == elem)
      count = count + 1;
  return(count);
}

// find elements in test which are equal to elem
int[] which_elem(int[] test, int elem) {
  int res[count_elem(test, elem)];
  int ci;
  ci = 1;
  for(i in 1:num_elements(test))
    if(test[i] == elem) {
      res[ci] = i;
      ci = ci + 1;
    }
  return(res);
}

With that under your belt you can do things like

int sub_idx[count_elem(subset_index, m)] = which_elem(subset_index, m)

and then use sub_idx to subset your data.

Nothing new needed in Stan for this.

What is a bit of an annoyance is the lack of ragged arrays. Often the patient data is ragged and you need to deal with it in creative ways - but so for I was able to code around them; my utils.stan file is ~500 lines full of stuff like this.

7 Likes

Coming soon! The proposal is out for comments.

That sounds like a strong argument for including functions like this in Stan itself.

When you say fit the model M times, you mean M independent runs of Stan?

Could you say more about why you want to do that? There might be an easier way to accomplish your goal.

That seems more like a problem for feeding data into Stan and comparing it on the outside. Because Stan requires parameters to be numbered contiguously, this is a huge pain if you want to pull out some of them.

If you don’t mind that there is no data for some parameters in some runs, you can literally just subset the data and everything will work. As long as you have proper priors, this should work.

Otherwise, I think we have to start working back from an example to see what you want to do.

In principle, yes. In practice this means to put this data munging function into Stan-math, right? That’s a large burden to do (C++, tests, doc) for a small utility function. There was always the idea to have a r-stan-utilities package or something similar. It would be very useful to have a contributed stan functions repository. The crux is ensuring things are documented and tested. We have never solved this problem as we haven’t found a way to make this as lean as possible (and yet sufficiently rigorous).

The thing is if I use a function at my own risk, then that’s one thing, but putting it out into the wild means we have to catch all abuses of the functions and what if not. I am more than happy to share all those utilities, but rather as part of a bigger example in the example repository, for example.

Something similar as “Analysis API” would be ideal for PyStan(3).

It’s easy to add utility functions.

It’s changing the entire autodiff infrastructure that’s painful!

Does it actually have to go in Math? There are a bunch of auxiliary functions in the Stan repo, though they might not be exposed in the Stan language with the same name (e.g. x = y; calls stan::model::assign(x, y); under the hood, and that function lives in Stan). We could also just add the vectorized equality operators directly - @Bob_Carpenter did you have a proposal for that somewhere? I think you mentioned it to me once a while back.

I think container equality should be by value. If they have the same dimensions and all elements are equal, they’re equal.

Ahh, okay, you were just thinking of checking if container1 == container2. I was thinking of the R feature where you can create a boolean array with container1 == scalar (or container1 > scalar as well) - not sure what those are called. Then we can look at functionality for selecting elements from a container based on a boolean array - not sure if that should just be more indexing syntax (container[container > 4]) or a named function (where(container > 5, container).

I also don’t know a name for these. If we have y : vector and y > 0 : int[], then we can have a select function select(y, y > 0) : vector,

with usage

vector[N] y = {1, 1, 2, 1, 2, 3, 1, 2, 3, 4}
vector[sum(y > 1)] u = select(y, y > 1);

after which u will be equal to { 2, 2, 3, 2, 3, 4 }.

R keeps a record of whether a vector is the result of logical comparisons.

> y <- c(1, 2, 3)

> str(c(1,1,1))
 num [1:3] 1 1 1

> y[c(1, 1, 1)]
[1] 1 1 1

> str(y > 0)
 logi [1:3] TRUE TRUE TRUE

> y[y > 0]
[1] 1 2 3

> c(1, 1, 1) == (y > 0)
TRUE TRUE TRUE

I forgot to mention that it gets confusing if we have real > vector, vector > vector and vector > real. I could imagine these all evaluating to either booleans (reduce with and) or arrays of booleans. I guess it’d be easy enough to have booleans like every(y > 2) or some(y > 2), though it’d be a bit less efficient than internal reduction.