Padding values in response arrays with different sizes

Hello, everybody.

I have multinomial observations for N individuals at T time points, and I have to loop over the time points.
For a specific reason, I don’t have all N observations at every time point t.

Since I’m fitting a categorical logistic regression, my response data have to be integers, so my idea was to declare an array like

  int Y[N, 1, T]; // Answers

and iterate

  for(t in 1:T){
    target += categorical_logit_glm_lpmf(Y[,1,t] | ...);
  }

but I can’t, because the number of observations differs from one t to the next.

Reading the discussion in “How to pass an array of integer arrays of different lengths into Stan?”, I understood that the best way to solve the problem is to pad the “incomplete” arrays with Inf or -Inf values, but that left me with two questions:

1-) How can I write -Inf as a number in my data?
2-) Since I’m padding the response vector, won’t Stan treat -Inf as a new response category?

Thanks!


The general idea is to fill the NAs with something that’s acceptable as an integer to Stan but that will cause Stan to complain loudly if one of these data points that should be NA accidentally slips into analysis. Since your response distribution is categorical, you could use any negative integer for the Y values that should be NA. I like to use -99999 because it is usually visually obvious when glancing down a table of the raw data.

Then, you need to keep these disallowed values out of your call to categorical_logit_glm_lpmf. My preferred way to do this is to format the data so that all NAs are trailing within their rows, and then to pass as data int little_n[T] giving the number of genuine values in each row. Then you can do:

  for(t in 1:T){
    target += categorical_logit_glm_lpmf(Y[1:little_n[t],1,t] | ...);
  }

You’ll also need to format your covariates (if you have any) so that you can subset them in the same way, using 1:little_n[t], in the arguments to the right of the | in categorical_logit_glm_lpmf.
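
The padding itself happens wherever the data are prepared, before they reach Stan. Here is a minimal sketch of that step in Python, assuming the responses arrive as one list per time point with only the genuine values in it. The function name pad_responses and the example values are made up for illustration; the same logic carries over to R or any other language.

```python
PAD = -99999  # sentinel from the post above; any negative integer would do


def pad_responses(responses_by_time, pad=PAD):
    """Turn T ragged lists of integer responses into an N x T table.

    responses_by_time[t] holds the genuine responses observed at time t.
    Returns (Y, little_n): Y[i][t] matches int Y[N,T] in Stan, with the
    padding trailing within each time point, and little_n[t] is the number
    of genuine values at time t.
    """
    little_n = [len(resp) for resp in responses_by_time]
    n_max = max(little_n)  # N: the longest time point sets the row count
    Y = [
        [resp[i] if i < len(resp) else pad for resp in responses_by_time]
        for i in range(n_max)
    ]
    return Y, little_n


# Hypothetical example: three time points with 3, 2 and 1 observations.
Y, little_n = pad_responses([[1, 3, 2], [2, 1], [3]])
for row in Y:
    print(row)
print(little_n)
```

Covariates would need the same ordering, so that row i at time t of the covariate data lines up with Y[i][t]; their padded rows can hold any finite number, since 1:little_n[t] keeps them out of the likelihood.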

As an aside, why do you use int Y[N,1,T] instead of int Y[N,T]?

Thanks a lot for your answer, @jsocolar!
Just to check that I understood: your solution is to fill the vectors with the value -99999 and to create a new index that tells my model to read the data only up to the last genuine observation, before the first -99999, correct?
Then I also adjust my covariate matrices using the same logic.

I was using

int Y[N,1,T]

because I was trying to pass my response vectors as a list of vectors with different lengths, but I realized that it didn’t work and forgot to change it back. Now I will use

int Y[N,T]

thanks for that observation too :)

yup, that sounds right!


I don’t know the reason, but using

int Y[N,1,T]

instead of

int Y[N,T]

is faster.

Here are some total elapsed times for my model.
fit1.1 is comparable to fit2.1, and fit1.2 to fit2.2.

       chain:1 chain:2 chain:3 chain:4
fit1.1 133.655 141.312 152.418 140.107
fit1.2 155.500 159.835 141.618 148.154
fit2.1 125.247 170.338 128.658 135.469
fit2.2 119.447 139.534 140.474 145.890

fit1.1 and fit1.2 use

int Y[N,T]

and the others use

int Y[N,1,T].

Just for what it’s worth, eyeballing those numbers I don’t see clear evidence that one is faster than the other. Four or eight samples isn’t likely to be sufficient when the timings range from 125 to 170. My fairly strong prior is that adding an extraneous index in the middle position of an array declaration is very unlikely to speed anything up, but stranger things have certainly happened! If you can confirm that the speedup is real and provide a reproducible example it might be of interest to the developer community here (or maybe they already understand exactly why this would be the case).
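
As a rough check on the timings posted above, one can pool the chains for each declaration and compare the difference in means with the chain-to-chain spread. This is only a sketch: it treats the eight chains per declaration as independent samples, which is an assumption, and the variable names are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Total elapsed times per chain, copied from the table above.
times_2d = [133.655, 141.312, 152.418, 140.107,  # fit1.1, int Y[N,T]
            155.500, 159.835, 141.618, 148.154]  # fit1.2, int Y[N,T]
times_3d = [125.247, 170.338, 128.658, 135.469,  # fit2.1, int Y[N,1,T]
            119.447, 139.534, 140.474, 145.890]  # fit2.2, int Y[N,1,T]

# Positive difference means the int Y[N,1,T] runs were faster on average.
diff = mean(times_2d) - mean(times_3d)

# Welch's t statistic: difference in means divided by its standard error.
se = sqrt(stdev(times_2d) ** 2 / len(times_2d)
          + stdev(times_3d) ** 2 / len(times_3d))
welch_t = diff / se

print(f"mean difference: {diff:.2f}")
print(f"sd, int Y[N,T]: {stdev(times_2d):.2f}")
print(f"sd, int Y[N,1,T]: {stdev(times_3d):.2f}")
print(f"Welch t: {welch_t:.2f}")
```

On these numbers the difference in means is about the same size as the spread between chains, and the t statistic stays below 2, so many more runs would be needed to say anything firm.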
