Dealing with missing data in a data matrix

Hi Stan team and other statistics experts,

I’m currently working on a problem in which I’m trying to fit compare the fit of several ODE models using LOO and WAIC to a matrix data set. An issue with this data set are that the rows differ in sparsity, so you have something like the following for the data, in which the top row denotes time of the measurement:

\begin{bmatrix} t_1 & t_2 & t_3 & t_4 & t_5 & t_6 & t_7 & t_8 & t_9 & t_{10} \\ 102 & 105 & 110 & 100 & 98 & 97 & NA & 99 & NA & 97 \\ NA & NA & NA & NA & 10 & NA & 8 & NA & 8.5 & NA\end{bmatrix}.

Reading this previous StackExchange question and subsequent response from Bob Carpenter, I’m just wondering whether it would be possible at all to fit to the data in Stan without also needing to establish some sort of model for filling in the missing data, as Stan does not handle NA’s. My goal is to simply use the additional data that is there to provide more information for the fit without also needing to predict/interpolate these additional missing values. Thanks for any possible help.

What is the mechanism by which measurements are missing here? I think that’s what’ll largely determine if you need to do something special (infer missing values) or not.

Hi @bbbales2, the mechanisms are different for the constituent data vectors that make up the matrix. Loose background for the data is that it is from some papers that took different measurements for a single experiment, which was a longitudinal ecological experiment. The less sparse data row has data missing due to perhaps what can be considered as attrition. Dates were missed for sample collection, and it would be probably straightforward to interpolate the missing data. The more sparse data row by design had measurements only on a few dates. It won’t be feasible to impute or interpolate that data.

Is there a means in Stan where I can completely ignore the NA’s in matrix form?

You can definitely just evaluate the likelihood at places where you have data. You’ll need to substitute some placeholders for those NA values so that the data can be passed into Stan (NAs are special numbers in R that don’t translate into Stan), but those can be anything since we’ll ignore them (infs are appropriate).

What you want to do is make a separate data structure to store where the non-NA values are. Like something like this:

data = c(1.0, inf, 2.0)
use_this_data = c(1, 0, 1)

And then when you go to use this in Stan you can write a loop like:

for(i in 1:N) {
  if(use_this_data[i] == 1) {
    data[i] ~ normal(whatever, etc);
  }
}

In that way you’re only using the data marked to be used.

Hmm, I’m not super familiar with missing data, but I think that you need to explicitly model the missing data if the missingness is informative of the thing you are trying to infer.

So if you are trying to infer the temperature of something, but your thermometer breaks and gives bad data when T > 100, then you shouldn’t just ignore all the places where the thermometer gave you bad data cause those would be the places with really high temperature!

So I guess you’d have to think about this for each problem?

edit: Fixed problem in code pointed out by @xiehw later on (Dealing with missing data in a data matrix - #6 by xiehw)

1 Like

This was extremely helpful, thank you so much!

So I guess you’d have to think about this for each problem?

Yeah, it’s looking like I will have to handle the missing data in each matrix row differently.

One more follow-up question @bbbales2, I apologize. In the pseudocode,

data ~ normal(whatever, etc);

should be

data[i] ~ normal(whatever, etc);

right?

Not trying to be a stickler, just want to make sure I’m not missing something obvious. Thank you!

1 Like

Yeah, good catch! I’ll update the answer

Thank you!

And, in the likelihood evaluation part of the model block where the loop is occurring, I’m going to have to melt the data, I presume? Seems like that according to this: https://mc-stan.org/docs/2_20/functions-reference/normal-distribution.html

You could do the same thing with 2d data structures:

for(i in 1:I) {
  for(j in 1:J) {
    if(use_this_data[i, j] == 1) {
      data[i, j] ~ normal(whatever, etc);
    }
  }
}

But that’ll be comparatively inefficient if you have a lot of missing entries in the 2d structure. Otherwise yeah you’ll need to melt/flatten things out somehow. It’s kinda inconvenient.

1 Like