Dealing with missing data in a data matrix

xiehw · November 13, 2019, 1:17am

Hi Stan team and other statistics experts,

I’m currently working on a problem in which I’m trying to fit compare the fit of several ODE models using LOO and WAIC to a matrix data set. An issue with this data set are that the rows differ in sparsity, so you have something like the following for the data, in which the top row denotes time of the measurement:

\begin{bmatrix} t_1 & t_2 & t_3 & t_4 & t_5 & t_6 & t_7 & t_8 & t_9 & t_{10} \\ 102 & 105 & 110 & 100 & 98 & 97 & NA & 99 & NA & 97 \\ NA & NA & NA & NA & 10 & NA & 8 & NA & 8.5 & NA\end{bmatrix}.

Reading this previous StackExchange question and subsequent response from Bob Carpenter, I’m just wondering whether it would be possible at all to fit to the data in Stan without also needing to establish some sort of model for filling in the missing data, as Stan does not handle NA’s. My goal is to simply use the additional data that is there to provide more information for the fit without also needing to predict/interpolate these additional missing values. Thanks for any possible help.

bbbales2 · November 13, 2019, 1:05pm

What is the mechanism by which measurements are missing here? I think that’s what’ll largely determine if you need to do something special (infer missing values) or not.

xiehw · November 14, 2019, 12:08am

Hi @bbbales2, the mechanisms are different for the constituent data vectors that make up the matrix. Loose background for the data is that it is from some papers that took different measurements for a single experiment, which was a longitudinal ecological experiment. The less sparse data row has data missing due to perhaps what can be considered as attrition. Dates were missed for sample collection, and it would be probably straightforward to interpolate the missing data. The more sparse data row by design had measurements only on a few dates. It won’t be feasible to impute or interpolate that data.

Is there a means in Stan where I can completely ignore the NA’s in matrix form?

bbbales2 · November 14, 2019, 12:26am

You can definitely just evaluate the likelihood at places where you have data. You’ll need to substitute some placeholders for those NA values so that the data can be passed into Stan (NAs are special numbers in R that don’t translate into Stan), but those can be anything since we’ll ignore them (infs are appropriate).

What you want to do is make a separate data structure to store where the non-NA values are. Like something like this:

data = c(1.0, inf, 2.0)
use_this_data = c(1, 0, 1)

And then when you go to use this in Stan you can write a loop like:

for(i in 1:N) {
  if(use_this_data[i] == 1) {
    data[i] ~ normal(whatever, etc);
  }
}

In that way you’re only using the data marked to be used.

Hmm, I’m not super familiar with missing data, but I think that you need to explicitly model the missing data if the missingness is informative of the thing you are trying to infer.

So if you are trying to infer the temperature of something, but your thermometer breaks and gives bad data when T > 100, then you shouldn’t just ignore all the places where the thermometer gave you bad data cause those would be the places with really high temperature!

So I guess you’d have to think about this for each problem?

edit: Fixed problem in code pointed out by @xiehw later on (Dealing with missing data in a data matrix - #6 by xiehw)

xiehw · November 14, 2019, 12:35am

This was extremely helpful, thank you so much!

So I guess you’d have to think about this for each problem?

Yeah, it’s looking like I will have to handle the missing data in each matrix row differently.

xiehw · November 14, 2019, 12:48am

One more follow-up question @bbbales2, I apologize. In the pseudocode,

data ~ normal(whatever, etc);

should be

data[i] ~ normal(whatever, etc);

right?

Not trying to be a stickler, just want to make sure I’m not missing something obvious. Thank you!

bbbales2 · November 14, 2019, 12:49am

Yeah, good catch! I’ll update the answer

xiehw · November 14, 2019, 2:01am

Thank you!

And, in the likelihood evaluation part of the model block where the loop is occurring, I’m going to have to melt the data, I presume? Seems like that according to this: https://mc-stan.org/docs/2_20/functions-reference/normal-distribution.html

bbbales2 · November 14, 2019, 2:05am

You could do the same thing with 2d data structures:

for(i in 1:I) {
  for(j in 1:J) {
    if(use_this_data[i, j] == 1) {
      data[i, j] ~ normal(whatever, etc);
    }
  }
}

But that’ll be comparatively inefficient if you have a lot of missing entries in the 2d structure. Otherwise yeah you’ll need to melt/flatten things out somehow. It’s kinda inconvenient.

Topic		Replies	Views
Stan does not support NA in data Modeling	8	3198	March 7, 2018
Missing response model (section 10.3 of Stan manual) Modeling	11	2444	May 24, 2017
Missing data in Stan - some difficulties understanding Modeling	6	575	August 16, 2021
Missing parameters and priors Modeling	25	810	June 27, 2020
Missing data Modeling	1	602	October 6, 2018

Dealing with missing data in a data matrix

Related topics