Missing value imputation in 3 dimensional array of count data

Hello all,

I need some help with how to deal with imputation of missing data in my model (in rstan). Its a very simple model based on surveys of 62 species in 20 years at 6 sites. I have my data organized into a 3 dimensional array (spp x site x year). It treats the 6 sites as replicates and assumes a Poisson distribution to return a mean expected abundance for each species in each year. It works perfectly when I only include the years where we surveyed all 6 sites, but there are some years where we only surveyed 4 or 5 of the 6 sites and consequently, the data frame has NAs where these data would be.

Could anybody give me some direction as to how I can go about imputing data for the missing surveys based on the completed surveys in a particular year? I’ve done quite a bit of searching but have been unable to find a solution pertaining to the 3 dimensional array that I have. I’m having trouble wrapping my head around creating an index of the missing values while retaining the 3 dimensional structure of the data frame. Thanks!

dat<-list(y=allcountsarray, N=nyears, S=nsite, T=nspp)

ComStan = "
  data {
    int<lower=0> N;  // number of years
    int<lower=0> S;  // number of sites
    int<lower=0> T;  // number of species
    int<lower=0> y[T,S,N];  // data array (counts)
  }
  
parameters {
  matrix[T,N] log_lambda;  // log of rate parameter
}

model {

for (i in 1:T)
 for (j in 1:N)
   y[i,,j] ~ poisson_log(log_lambda[i,j]);
  
  // prior
  to_vector(log_lambda) ~ normal(-2,12);
}

generated quantities {
matrix[T,N] lambda;
for (i in 1:T) 
  for (j in 1:N)
{
  lambda[i,j] = exp(log_lambda[i,j]);
}
}
"
1 Like

You need to replace the NAs with an arbitrary non-NA value, pass a 4D data variable that indicates the multidimensional index of each missing value, declare a parameter for each then loop to replace the arbitrary-non-NA values with the parameter values.

*edit: just saw this is your first post – welcome @anagy! Feel free to let us know if we can clarify anything.

I usually use one of the following strategies for such cases:

  1. Low effort, but kinda hacky: use -1 to encode “missing” in y, and construct an array of 0’s and 1’s to indicate whether y is observed. This makes imputation easy in the generated quantities block, e.g.,
data {
  int<lower=0> N;  // number of years
  int<lower=0> S;  // number of sites
  int<lower=0> T;  // number of species
  int<lower=-1> y[T,S,N];  // data array (counts)
}

transformed data {
  int<lower=0, upper=1> y_observed[T,S,N];

  for (i in 1:T) {
    for (s in 1:S) {
      for (j in 1:N) {
        if (y[i, s, j] < 0) {
          y_observed[i, s, j] = 0; // missing
        } else {
          y_observed[i, s, j] = 1; // observed
        }
      }
    }
  }
}
  

parameters {
  matrix[T,N] log_lambda;  // log of rate parameter
}

model {

  for (i in 1:T) {
    for (s in 1:S) {
      for (j in 1:N) {
        if (y_observed[i, s, n])
          y[i, s, j] ~ poisson_log(log_lambda[i, j]);
      }
    }
  }
  // prior
  to_vector(log_lambda) ~ normal(-2,12);
}

generated quantities {
  int<lower=0> y_rep[T,S,N];
  
  for (i in 1:T) {
    for (s in 1:S) {
      for (j in 1:N) {
        y_rep[i, s, j] = poisson_log_rng(log_lambda[i, j]);
      }
    }
  }
}
  1. Cleaner, but requires more data munging: reformat the 3d array into “long format” data, where you exclude missing values. This would give you data like:
data {
  int<lower=1> N;  // number of years
  int<lower=1> S;  // number of sites
  int<lower=1> T;  // number of species
  int<lower=1> n_obs; // number of (non-missing) observations
  
  // indices for observed values
  int<lower=1, upper=N> year[n_obs];
  int<lower=1, upper=S> site[n_obs];
  int<lower=1, upper=T> species[n_obs];

  // long-form observations
  int<lower=0> y[n_obs];
}

You can then impute in the generated quantities block for whatever year/site/species combinations you want.

1 Like

I really appreciate the help with this! Its awesome to have an active community to help work through problems because none of my colleagues have worked with Stan before.

Your first solution is along the lines of what I was thinking, but couldn’t quite figure out how to do.

I replaced NAs with -1 in my data frame and tried your code, and got the following error:

SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Sampling statements (~) and increment_log_prob() are
only allowed in the model block or lp functions.
error in ‘model2a18499b5f75_e7414e6410a19cfe3db2d805613903cd’ at line 50, column 24

48:     for (s in 1:S) {
49:       for (j in 1:N) {
50:         y_rep[i, s, j] ~ poisson_log_rng(log_lambda[i, j]);
                           ^
51:       }

Any thoughts? I’m a bit lost as to where the imputation needs to be moved in order to inform lambda estimates.

Also; do I need to alter the model block to;
if (y_observed[i, s, j])=1
to generate lambdas based on only the present values?

Also; I have a long data form of these data e.g.:
'data.frame': 7440 obs. of 4 variables: $ taxon : chr "Ambloplites_ariommus" "Aplodinotus_grunniens" "Campostoma_oligolepis" "Cottus_carolinae" ... $ site : Factor w/ 6 levels "con015","con131",..: 1 1 1 1 1 1 1 1 1 1 ... $ variable: Factor w/ 20 levels "X1996","X1997",..: 1 1 1 1 1 1 1 1 1 1 ... $ value : int 0 0 15 42 1 25 0 2 30 0 ...

But I’m less familiar with how to create a Stan model that uses variable grouping for multiple estimate generation.

Could you share the full .stan file? That would help diagnose the problem.

The first error is my mistake! Replace ~ with = and you should be good to go. The error message is correct - “sampling” statements (using ~) can only go in the model block - not the generated quantities block.

Second, it would be redundant to check that y_observed[i, s, j] == 1 in the if statement - you can simply use if(y_observed[i, s, j]) and that will automatically treat y_observed[i, s, j] == 1 as satisfying the condition.

Thank you so much. Replacing ~ with = worked.
Also I apologize for what may be silly questions, I’m new to Stan (and coding in general).

The model code I posted in the original post is actually the entirety of the code. All I want is model that returns the expected abundance for species T in year N, using six data points (sites) as replicates for each year and assuming a that they’re draws from some underlying poisson distribution. The mean of that distribution is the expected abundance. The outcome I’m looking for is the lambda [i,j] in the generated quantities block of my original post.

The way I understand your code, y_rep returns a log-lambda for each species in each year at each site. Does it make sense to just include my original generated quantity block;

matrix[T,N] lambda;
for (i in 1:T) 
  for (j in 1:N)
{
  lambda[i,j] = exp(log_lambda[i,j]);
}

Here y_rep represents counts generated from the posterior predictive distribution (Poisson-distributed integers), as opposed to log_lambda which is the log expected value. It’s up to you whether you want to generate counts (y_rep) or whether computing the expected values (log_lambda) is sufficient for your needs!

1 Like