Subsetting a vector/matrix with a boolean (application to DLM)

Hi All,

I am trying to implement a multivariate dynamic linear model where many observations Y-values are NA. To get around the NA’s, I coded all NAs at 10 (since the range of data is much small than 10) and created an indicator data array D which has value 1 when Y!=10 and 0 otherwise. It appears rstan does not have boolean indexing. This is the error message I get:

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: Unrecoverable error evaluating the log probability at the initial value.
Chain 1: Exception: matrix[multi,multi] row indexing: accessing element out of range. index 0 out of range; expecting index to be between 1 and 11 (in 'string', line 51, column 4 to column 75)
[1] "Error : Exception: matrix[multi,multi] row indexing: accessing element out of range. index 0 out of range; expecting index to be between 1 and 11 (in 'string', line 51, column 4 to column 75)"
[1] "error occurred during calling the sampler; sampling not done"

Any suggestions on how to alter this code so that it handles the NA’s? There may be other issues with this code, but the error I am trying to solve right now is the indexing one.

Code below:


data {
  int<lower=0> N ; // number of observations
  int<lower=0> S ; // number of streams
  matrix[N,S] Y ; // matrix of stream observations, including NAs (coded as 10)
  array[N,S] int<lower=0,upper=1> D ; // matrix of indicators where D = 0 when Y = NA (i.e. 10)
  real<lower=0.0> g ; // smoothness parameter (bigger = smoother)
}

parameters {
  corr_matrix[S] R ;
  vector[S] latent_log_sigma ;
  matrix[S,N] z ;
  vector[S] z_0 ;
}

transformed parameters{
  vector<lower=0.0>[S] sigma = exp(-1+sqrt2()*latent_log_sigma) ;
  matrix[S,S] L = cholesky_decompose(R) ;
  matrix[S,N] mu ;
  {
    mu[,1] =  g * sigma .* ( L * z_0 ) ;
    for(j in 2:N){
      mu[,j] = mu[,j-1] + g * sigma .* (L * z[,j]) ;
    }
  }
  matrix[S,S] Sigma = quad_form_diag(R,g*sigma) ;
}

model {
  
  //priors and latent processes
  
  for(s in 1:S){
    target+= std_normal_lpdf(latent_log_sigma[s]) ;
    target+= std_normal_lpdf(z_0[s]) ;
    for(j in 1:N){
      target+= std_normal_lpdf(z[s,j]) ;
    }
  }
  
  // likelihood
  
  for(j in 1:N){
    target+= multi_normal_lpdf(Y[j,D[j,]]|mu[D[j,],j],Sigma[D[j,],D[j,]]) ;
  }
}

You would need to build an index of the values you want to run. The requires building the size of the indexer and then building the index.

Adapted from some of my own code:

for (n in 1:N) {

 // declare the index-sizing variable
 int m[S];

 // build the logical vector of non-zero trials
 for (tt in 1:S) {
  m[tt] = Y[N,S] != 10;
 }

 int t;
 int ind[sum(m)];

 for (tr in 1:S) {
  if Y[N,S] != 10) {

   // add this trial to the index
   ind[t] = tr;

   // increment the index
   t = t + 1;
  }
 }
}

In my code, the purpose of this was to allow me to only put expensive computations inside of the conditional statement if it was for valid trials, and then use the index that was concurrently built to exclude NaNs from further computation. In my code:

llhC += sum(columns_dot_product(resps[n][ind], log(lhC[ind])));

As you can see, it allowed me to skip over calculating the likelihood of invalid choices before multiplying them by the response array. In particular, I was skipping over trials corresponding to [0,0,0,0] rows in my response arrays because I padded my response arrays with those zero rows so they all had the same number of trials. The model would have run tons of unnecessary computations if I didn’t exclude those trials and then index in this way to make sure only valid trials were ultimately added as log likelihoods to the posterior.

That’s right. You have to do it manually.

If you have a separate indicator array, you don’t need to worry about the 10 values and they can just be made NaN.

IYou don’t want to declare R to be a correlation matrix and then cholesky decompose. It’s better just to declare L as a parameter that’s a Cholesky factor of a correlation matrix. You can reconstruct R = L * L' in the generated quantities block if necessary.

You also don’t want to create Sigma in dense form. Better to continue working with Cholesky factors all the way through for efficiency and numerical stability. That is, define L_Sigma and then use multi_normal_cholesky_lpdf to take the covariance matrix in Cholesky factor form.

Thanks @Bob_Carpenter and @Corey.Plate so much for you help. @Bob_Carpenter I’m enjoying all the notes and advice on computational efficiency in Stan, so thank you. @Corey.Plate This looks like it’ll work, but I’ll need a few more days to digest it and update my code.