Ragged data and mixture (occupancy) model

Hi, I'm facing what seems like a really basic problem relating to a ragged data matrix and a row-structured likelihood, but I'm having difficulty finding comparable examples. The easiest way for me to describe it involves something like a zero-inflated binomial model with this sort of likelihood (modified a little from Maxwell Joseph's post here: https://mbjoseph.github.io/posts/2020-04-28-a-step-by-step-guide-to-marginalizing-over-discrete-parameters-for-ecologists-using-stan/):

data {
  int<lower = 1> N;                    // number of sites
  int<lower = 1> K;                    // number of surveys per site
  int<lower = 0, upper = 1> y[N, K];   // detection histories
  int<lower = 0, upper = 1> max_y[N];  // 1 if site i had any detection
}

parameters {
  real<lower = 0, upper = 1> p;    // detection probability
  real<lower = 0, upper = 1> psi;  // occupancy probability
}

transformed parameters {
  vector[N] log_lik;

  for (i in 1:N) {
    if (max_y[i] > 0) {
      // detected at least once, so the site must be occupied
      log_lik[i] = log(psi) + bernoulli_lpmf(y[i, ] | p);
    } else {
      // never detected: either occupied but missed, or truly unoccupied
      log_lik[i] = log_sum_exp(
        log(psi) + bernoulli_lpmf(y[i, ] | p),
        log1m(psi)
      );
    }
  }
}

model {
  target += sum(log_lik);
  target += uniform_lpdf(p | 0, 1);
  target += uniform_lpdf(psi | 0, 1);
}
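
For reference, the marginal likelihood this code implements for site $i$ is

$$
L_i = \begin{cases}
\psi \prod_{j=1}^{K} p^{y_{ij}} (1 - p)^{1 - y_{ij}}, & \text{if } \max_j y_{ij} = 1, \\
\psi \, (1 - p)^K + (1 - \psi), & \text{if } \max_j y_{ij} = 0,
\end{cases}
$$

i.e., the discrete occupancy state is summed out so everything stays continuous for HMC.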

These models are well supported and described in various places (e.g., https://github.com/stan-dev/example-models/tree/master/BPA), but I have not seen examples that cover the common situation where there are missing y values and variation in the parameter p (e.g., matrix[N, K] p). I'm trying to figure out the easiest way to compute something like log(psi[i]) + bernoulli_lpmf(y[i, ] | p[i, ]) while only grabbing the elements of y[i, ] that were actually sampled. My understanding is that Stan does not yet support ragged arrays, or nested indexing like y[i, which_sampled[i]] where which_sampled[i] varies in length from site to site. What I have tried is something like (a):

for (i in 1:nsites) {
  vector[2] lp;
  if (ymax[i] > 0) {
    lp[1] = log(psi[i]) + sum(bernoulli_lpmf(y[i, ] | p[i, ]) * sampled[i, ]);
    // here sampled[i, ] is a vector of 1's/0's denoting whether sampling occurred or not
    // if sampled[i, j] = 0, the idea is that log(y[i, j] | p[i, j]) = 0
    lp[2] = 0;
  } else {
    lp[1] = log(psi[i]) + sum(bernoulli_lpmf(y[i, ] | p[i, ]) * sampled[i, ]);
    lp[2] = log1m(psi[i]);
  }
  target += log_sum_exp(lp);
}

but my reading of the manual is that bernoulli_lpmf(y[i, 1:K] | p[i, 1:K]) produces a scalar? (i.e., the implementation above is quite wrong).
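
As far as I can tell, the vectorized call is just shorthand for the already-summed loop, i.e., schematically for one row:

real lp_vec = bernoulli_lpmf(y[i] | p[i]);         // a single scalar: the sum of the elementwise terms
real lp_loop = 0;
for (j in 1:K)
  lp_loop += bernoulli_lpmf(y[i, j] | p[i, j]);    // lp_loop ends up equal to lp_vec

so multiplying that scalar by sampled[i, ] can't mask the individual surveys.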

My other, very ham-fisted approach has been to define y as a matrix (rather than as an integer array) and use element-wise operators:

for (i in 1:nsites) {
  vector[2] lp;
  if (ymax[i] > 0) {
    lp[1] = log(psi[i]) + dot_product(log((y[i, ] .* p[i, ]) + (1 - y[i, ]) .* (1 - p[i, ])), sampled[i, ]);
    // here sampled[i, ] is a vector of 1's/0's denoting whether sampling occurred or not
    // if sampled[i, j] = 0, the idea is that log(y[i, j] | p[i, j]) = 0
    lp[2] = 0;
  } else {
    lp[1] = log(psi[i]) + dot_product(log((y[i, ] .* p[i, ]) + (1 - y[i, ]) .* (1 - p[i, ])), sampled[i, ]);
    lp[2] = log1m(psi[i]);
  }
  target += log_sum_exp(lp);
}

I think this is at least using the correct log-likelihood, but it seems tremendously inefficient. I understand that something like the segment operator could work for y[i, ] | p[i, ], but I'm trying to preserve the existing column structure of the y matrix and the various covariate arrays for a few reasons; the sketch just below shows the kind of long-format restructuring I mean. Is there some other obvious approach I'm missing?
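
For concreteness, here is roughly the segment-based, long-format pattern from the ragged-data section of the user's guide that I'm hoping to avoid (variable names are just illustrative, and it assumes p has been flattened into a vector[N_obs] p_long in the same order as y_long):

data {
  int<lower = 1> nsites;
  int<lower = 0> N_obs;                     // total number of surveys actually run
  int<lower = 0, upper = 1> y_long[N_obs];  // detections stacked site by site
  int<lower = 1> pos[nsites];               // start of site i's block in y_long
  int<lower = 0> n_obs[nsites];             // number of surveys run at site i
  int<lower = 0, upper = 1> ymax[nsites];   // 1 if any detection at site i
}

model {
  for (i in 1:nsites) {
    real lp_if_present = log(psi[i])
      + bernoulli_lpmf(segment(y_long, pos[i], n_obs[i])
                       | segment(p_long, pos[i], n_obs[i]));
    if (ymax[i] > 0)
      target += lp_if_present;
    else
      target += log_sum_exp(lp_if_present, log1m(psi[i]));
  }
}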

Thanks,

John

I usually use an even more ham-fisted approach, with if statements to skip over missing data, e.g.:

for (i in 1:nsites) {
  real lp_if_present = log(psi[i]);

  for (j in 1:nsurveys) {
    if (sampled[i, j])
      lp_if_present += bernoulli_lpmf(y[i, j] | p[i, j]);
  }

  if (ymax[i] > 0) {
    log_lik[i] = lp_if_present;
  } else {
    log_lik[i] = log_sum_exp(lp_if_present, log1m(psi[i]));
  }
}

Here sampled is a binary indicator for whether site i was sampled on survey j. This allows you to retain the column structure for y and p, if that’s important.
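
If the inner loop ever becomes a bottleneck, I think you could also vectorize it with multiple indexing by passing padded index arrays as data. An untested sketch (n_obs and obs_idx are made-up names, and it assumes every site has at least one survey so 1:n_obs[i] is never empty):

data {
  // ...everything as before, plus:
  int<lower = 0, upper = nsurveys> n_obs[nsites];             // surveys actually run at site i
  int<lower = 1, upper = nsurveys> obs_idx[nsites, nsurveys]; // first n_obs[i] entries are the survey
                                                              // indices; the rest is padding (any valid index)
}

model {
  for (i in 1:nsites) {
    real lp_if_present = log(psi[i])
      + bernoulli_lpmf(y[i, obs_idx[i, 1:n_obs[i]]] | p[i, obs_idx[i, 1:n_obs[i]]]);
    if (ymax[i] > 0)
      target += lp_if_present;
    else
      target += log_sum_exp(lp_if_present, log1m(psi[i]));
  }
}

The padding sidesteps the raggedness: every row of obs_idx has the same length, but only the first n_obs[i] entries are ever read.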


Thanks(!) and facepalm: I'm truly becoming an expert at confusing the hell out of myself trying to avoid doing something that works perfectly well.
