Ragged data and mixture (occupancy) model

jcwi · August 26, 2020, 8:23pm

Hi, facing what seems like a really basic problem relating to a ragged data matrix and a row-structured likelihood, but having difficulty finding comparable examples. I think the easiest example for me to describe this involves something like a zero-inflated binomial model with this sort of likelihood (modified a little from Maxwell Joseph’s post here–https://mbjoseph.github.io/posts/2020-04-28-a-step-by-step-guide-to-marginalizing-over-discrete-parameters-for-ecologists-using-stan/:

data {
  int<lower = 1> N;
  int<lower = 1> K;
  int<lower = 0, upper = 1> y[N, K];
  int<lower =0, upper =1> max_y[N];
}

parameters {
  real<lower = 0, upper = 1> p;
  real<lower = 0, upper = 1> psi;
}

transformed parameters {
  vector[N] log_lik;
  
  for (i in 1:N) {
    if (max_y[i] > 0) {
      log_lik[i] = log(psi) + bernoulli_lpmf(y[i,] | p);
    } else {
      log_lik[i] = log_sum_exp(
        log(psi) +bernoulli_lpmf(y[i,] | p), 
        log1m(psi)
      );
    }
  }
}

model {
  target += sum(log_lik);
  target += uniform_lpdf(p | 0, 1);
  target += uniform_lpdf(psi | 0, 1);
}

These models are well supported and described in various places (e.g., https://github.com/stan-dev/example-models/tree/master/BPA), but I have not seen examples that cover the common situation where there are missing y values and variation in the parameter p (e.g., matrix[N, K] p). I’m trying to figure out the easiest way to sum something like log(psi[i])+log(y[i,]|p[i, ]) while only grabbing the elements of y[i,] that were actually sampled. My understanding is that stan does not yet support ragged arrays or nested indexing like y[i, which_sampled[i]]. What I have tried is something like a):

for (i in 1:nsites) {
          vector[2] lp;
            if (ymax[i] > 0) {
               lp[1]=log(psi[i])+sum(bernoulli_lpmf(y[i,]|p[i, ])*sampled[i, ]);
                          //here sampled[i,] is a vector of 1's/0's denoting whether sampling occurred or not
                          //if sampled [i, j]=0, idea is that log(y[i,j]|p[i, j])=0
               lp[2]=0;
          }else{
               lp[1]=log(psi[i])+sum(bernoulli_lpmf(y[i,]|p[i,])*sampled[i, ]);
               lp[2]=log1m(psi[i]);
          }
     target+=log_sum_exp(lp);
  }

but my reading of the manual is that bernoulli_lpmf(y[i,1:K]|p[i,1:K]) produces a scalar? (i.e.,the implementation above is quite wrong).

My other very hamfisted approach has been to define y as a matrix (rather than an integer type) and use element-wise operators:

for (i in 1:nsites) {
      vector[2] lp;
          if (ymax[i] > 0) {
             lp[1]=log(psi[i])+dot_product(log((y[i,].* p[i,])+(1-y[i,]).* (1-p[i,])), sampled[i,]);
             //here sampled[i,] is a vector of 1's/0's denoting whether sampling occurred or not
             //if sampled [i, j]=0, idea is that log(y[i,j]|p[i, j])=0
             lp[2]=0;
          }else{
             lp[1]=log(psi[i])+dot_product(log((y[i,].* p[i,])+(1-y[i,]).* (1-p[i,])), sampled[i,]);
             lp[2]=log1m(psi[i]);
          }
    target+=log_sum_exp(lp);
 }

I think this is at least using the correct log-likelihood, but seems tremendously inefficient. I understand that something like the segment operator could work for y[i,]|p[i,], but trying preserve the existing column structure for the y matrix and varied covariate arrays for a few reasons. Is there some other obvious approach I’m missing?

Thanks,

John

mbjoseph · August 26, 2020, 9:29pm

I usually use an even more ham fisted approach with if statements to skip over missing data e.g.,

for (i in 1:nsites) {
  real lp_if_present = log(psi[i]);

  for (j in 1:nsurveys) {
    if (sampled[i, j])
      lp_if_present += bernoulli_lpmf(y[i, j] | p[i, j])
  }

  if (ymax[i] > 0) {
    log_lik[i] = lp_if_present;
  } else {
    log_lik[i] = log_sum_exp(lp_if_present, log1m(psi[i]));
  }
}

Here sampled is a binary indicator for whether site i was sampled on survey j. This allows you to retain the column structure for y and p, if that’s important.

jcwi · August 26, 2020, 9:42pm

Thanks(!) and facepalm–I’m truly becoming an expert at confusing the hell out of myself trying to avoid doing something that works perfectly well.

Topic		Replies	Views
Dealing with ragged and partial data Modeling	2	677	December 21, 2017
Ragged data matrix and some questions using "segment" function Modeling	1	918	November 1, 2017
Log-Likelihood with different parts Modeling	4	509	July 31, 2020
Calculating log_lik for occupancy (mixture) models Modeling ecology	4	1260	March 30, 2018
Ragged arrays and time series operations Modeling	15	1902	February 23, 2018

Ragged data and mixture (occupancy) model

Related topics