 # Ragged data and mixture (occupancy) model

Hi, I'm facing what seems like a really basic problem relating to a ragged data matrix and a row-structured likelihood, but I'm having difficulty finding comparable examples. I think the easiest way to describe this is with something like a zero-inflated binomial model with this sort of likelihood (modified a little from Maxwell Joseph’s post here: https://mbjoseph.github.io/posts/2020-04-28-a-step-by-step-guide-to-marginalizing-over-discrete-parameters-for-ecologists-using-stan/):

```stan
data {
  int<lower=1> N;
  int<lower=1> K;
  int<lower=0, upper=1> y[N, K];
  int<lower=0, upper=1> max_y[N];
}

parameters {
  real<lower=0, upper=1> p;
  real<lower=0, upper=1> psi;
}

transformed parameters {
  vector[N] log_lik;

  for (i in 1:N) {
    if (max_y[i] > 0) {
      log_lik[i] = log(psi) + bernoulli_lpmf(y[i, ] | p);
    } else {
      log_lik[i] = log_sum_exp(log(psi) + bernoulli_lpmf(y[i, ] | p),
                               log1m(psi));
    }
  }
}

model {
  target += sum(log_lik);
  target += uniform_lpdf(p | 0, 1);
  target += uniform_lpdf(psi | 0, 1);
}
```

These models are well supported and described in various places (e.g., https://github.com/stan-dev/example-models/tree/master/BPA), but I have not seen examples that cover the common situation where there are missing y values and variation in the detection parameter p (e.g., matrix[N, K] p). I’m trying to figure out the easiest way to sum something like log(psi[i]) + bernoulli_lpmf(y[i, ] | p[i, ]) while only grabbing the elements of y[i, ] that were actually sampled. My understanding is that Stan does not yet support ragged arrays or nested indexing like y[i, which_sampled[i]]. What I have tried is something like (a):

```stan
for (i in 1:nsites) {
  real lp;
  // sampled[i, ] is a vector of 1's/0's denoting whether sampling occurred;
  // if sampled[i, j] == 0, the idea is that the j-th log-likelihood term is 0
  if (ymax[i] > 0) {
    lp = log(psi[i]) + sum(bernoulli_lpmf(y[i, ] | p[i, ]) * sampled[i, ]);
  } else {
    lp = log_sum_exp(log(psi[i]) + sum(bernoulli_lpmf(y[i, ] | p[i, ]) * sampled[i, ]),
                     log1m(psi[i]));
  }
  target += lp;
}
```

but my reading of the manual is that bernoulli_lpmf(y[i, 1:K] | p[i, 1:K]) produces a scalar (the sum of the element-wise log-probabilities), so multiplying it by sampled[i, ] cannot mask individual terms (i.e., the implementation above is quite wrong).

My other, very ham-fisted approach has been to define y as a matrix (rather than an integer array) and use element-wise operators:

```stan
for (i in 1:nsites) {
  real lp;
  // sampled[i, ] is a vector of 1's/0's denoting whether sampling occurred;
  // if sampled[i, j] == 0, the j-th log-likelihood term is multiplied by 0
  if (ymax[i] > 0) {
    lp = log(psi[i])
         + dot_product(log(y[i, ] .* p[i, ] + (1 - y[i, ]) .* (1 - p[i, ])),
                       sampled[i, ]);
  } else {
    lp = log_sum_exp(log(psi[i])
                     + dot_product(log(y[i, ] .* p[i, ] + (1 - y[i, ]) .* (1 - p[i, ])),
                                   sampled[i, ]),
                     log1m(psi[i]));
  }
  target += lp;
}
```

I think this at least uses the correct log-likelihood, but it seems tremendously inefficient. I understand that something like the segment operator could work for y[i, ] | p[i, ], but I’m trying to preserve the existing column structure of the y matrix and the various covariate arrays for a few reasons. Is there some other obvious approach I’m missing?

Thanks,

John

I usually use an even more ham-fisted approach with `if` statements to skip over missing data, e.g.,

```stan
for (i in 1:nsites) {
  real lp_if_present = log(psi[i]);

  for (j in 1:nsurveys) {
    if (sampled[i, j])
      lp_if_present += bernoulli_lpmf(y[i, j] | p[i, j]);
  }

  if (ymax[i] > 0) {
    log_lik[i] = lp_if_present;
  } else {
    log_lik[i] = log_sum_exp(lp_if_present, log1m(psi[i]));
  }
}
```

Here `sampled` is a binary indicator for whether site `i` was sampled on survey `j`. This allows you to retain the column structure for `y` and `p`, if that’s important.
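
If you’d rather avoid the inner loop, a possible alternative is Stan’s multiple indexing, which lets you pick out just the sampled columns in a single vectorized call. This is only a sketch: `n_sampled` and `survey_idx` are hypothetical arrays you’d need to build yourself (before fitting, or in `transformed data` from `sampled`), and it assumes every site has at least one sampled survey.

```stan
data {
  // ... existing declarations (nsites, nsurveys, y, ymax, etc.) ...
  // hypothetical: number of surveys actually run at site i (assumed >= 1)
  int<lower=1, upper=nsurveys> n_sampled[nsites];
  // hypothetical: first n_sampled[i] entries give the sampled columns of row i;
  // remaining entries are padding and are never read
  int<lower=1, upper=nsurveys> survey_idx[nsites, nsurveys];
}

// ... in the transformed parameters or model block:
for (i in 1:nsites) {
  int n = n_sampled[i];
  real lp_if_present = log(psi[i])
    + bernoulli_lpmf(y[i, survey_idx[i, 1:n]] | p[i, survey_idx[i, 1:n]]);
  if (ymax[i] > 0)
    log_lik[i] = lp_if_present;
  else
    log_lik[i] = log_sum_exp(lp_if_present, log1m(psi[i]));
}
```

This keeps the column structure of `y` and `p` intact; the index arrays just record which entries to use.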


Thanks (!) and facepalm: I’m truly becoming an expert at confusing the hell out of myself trying to avoid doing something that works perfectly well.
