Hi, facing what seems like a really basic problem relating to a ragged data matrix and a row-structured likelihood, but having difficulty finding comparable examples. I think the easiest example for me to describe this involves something like a zero-inflated binomial model with this sort of likelihood (modified a little from Maxwell Joseph’s post here–https://mbjoseph.github.io/posts/2020-04-28-a-step-by-step-guide-to-marginalizing-over-discrete-parameters-for-ecologists-using-stan/:
data {
int<lower = 1> N;
int<lower = 1> K;
int<lower = 0, upper = 1> y[N, K];
int<lower =0, upper =1> max_y[N];
}
parameters {
real<lower = 0, upper = 1> p;
real<lower = 0, upper = 1> psi;
}
transformed parameters {
vector[N] log_lik;
for (i in 1:N) {
if (max_y[i] > 0) {
log_lik[i] = log(psi) + bernoulli_lpmf(y[i,] | p);
} else {
log_lik[i] = log_sum_exp(
log(psi) +bernoulli_lpmf(y[i,] | p),
log1m(psi)
);
}
}
}
model {
target += sum(log_lik);
target += uniform_lpdf(p | 0, 1);
target += uniform_lpdf(psi | 0, 1);
}
These models are well supported and described in various places (e.g., https://github.com/stan-dev/example-models/tree/master/BPA), but I have not seen examples that cover the common situation where there are missing y values and variation in the parameter p (e.g., matrix[N, K] p). I’m trying to figure out the easiest way to sum something like log(psi[i])+log(y[i,]|p[i, ]) while only grabbing the elements of y[i,] that were actually sampled. My understanding is that stan does not yet support ragged arrays or nested indexing like y[i, which_sampled[i]]. What I have tried is something like a):
for (i in 1:nsites) {
vector[2] lp;
if (ymax[i] > 0) {
lp[1]=log(psi[i])+sum(bernoulli_lpmf(y[i,]|p[i, ])*sampled[i, ]);
//here sampled[i,] is a vector of 1's/0's denoting whether sampling occurred or not
//if sampled [i, j]=0, idea is that log(y[i,j]|p[i, j])=0
lp[2]=0;
}else{
lp[1]=log(psi[i])+sum(bernoulli_lpmf(y[i,]|p[i,])*sampled[i, ]);
lp[2]=log1m(psi[i]);
}
target+=log_sum_exp(lp);
}
but my reading of the manual is that bernoulli_lpmf(y[i,1:K]|p[i,1:K]) produces a scalar? (i.e.,the implementation above is quite wrong).
My other very hamfisted approach has been to define y as a matrix (rather than an integer type) and use element-wise operators:
for (i in 1:nsites) {
vector[2] lp;
if (ymax[i] > 0) {
lp[1]=log(psi[i])+dot_product(log((y[i,].* p[i,])+(1-y[i,]).* (1-p[i,])), sampled[i,]);
//here sampled[i,] is a vector of 1's/0's denoting whether sampling occurred or not
//if sampled [i, j]=0, idea is that log(y[i,j]|p[i, j])=0
lp[2]=0;
}else{
lp[1]=log(psi[i])+dot_product(log((y[i,].* p[i,])+(1-y[i,]).* (1-p[i,])), sampled[i,]);
lp[2]=log1m(psi[i]);
}
target+=log_sum_exp(lp);
}
I think this is at least using the correct log-likelihood, but seems tremendously inefficient. I understand that something like the segment operator could work for y[i,]|p[i,], but trying preserve the existing column structure for the y matrix and varied covariate arrays for a few reasons. Is there some other obvious approach I’m missing?
Thanks,
John