# Calculating the likelihood for loo in a binomial mixture model

This is my first real attempt at using Stan (coming from a BUGS background). I am an ecologist trying to fit a binomial mixture model, running everything through `rstan`. I think I have the model running well, but I am trying to capture the pointwise likelihood for use in `loo`. [EDIT] Below is my code:

``````stan
// Binomial mixture model with covariates
data {
  int<lower=0> R;       // Number of transects
  int<lower=0> T;       // Number of temporal replications
  int<lower=0> y[R, T]; // Counts
  vector[R] elev;       // Covariate
  vector[R] elev2;      // Covariate
  vector[R] litter;     // Covariate
  matrix[R, T] RH;      // Covariate
  matrix[R, T] temp;    // Covariate
  matrix[R, T] temp2;   // Covariate
  vector[R] gcover;     // Covariate
  vector[R] gcover2;    // Covariate
  int<lower=0> K;       // Upper bound of population size
}

transformed data {
  int<lower=0> max_y[R];
  int<lower=0> N_ll;
  int foo[R];

  for (i in 1:R)
    max_y[i] = max(y[i]);

  for (i in 1:R)
    foo[i] = K - max_y[i] + 1;

  N_ll = sum(foo);
}

parameters {
  real alpha0;
  real alpha1;
  real alpha2;
  real alpha3;
  real beta0;
  real beta1;
  real beta2;
  real beta3;
  real beta4;
  real beta5;
}

transformed parameters {
  vector[R] log_lambda; // Log population size
  matrix[R, T] logit_p; // Logit detection probability

  log_lambda = alpha0 + alpha1 * elev + alpha2 * elev2 + alpha3 * litter;
  logit_p = beta0 + beta1 * RH + beta2 * temp + beta3 * temp2
            + rep_matrix(beta4 * gcover + beta5 * gcover2, T);
}

model {
  // Priors
  // Improper flat priors are implicitly used on all the
  // alpha and beta regression coefficients.

  // Likelihood: marginalize the latent abundance out by
  // summing over its feasible values max_y[i], ..., K
  for (i in 1:R) {
    vector[K - max_y[i] + 1] lp;

    for (j in 1:(K - max_y[i] + 1))
      lp[j] = poisson_log_lpmf(max_y[i] + j - 1 | log_lambda[i])
              + binomial_logit_lpmf(y[i] | max_y[i] + j - 1, logit_p[i]);
    target += log_sum_exp(lp);
  }
}

generated quantities {
  int<lower=0> N[R]; // Abundance
  int totalN;
  vector[R] log_lik;
  real mean_abundance;
  matrix[R, T] p;

  // Likelihood for use in loo-cv
  for (i in 1:R) {
    vector[K - max_y[i] + 1] ll;

    for (j in 1:(K - max_y[i] + 1)) {
      ll[j] = poisson_log_lpmf(max_y[i] + j - 1 | log_lambda[i])
              + binomial_logit_lpmf(y[i] | max_y[i] + j - 1, logit_p[i]);
    }
    log_lik[i] = log_sum_exp(ll);
  }

  for (i in 1:R) {
    N[i] = poisson_log_rng(log_lambda[i]);
    p[i, 1:T] = inv_logit(logit_p[i, 1:T]);
  }

  totalN = sum(N);  // Total pop. size across all sites
  mean_abundance = exp(alpha0);
}
``````

But I’m not sure how to capture the log-likelihood. I’m currently trying the following in the generated quantities block:

``````stan
generated quantities {
  for (i in 1:R) {
    vector[K - max_y[i] + 1] ll;

    for (j in 1:(K - max_y[i] + 1)) {
      ll[j] = poisson_log_lpmf(max_y[i] + j - 1 | log_lambda[i])
              + binomial_logit_lpmf(y[i] | max_y[i] + j - 1, logit_p[i]);
    }
    log_lik[i] = log_sum_exp(ll);
  }
}
``````

I’m not sure whether this captures what `loo` needs, or even whether it computes the likelihood properly. I’m also unsure whether I need `R` log-likelihood values, or the sum of the lengths of the `ll` vectors across all `R` sites (`N_ll`, I think)?

My other question about this model is whether the binomial part of `lp` is specified correctly. I am assuming it computes the probability for the whole series (a vector of length `T`) at each of the `R` sites, as opposed to needing another for loop over the `T` visits to each site (making it `y[i, t]` and `logit_p[i, t]`).

Thank you for any help while I try to learn these models along with Stan and the associated samplers. I can also post my whole script and data if needed, but they are both a mess at the moment.

Hi Daniel, could you include the whole Stan code? Without the data and parameters blocks there is too much guessing. I think I could figure out that `R` is the number of sites, and `T` is the number of visits? What is `K`? Do you want to predict for a new site, or for old sites but new visits?

I added the full Stan code as I currently have it. You are correct that `R` is the number of sites and `T` is the number of visits to each site. `K` is the maximum possible (or larger) population size at any site. The likelihood can then be calculated for each value from the minimum possible population size (`max_y`, the largest count on any single visit) to `K`, with the goal of estimating the true population size `N`, which lies somewhere between `max_y` and `K`.
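To make sure I understand that marginalization, here is a small Python sketch, separate from the Stan model and purely illustrative (the values of `y`, `lam`, `p`, and `K` below are made up), that computes one site's marginal log-likelihood by summing the latent abundance `N` out from `max(y)` to `K` with a log-sum-exp:

```python
import math

def log_sum_exp(xs):
    # numerically stable log(sum(exp(xs)))
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def poisson_logpmf(n, lam):
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def binom_logpmf(y, n, p):
    return (math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
            + y * math.log(p) + (n - y) * math.log1p(-p))

def site_log_lik(y, lam, p, K):
    # marginalize the true abundance N over its feasible range [max(y), K]:
    # log p(y) = log sum_N [ Poisson(N | lam) * prod_t Binomial(y_t | N, p) ]
    lp = [poisson_logpmf(n, lam)
          + sum(binom_logpmf(yt, n, p) for yt in y)
          for n in range(max(y), K + 1)]
    return log_sum_exp(lp)
```

If `K` is set large enough the result is insensitive to it; for example, `site_log_lik([3, 5, 4], lam=6.0, p=0.7, K=50)` barely changes when `K` is doubled.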

My goals are to 1) estimate the parameters in the regressions (done well currently, I think), 2) estimate `N[i]` and, to a lesser extent, `p[i, t]` (also done well, I think), 3) evaluate model fit (Bayesian p-value, calculation not included) and loo-cv, and 4) compare models with WAIC. Predicting to new sites is probably of some interest.

Thanks!

Excellent, it’s much clearer now.

loo-cv and WAIC estimate the same thing, so there is no need to compute both. PSIS-LOO in the loo package has better diagnostics than WAIC and is thus more reliable. In easy cases PSIS-LOO and WAIC give very similar results. In difficult cases WAIC is more likely to fail, and if the results differ, PSIS-LOO is either more accurate or gives a clear indication that both are failing. Thus computing WAIC does not give you any additional information.
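As a toy illustration that they estimate the same quantity, here is a Python sketch (illustrative only, not the loo package implementation) that computes elpd_WAIC and a plain importance-sampling elpd_LOO from the same S-by-n `log_lik` matrix; in an easy case the two agree closely. PSIS additionally smooths the importance weights, which is where the better diagnostics come from.

```python
import math

def _lse(xs):
    # numerically stable log(sum(exp(xs)))
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def elpd_waic(log_lik):
    # log_lik: S posterior draws x n observations (list of lists)
    S, n = len(log_lik), len(log_lik[0])
    total = 0.0
    for i in range(n):
        col = [log_lik[s][i] for s in range(S)]
        lppd_i = _lse(col) - math.log(S)  # log pointwise predictive density
        mean = sum(col) / S
        p_waic_i = sum((x - mean) ** 2 for x in col) / (S - 1)  # penalty term
        total += lppd_i - p_waic_i
    return total

def elpd_loo_is(log_lik):
    # plain importance-sampling LOO: 1/p(y_i | y_-i) ~= mean_s 1/p(y_i | theta_s)
    S, n = len(log_lik), len(log_lik[0])
    return sum(-(_lse([-log_lik[s][i] for s in range(S)]) - math.log(S))
               for i in range(n))
```

Running both on a `log_lik` matrix from an easy model (for instance a normal mean with a well-behaved posterior) gives estimates that differ by far less than their Monte Carlo error.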

Neither loo-cv nor WAIC measures some arbitrary notion of model fit. In the usual case they both estimate the predictive performance for a single observation at one of the current sites, i.e., how well you would predict the observation from the next visit to one of the current sites.

I now noticed that you have

``````stan
binomial_logit_lpmf(y[i] | max_y[i] + j - 1, logit_p[i]);
``````

which already sums over the visits.
In the model block

``````stan
target += log_sum_exp(lp);
``````

is matched in the generated quantities block by (although the variable name has changed from `lp` to `ll`)

``````stan
log_lik[i] = log_sum_exp(ll);
``````

As each `log_lik[i]` includes the likelihood contributions from all the observations for site `i`, using these `log_lik` values for LOO (or WAIC) corresponds to a leave-one-site-out predictive performance estimate.


Thanks! Leave-one-site-out predictive performance is definitely what I was primarily looking for. I’m glad to hear that I had it working correctly despite my recent and very brief introduction to Stan. I truly appreciate the amazing documentation, examples, and forums that you have all created.