# How to generate posterior predictive distribution for hierarchical occupancy model in Stan?

Hi, I am new to Bayesian modeling and Stan. I'm hoping to get some guidance on generating a posterior predictive distribution to compare to my observed y values of presences and absences.

I am using the model below, which was developed by Richard Erickson (StanOccupancyModelTutorials/TwoLevelOccupancyModels.Rmd at master · raerickson/StanOccupancyModelTutorials · GitHub).

I mostly understand the theory; however, the complexity of the likelihood is making it hard for me to write the generated quantities block.

data {
  int<lower=0> n_sampling_units;
  int<lower=0> y[n_sampling_units];
  int<lower=0> k[n_sampling_units];
}

parameters {
  real muPsi;
  real muP;
}

model {
  real log_muPsi;
  real log1m_muPsi;

  log_muPsi   = log_inv_logit(muPsi);
  log1m_muPsi = log1m_inv_logit(muPsi);

  // priors
  muP ~ normal(2, 0.5);
  muPsi ~ normal(0, 2);

  // likelihood
  for (r in 1:n_sampling_units) {
    if (y[r] > 0)
      target += log_muPsi + binomial_logit_lpmf(y[r] | k[r], muP);
    else
      target += log_sum_exp(log_muPsi + binomial_logit_lpmf(y[r] | k[r], muP),
                            log1m_muPsi);
  }
}
generated quantities {
  real<lower = 0, upper = 1> p;
  real<lower = 0, upper = 1> psi;

  p   = inv_logit(muP);
  psi = inv_logit(muPsi);
}



My generated quantities block so far is:

  vector[n_sampling_units] y_rep;
  real<lower = 0, upper = 1> p;
  real<lower = 0, upper = 1> psi;

  p   = inv_logit(muP);
  psi = inv_logit(muPsi);

  for (i in 1:n_sampling_units) {
    y_rep[i] = bernoulli_rng(psi * p);
  }


Thank you!


Check out this thread for an in-depth discussion: Posterior predictive distribution for occupancy models - #10 by betanalpha.


Thanks @mbjoseph!

This is very helpful, but I am hoping to also get some help with the actual code, as I am new to Stan (and to programming in general).

This is complicated stuff! Here are some resources that might help:

- Marginalized occupancy models: a how-to for JAGS and Stan (see especially the very last paragraph of the "The marginalized JAGS/BUGS model" section).
- Occupancy model likelihoods: advanced topics (see especially the whole thing!).

One major confusion here is that you need to decide two things:

1. What are you trying to posterior-predict? The latent occupancy state? Or the observed detection/nondetection data?
2. Do you want your predictions to condition only on the intercepts and covariates (if you have any), that is, only on the linear predictors? Or do you want your predictions to condition twice on the observations? I say "condition twice" because the linear predictors are estimated conditional on the data, but once we know the linear predictors we can still condition again on the data.

This is easiest to see when there is at least one detection at a site. Then, regardless of the linear predictor for occupancy logit(ψ), we know with certainty that the true occupancy state Z is one. So do we want to condition our posterior predictions on Z = 1, or on the value of logit(ψ)? It turns out that for posterior predictive checking, and for likelihood-based model comparison including LOO, it is very important to condition on ψ and NOT on Z = 1.

To see why intuitively, notice that at every site with a detection, we know with certainty that Z = 1. If we condition on Z = 1 at all of those sites, then we will universally get the best predictive performance by estimating the lowest possible intercept for logit(ψ): at every site without a detection, low values of logit(ψ) maximize the likelihood, and at every site with a detection, logit(ψ) doesn't matter if we condition on Z = 1. On the other hand, if we want the best possible posterior predictions for the latent occupancy state Z itself, we absolutely should condition on the data in this way. This is all dealt with a bit more neatly in the two resources linked above.
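To make the difference concrete, here's a quick numerical sketch (Python rather than Stan; the values of psi, p, and k are made up) of the posterior probability that a site with zero detections is occupied, conditioning on the data versus conditioning only on psi:

```python
# P(Z = 1 | y = 0) via Bayes' rule, compared with the unconditional P(Z = 1) = psi.
psi, p, k = 0.5, 0.3, 4          # made-up occupancy prob, detection prob, visits

ulo = psi * (1 - p) ** k         # unnormalized: occupied but never detected
uln = 1 - psi                    # unnormalized: not occupied
pz_given_data = ulo / (ulo + uln)

print(round(pz_given_data, 3))   # 0.194 -- much lower than psi = 0.5
```

Zero detections pull the conditional occupancy probability well below psi; conditioning only on psi would ignore that information.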

Edit: If we can pin down what exactly you want to predict, I’m happy to help you with the Stan code!


To answer your first question: I believe I am trying to posterior-predict the observed detection/nondetection data, so that I can get a sense of how well the model fits my data. For the second question, I agree that we should condition the posterior predictions on the value of logit(ψ).

Thanks for clarifying that! Also, I have two models at the moment: one with no detection or occupancy covariates, to simply get at Z at my sites (the model I shared in my original post), and one with covariates for both parameters that I hope will elucidate the effects of the covariates and also predict Z at my sites. Do you think this is a good structure, or should I try to combine these into one model?

Thanks so much for your help!

Best,

If you think that the covariates are relevant and useful, and the model that includes them doesn’t show any problems fitting, then I would abandon the model without covariates.

It sounds like you want to do two different flavors of posterior prediction:

1. Predict Z at your sites.
2. Generate posterior predictions to assess the quality of model fit.

The approach for (2) is actually easier than the approach for (1). For (2), we use the generated quantities block to sample Z conditional on ψ alone. So in the generated quantities, you'll want to compute p and psi just like you've done above. Then you'll want to do something like

  for (i in 1:n_sampling_units) {
    z_rep[i] = bernoulli_rng(psi[i]); // here we index psi by site to deal with covariates later on
    y_rep[i] = z_rep[i] * binomial_rng(k[i], p[i]);
  }
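If it helps build intuition, here is the same two-step simulation in Python (not Stan), with made-up values of psi, p, and k, checking that the fraction of simulated zero-detection sites matches the analytic marginal P(y = 0) = (1 - psi) + psi * (1 - p)^k:

```python
import random

random.seed(1)
psi, p, k = 0.4, 0.3, 5          # made-up occupancy prob, detection prob, visits
n_draws = 200_000

zeros = 0
for _ in range(n_draws):
    z = 1 if random.random() < psi else 0                       # z_rep ~ Bernoulli(psi)
    y = sum(random.random() < p for _ in range(k)) if z else 0  # y_rep | z_rep
    zeros += (y == 0)

analytic = (1 - psi) + psi * (1 - p) ** k
print(zeros / n_draws, analytic)  # the two should agree closely
```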


For (1), predicting Z at the sites you actually observed, you instead condition on the detection history as well:

  for (i in 1:n_sampling_units) {
    if (y[i] > 0) {
      z_rep[i] = 1; // any detection implies the site is occupied
    } else {
      real ulo = psi[i] * (1 - p[i])^k[i]; // unnormalized likelihood associated with occupancy
      real uln = (1 - psi[i]); // unnormalized likelihood associated with non-occupancy
      z_rep[i] = bernoulli_rng(ulo / (ulo + uln));
    }
  }


Thanks, Jacob! Some of the code you provided is definitely new to me.

I get an error at the first if statement (y[i] == 1) that says:

Binary infix operator == with functional interpretation logical_eq requires arguments of primitive type (int or real), found left type=int[ ], right arg type=int.


I think the issue is with how I first define the new generated quantities. I have:

generated quantities {
  real<lower = 0, upper = 1> p;
  real<lower = 0, upper = 1> psi;
  vector[n_sampling_units] y_rep;
  vector[n_sampling_units] z_rep;

  // detection and occupancy
  p   = inv_logit(muP);
  psi = inv_logit(muPsi);

  // predict Z at sites
  for (i in 1:n_sampling_units) {
    if (y[i] == 1) {
      z_rep[i] = 1;
    } else {
      real ulo = psi[i] * (1 - p[i])^k[i]; // unnormalized likelihood associated with occupancy
      real uln = (1 - psi[i]); // unnormalized likelihood associated with non-occupancy
      z_rep[i] = bernoulli_rng(ulo / (ulo + uln));
    }
  }
  // generating posterior predictive distribution
  for (i in 1:n_sampling_units) {
    z_rep[i] = bernoulli_rng(psi[i]); // here we index psi by site to deal with covariates later on
    y_rep[i] = z_rep[i] * binomial_rng(k[i], p[i]);
  }
}


A bit of clarification: n_sampling_units is my number of sites (n = 30), and n_surveys is my number of repeat surveys at each site. So I am wondering whether the declared dimensions for y_rep and z_rep are correct?

Additionally, I wonder if p and psi are correctly defined? I see that in your code you are indexing them by i. Does that require that they be vectors?

Thanks so much for your help!

So there are two things where we might be talking past each other.

First, I tucked this into a comment in my Stan snippets, but maybe I should have been more upfront about it: I am indexing psi and p by i because you said that you want to include covariates later on. I figured I'd go ahead and write the more general code that can deal with covariates.

Second, the original model that you posted doesn't have n_surveys, but it does have k, which I think plays the same role. Your model expects the data to be a single vector of length n_sampling_units giving the count of detections per site. But based on the error you're reporting, I wonder if you've switched to a model where you pass y as an array of detection/nondetection data, in which every element is a one or a zero corresponding to a detection or nondetection on a single visit. The generated quantities code would need modification to play well with a model formulated in this way.
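For example, if your raw data are a site-by-visit array of 0/1 detections, you can collapse it to the per-site counts that the original model expects (a Python sketch with made-up data; in R this would just be rowSums):

```python
# Site-by-visit detection histories (1 = detected on that visit).
det = [
    [0, 1, 0, 1],   # site 1: detected on 2 of 4 visits
    [0, 0, 0, 0],   # site 2: never detected
    [1, 1, 1, 0],   # site 3: detected on 3 of 4 visits
]

y = [sum(row) for row in det]   # detections per site -> y in the model
k = [len(row) for row in det]   # visits per site     -> k in the model
print(y, k)  # [2, 0, 3] [4, 4, 4]
```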


Thanks, Jacob!

I was sensing the same thing so I went back to your (amazing) tutorial about marginalizing in Stan and am now using that model code to try and fit my data. In your model, it appears you have two site-level (psi) covariates, and one survey-level (detection) covariate.

One quick question about the code you provided there:

transformed parameters {
  real psi[n_unit];
  real theta[n_unit, n_rep];
  for (i in 1:n_unit) {
    psi[i] = inv_logit(a0 + a1 * cov1[i] + a2 * cov2[i]);
    for (j in 1:n_rep) {
      theta[i,j] = inv_logit(b0 + b1 * cov1[i] + b2 * cov3[i,j]);
    }
  }
}


I understand what you are doing for psi, but for theta it appears you are also using your first occupancy covariate as one of the linear terms. I'm wondering why it is there, instead of theta[i,j] = inv_logit(b0 + b1 * cov3[i,j]);?

Thank you!


In this case I’m imagining that there’s a site-level covariate that influences both occupancy and detection :)


Thank you, Jacob.

Now that we are more or less on the same page and referring to the Stan model in your tutorial (Marginalized occupancy models: a how-to for JAGS and Stan), I wonder if you would please help me adapt your code for generating y_rep and z_rep?

At the moment, I think I'm having an indexing disagreement with det_data. I am unsure how to tell Stan to check the ith row and the jth column:

generated quantities {
  matrix[n_unit, n_rep] y_rep;
  vector[n_unit] z_rep;

  // predict Z at sites
  for (i in 1:n_unit) {
    if (det_data[i] == 1) {
      z_rep[i] = 1;
    } else {
      real ulo = psi[i] * (1 - p[i])^n_rep[i]; // unnormalized likelihood associated with occupancy
      real uln = (1 - psi[i]); // unnormalized likelihood associated with non-occupancy
      z_rep[i] = bernoulli_rng(ulo / (ulo + uln));
    }
  }
  // generating posterior predictive distribution
  for (i in 1:n_unit) {
    z_rep[i] = bernoulli_rng(psi[i]); // here we index psi by site to deal with covariates later on
    y_rep[i] = z_rep[i] * binomial_rng(n_rep[i], p[i]);
  }
}


Thanks and sorry if I’m boring you with my basic problems :)

Before you reply, I realize that p should be theta. Should det_data be Q?

Just want to give you a heads up that I likely won't have time to respond properly on this thread until sometime next week. But you are on the right track: to predict the posterior Z, you need to implement the branching logic based on whether or not there is at least one detection at a site, which in the tutorial I have called Q.

Just wanted to circle back. I have successfully created z_rep, and it seems to predict my actual site occupancy quite well. I have yet to figure out y_rep and am still struggling with indexing issues. Please let me know if you have any thoughts. Otherwise, I feel like z_rep might be sufficient?